I have a file that looks like this:
gRNA_A
gene_a
140626
gene_b
227598
gene_c
115781
gRNA_B
gene_a
125003
gene_b
102000
gene_c
200300
I want to read this into a pandas dataframe and re-shape it so that it looks like this:
        gene_a  gene_b  gene_c
gRNA_A  140626  227598  115781
gRNA_B  125003  102000  200300
Is this possible? If so, how?
Notes: the file will not always be this size, so the solution needs to be size-independent. The input will be at most about 200 gRNAs x 20 genes. The gRNA lines will always be gRNA_ followed by some combination of letters, but the gene lines will not follow a gene_ plus letters pattern; they will be the names of actual genes (like GAPDH, ACTB, etc.).
CodePudding user response:
Not sure if this is the cleanest way, but this works for the given example.
I created a file data.txt with the provided sample.
I assumed the count is always a non-negative integer.
import pandas as pd

def file_parser(f_path):
    data_dict = {}
    my_gRNA = None
    my_gene = None
    with open(f_path, "r") as f:
        for each in f:
            # strip first, so that blank lines become empty strings and are skipped
            each = each.strip()
            if not each:
                continue
            if each.startswith("gRNA"):
                # a gRNA line starts a new group
                if each not in data_dict:
                    data_dict[each] = {}
                my_gRNA = each
            elif each.isnumeric():
                # a count belongs to the most recently seen gene
                data_dict[my_gRNA][my_gene] = int(each)
            else:
                # anything else is a gene name
                my_gene = each
    return data_dict

df = pd.DataFrame.from_dict(file_parser("data.txt"), orient='index')
df.head()
output:
        gene_a  gene_b  gene_c
gRNA_A  140626  227598  115781
gRNA_B  125003  102000  200300
Note: This answer is very similar to the one by mozway. The only difference is in the parser, where I explicitly check whether a line is numeric.
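To make the numeric check concrete, here is a minimal sketch of how lines get classified. The helper name classify is hypothetical and not part of the answer's code. It assumes counts are written with digits only, since str.isnumeric() returns False for strings such as "1.5" or "-3", which would then be treated as gene names.

```python
def classify(line):
    """Label a stripped line as 'gRNA', 'count', or 'gene' (hypothetical helper)."""
    if line.startswith("gRNA"):
        return "gRNA"
    if line.isnumeric():
        # digits only: a decimal point or a minus sign makes this False
        return "count"
    return "gene"

# Real gene names such as GAPDH fall through to 'gene'.
labels = {s: classify(s) for s in ["gRNA_A", "GAPDH", "140626", "1.5", "-3"]}
print(labels)
```

If the counts could ever be decimals or negative, the check would need to change, for example by trying float(line) and catching ValueError.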
CodePudding user response:
You need to write a parser for your custom format, relying on the gRNA string to start a new group and then treating the odd lines within each group as keys (gene names) and the even lines as values (counts):
import pandas as pd

d = {}
current_gRNA = None
gene = None
with open('gRNA.txt') as f:
    for line in f:  # iterate over lines
        line = line.strip()
        if not line:  # skip blank lines
            continue
        if line.startswith('gRNA_'):  # start new group
            current_gRNA = line
            d[current_gRNA] = {}
        else:
            if gene:  # even line of a group = data
                d[current_gRNA][gene] = int(line)
                gene = None
            else:  # odd line of a group = gene name
                gene = line
df = pd.DataFrame.from_dict(d, orient='index')
output:
        gene_a  gene_b  gene_c
gRNA_A  140626  227598  115781
gRNA_B  125003  102000  200300
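Because the odd/even approach depends on every gene name being followed by exactly one count, a malformed file could silently misalign keys and values. Here is a rough sketch of a stricter variant that raises an error instead. The function name parse_grna and the sample data are made up for illustration, and it reads from any iterable of lines so it can be tried without a file on disk.

```python
import io

def parse_grna(handle):
    """Parse gRNA / gene / count lines into {gRNA: {gene: count}} (hypothetical helper)."""
    table = {}
    current = None  # gRNA group currently being filled
    gene = None     # gene name still waiting for its count
    for lineno, raw in enumerate(handle, start=1):
        line = raw.strip()
        if not line:  # skip blank lines
            continue
        if line.startswith("gRNA_"):
            if gene is not None:
                raise ValueError(f"line {lineno}: gene {gene!r} has no count")
            current = line
            table[current] = {}
        elif current is None:
            raise ValueError(f"line {lineno}: data before the first gRNA line")
        elif gene is None:
            gene = line  # odd line of a group = gene name
        else:
            table[current][gene] = int(line)  # int() raises ValueError on non-integers
            gene = None
    if gene is not None:
        raise ValueError(f"end of input: gene {gene!r} has no count")
    return table

# Made-up sample using real gene names, as described in the question.
sample = io.StringIO(
    "gRNA_A\nGAPDH\n140626\nACTB\n227598\n"
    "gRNA_B\nGAPDH\n125003\nACTB\n102000\n"
)
parsed = parse_grna(sample)
print(parsed)
```

The resulting dictionary can be passed to pd.DataFrame.from_dict(parsed, orient='index') in the same way as d above.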
