Pandas dataframe manipulation/re-sizing of a single-column count file

Time:01-28

I have a file that looks like this:

gRNA_A
gene_a
140626
gene_b
227598
gene_c
115781
gRNA_B
gene_a
125003
gene_b
102000
gene_c
200300

I want to read this into a pandas dataframe and re-shape it so that it looks like this:

        gene_a gene_b gene_c
gRNA_A  140626 227598 115781
gRNA_B  125003 102000 200300

Is this possible? If so, how?

Notes: the file will not always be this size, so the solution needs to be size-independent. The input file will be at most ~200 gRNAs x 20 genes. The gRNA names will always have the form gRNA_ followed by some letter combination, but the genes will not be named gene_ plus a letter combination; each will be the name of an actual gene (like GAPDH, ACTB, etc.).

CodePudding user response:

Not sure if this is the cleanest way, but this works for the given example.

I created a file data.txt with the provided sample.

I assumed the count is always a number.

import pandas as pd

def file_parser(f_path):
    data_dict = {}
    my_gRNA = None
    my_gene = None
    with open(f_path, "r") as f:
        for each in f:
            each = each.strip()               # strip first so blank lines become ""
            if not each:                      # skip blank lines
                continue
            if each.startswith("gRNA"):       # start a new gRNA group
                data_dict.setdefault(each, {})
                my_gRNA = each
            elif each.isnumeric():            # count for the most recent gene
                data_dict[my_gRNA][my_gene] = int(each)
            else:                             # anything else is a gene name
                my_gene = each
    return data_dict


df = pd.DataFrame.from_dict(file_parser("data.txt"), orient='index')
df.head()
        gene_a  gene_b  gene_c
gRNA_A  140626  227598  115781
gRNA_B  125003  102000  200300

Note: This answer is very similar to the one by mozway. The only difference is in the parser, where I explicitly check for numeric types.
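Both parsers trust the file to be well formed. If that assumption is a concern, a stricter variant could look like the sketch below. It is only an illustration: the function name parse_counts and the error messages are made up here, and it decides what a line is by its position within a group rather than by isnumeric(). It takes any iterable of lines (an open file works) and raises ValueError when a gene has no count or a count is not an integer.

```python
def parse_counts(lines):
    """Parse gRNA / gene / count lines into {gRNA: {gene: count}}.

    Hypothetical helper: assumes every gRNA header starts with "gRNA_"
    and that gene names and counts strictly alternate below it.
    """
    table = {}
    grna = None   # current gRNA header
    gene = None   # gene name still waiting for its count
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:                      # skip blank lines
            continue
        if line.startswith("gRNA_"):      # start a new group
            if gene is not None:
                raise ValueError(f"line {number}: gene {gene!r} has no count")
            grna = line
            table.setdefault(grna, {})
        elif grna is None:
            raise ValueError(f"line {number}: data before the first gRNA header")
        elif gene is None:                # first line of a pair = gene name
            gene = line
        else:                             # second line of a pair = count
            try:
                table[grna][gene] = int(line)
            except ValueError:
                raise ValueError(
                    f"line {number}: expected a count, got {line!r}"
                ) from None
            gene = None
    if gene is not None:
        raise ValueError(f"gene {gene!r} has no count")
    return table
```

The returned dict has the same shape as the one built above, so it can be passed to pd.DataFrame.from_dict(..., orient='index') in the same way.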

CodePudding user response:

You need to write a parser for your custom format, relying on the gRNA string to start a new group and then taking odd elements as key and even as value:

import pandas as pd

d = {}
current_gRNA = None
gene = None

with open('gRNA.txt') as f:
    for line in f:                    # iterate over lines
        line = line.strip()
        if not line:                  # skip blank lines
            continue
        if line.startswith('gRNA_'):  # start new group
            current_gRNA = line
            d[current_gRNA] = {}
        else:
            if gene:                  # even line of a group = data
                d[current_gRNA][gene] = int(line)
                gene = None
            else:                     # odd line of a group = gene name
                gene = line

df = pd.DataFrame.from_dict(d, orient='index')

output:

        gene_a  gene_b  gene_c
gRNA_A  140626  227598  115781
gRNA_B  125003  102000  200300
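For comparison, the same reshaping can be sketched without an explicit loop, using pandas only. This is a different technique from the parser above (forward-fill the gRNA header, then pivot), not a drop-in equivalent: the name reshape_counts is made up for illustration, and it assumes every gRNA block contains complete gene/count pairs. Note that pivot sorts rows and columns alphabetically, which happens to match the sample's order.

```python
import io

import pandas as pd

# Sample data copied from the question.
sample = """gRNA_A
gene_a
140626
gene_b
227598
gene_c
115781
gRNA_B
gene_a
125003
gene_b
102000
gene_c
200300
"""


def reshape_counts(lines):
    """Hypothetical helper: turn gRNA / gene / count lines into a wide table."""
    # One string per non-blank line.
    s = pd.Series([ln.strip() for ln in lines if ln.strip()])
    is_grna = s.str.startswith("gRNA_")
    # Label every row with the most recent gRNA header above it.
    grna = s.where(is_grna).ffill()
    body = s[~is_grna]           # gene, count, gene, count, ...
    body_grna = grna[~is_grna]
    long = pd.DataFrame({
        "gRNA": body_grna.iloc[0::2].to_numpy(),
        "gene": body.iloc[0::2].to_numpy(),
        "count": body.iloc[1::2].astype(int).to_numpy(),
    })
    # Long to wide: one row per gRNA, one column per gene.
    return long.pivot(index="gRNA", columns="gene", values="count")


df = reshape_counts(io.StringIO(sample))
print(df)
```

With a real file, reshape_counts(open('gRNA.txt')) should behave the same way, since an open file is also an iterable of lines.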