Pandas dataframe manipulation/re-sizing of a single-column count file

I have a file that looks like this:

gRNA_A
gene_a
140626
gene_b
227598
gene_c
115781
gRNA_B
gene_a
125003
gene_b
102000
gene_c
200300

I want to read this into a pandas dataframe and re-shape it so that it looks like this:

        gene_a gene_b gene_c
gRNA_A  140626 227598 115781
gRNA_B  125003 102000 200300

Is this possible? If so, how?

Notes: the file will not always be this size, so the solution needs to be size-independent. The input file will be at most about 200 gRNAs x 20 genes. The gRNA lines will always look like gRNA_somelettercombo, but the gene lines will not be named gene_lettercombo; each will be the name of an actual gene (like GAPDH, ACTB, etc.).

CodePudding user response：

Not sure if this is the cleanest way, but this works for the given example.

I created a file data.txt containing the provided sample.

I assumed each count is always a whole number.

import pandas as pd

def file_parser(f_path):
    data_dict = {}
    my_gRNA = None
    my_gene = None
    with open(f_path, "r") as f:
        for each in f:
            each = each.strip()            # strip first, so blank lines become ""
            if not each:                   # skip blank lines
                continue
            if each.startswith("gRNA"):    # start (or resume) a gRNA group
                if each not in data_dict:
                    data_dict[each] = {}
                my_gRNA = each
            elif each.isnumeric():         # count for the most recent gene
                data_dict[my_gRNA][my_gene] = int(each)
            else:                          # anything else is a gene name
                my_gene = each
    return data_dict


df = pd.DataFrame.from_dict(file_parser("data.txt"), orient='index')

df.head()
        gene_a  gene_b  gene_c
gRNA_A  140626  227598  115781
gRNA_B  125003  102000  200300

Note: This answer is very similar to the one by mozway. The only difference is in the parser, where I explicitly check whether each line is numeric.
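
For what it's worth, the numeric check relies on str.isnumeric(), which is only True when every character in the string is numeric. A quick sketch of what that means for this parser (the sample strings below are just illustrations, not taken from a real file):

```python
# str.isnumeric() is True only if the string is non-empty
# and every character in it is a numeric character.
samples = ["140626", "GAPDH", "1.5", "-3", ""]
checks = {s: s.isnumeric() for s in samples}

for text, is_num in checks.items():
    print(repr(text), is_num)
```

So a decimal or negative count would not be recognised as a number and would be treated as a gene name instead. That is fine as long as the counts are plain whole numbers, which is the assumption stated above.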

CodePudding user response：

You need to write a parser for your custom format, relying on the gRNA string to start a new group and then, within each group, taking the odd lines as keys (gene names) and the even lines as values (counts):

import pandas as pd

d = {}
current_gRNA = None
gene = None

with open('gRNA.txt') as f:
    for line in f:                    # iterate over lines
        line = line.strip()
        if not line:                  # skip blank lines
            continue
        if line.startswith('gRNA_'):  # start new group
            current_gRNA = line
            d[current_gRNA] = {}
        else:
            if gene:                  # even line of a group = data
                d[current_gRNA][gene] = int(line)
                gene = None
            else:                     # odd line of a group = gene name
                gene = line

df = pd.DataFrame.from_dict(d, orient='index')

output:

        gene_a  gene_b  gene_c
gRNA_A  140626  227598  115781
gRNA_B  125003  102000  200300
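
If it helps to try this without a file on disk, here is a minimal self-contained sketch of the same line-by-line idea, wrapped in a function that accepts any iterable of lines. The function name parse_counts and the sample counts are made up for illustration; GAPDH and ACTB are just the example gene names mentioned in the question.

```python
import io

import pandas as pd


def parse_counts(lines):
    """Build {gRNA: {gene: count}} from alternating gene/count lines.

    Hypothetical helper: assumes every gRNA line starts with 'gRNA_'
    and every gene line is followed by exactly one integer count line.
    """
    counts = {}
    current_grna = None
    gene = None
    for line in lines:
        line = line.strip()
        if not line:                      # skip blank lines
            continue
        if line.startswith("gRNA_"):      # start a new group
            current_grna = line
            counts[current_grna] = {}
            gene = None
        elif gene is None:                # odd line of a group = gene name
            gene = line
        else:                             # even line of a group = count
            counts[current_grna][gene] = int(line)
            gene = None
    return counts


# Illustrative input only: the counts here are invented.
sample = io.StringIO(
    "gRNA_A\nGAPDH\n10\nACTB\n20\n"
    "gRNA_B\nGAPDH\n30\nACTB\n40\n"
)
df = pd.DataFrame.from_dict(parse_counts(sample), orient="index")
print(df)
```

Because the function only iterates over lines, an open file handle can be passed in place of the StringIO object, and nothing in it depends on how many gRNAs or genes there are.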