Home > Enterprise >  pandas, "strip" remove whitespace when writing to file
pandas, "strip" remove whitespace when writing to file

Time:01-29

I have a large pandas dataframe with this structure:

    target                    query        evalue                                          GO
A0A089QRB9 sp|A0A089QRB9|MSL3_MYCTU  0.000000e 00                                     GO:1902321
A0A0C5DM37 sp|A0A089QRB9|MSL3_MYCTU  4.900000e-05                                     GO:0009820
A0A1B1FHP3 sp|A0A089QRB9|MSL3_MYCTU  1.000000e-04 GO:0005515\tGO:0005829\tGO:0009821\tGO:0042803
    A4YGN2 sp|A0A089QRB9|MSL3_MYCTU  2.500000e-06                                     GO:0043958
    B2HIL7 sp|A0A089QRB9|MSL3_MYCTU 2.000000e-257             GO:0008610\tGO:0034081\tGO:0071766

and I want to write this as a tab separated file, without escaping the last column (which is itself tab separated). I can do a split with expand=True to the "GO" column, and that will, of course, generate a dataframe with many empty values, as most entries in the "GO" column have few values. The problem is that when I write this to the file, I get a lot of "padding" with several tabs and no content, and this dramatically increases the size of the files.

For example, the first line will be encoded as:

A0A089QRB9\tsp|A0A089QRB9|MSL3_MYCTU\t0.000000e 00\tGO:1902321\t\t\t

(notice the trailing tabs)

Is there a way to write this in a way that "rstrips" the lines that are ultimately written to the file? At the moment I am joining all the columns into a single "to_write" column, and I only write that one column to the file.

df['to_write'] = df.apply(lambda x: '\t'.join(x.astype(str).values), axis=1)
df['to_write'].to_csv('my.tsv', index=False, header=False)

Note: Please note that I can't simply call to_csv on the original dataframe, as I can't have the tab separation of the last column escaped. The final result must be a tab separated file with a variable number of columns.

CodePudding user response:

You can use numpy's savetxt:

import pandas as pd
import numpy as np

data = {
 'target': ['A0A089QRB9', 'A0A0C5DM37', 'A0A1B1FHP3', 'A4YGN2', 'B2HIL7'],
 'query': ['sp|A0A089QRB9|MSL3_MYCTU',
  'sp|A0A089QRB9|MSL3_MYCTU',
  'sp|A0A089QRB9|MSL3_MYCTU',
  'sp|A0A089QRB9|MSL3_MYCTU',
  'sp|A0A089QRB9|MSL3_MYCTU'],
 'evalue': [0.0, 4.9e-05, 0.0001, 2.5e-06, 2e-257],
 'GO': ['GO:1902321',
  'GO:0009820',
  'GO:0005515\tGO:0005829\tGO:0009821\tGO:0042803',
  'GO:0043958',
  'GO:0008610\tGO:0034081\tGO:0071766']
}
df = pd.DataFrame(data)

np.savetxt('my.tsv', df.to_numpy(), delimiter='\t', fmt='%s')

Output of sed 'y/\t/■/' my.tsv:

A0A089QRB9■sp|A0A089QRB9|MSL3_MYCTU■0.0■GO:1902321
A0A0C5DM37■sp|A0A089QRB9|MSL3_MYCTU■4.9e-05■GO:0009820
A0A1B1FHP3■sp|A0A089QRB9|MSL3_MYCTU■0.0001■GO:0005515■GO:0005829■GO:0009821■GO:0042803
A4YGN2■sp|A0A089QRB9|MSL3_MYCTU■2.5e-06■GO:0043958
B2HIL7■sp|A0A089QRB9|MSL3_MYCTU■2e-257■GO:0008610■GO:0034081■GO:0071766
  •  Tags:  
  • Related