I have a large pandas dataframe with this structure:
target query evalue GO
A0A089QRB9 sp|A0A089QRB9|MSL3_MYCTU 0.000000e 00 GO:1902321
A0A0C5DM37 sp|A0A089QRB9|MSL3_MYCTU 4.900000e-05 GO:0009820
A0A1B1FHP3 sp|A0A089QRB9|MSL3_MYCTU 1.000000e-04 GO:0005515\tGO:0005829\tGO:0009821\tGO:0042803
A4YGN2 sp|A0A089QRB9|MSL3_MYCTU 2.500000e-06 GO:0043958
B2HIL7 sp|A0A089QRB9|MSL3_MYCTU 2.000000e-257 GO:0008610\tGO:0034081\tGO:0071766
and I want to write this as a tab separated file, without escaping the last column (which is itself tab separated). I can do a split with expand=True to the "GO" column, and that will, of course, generate a dataframe with many empty values, as most entries in the "GO" column have few values. The problem is that when I write this to the file, I get a lot of "padding" with several tabs and no content, and this dramatically increases the size of the files.
For example, the first line will be encoded as:
A0A089QRB9\tsp|A0A089QRB9|MSL3_MYCTU\t0.000000e 00\tGO:1902321\t\t\t
(notice the trailing tabs)
Is there a way to write this in a way that "rstrips" the lines that are ultimately written to the file? At the moment I am joining all the columns into a single "to_write" column, and I only write that one column to the file.
df['to_write'] = df.apply(lambda x: '\t'.join(x.astype(str).values), axis=1)
df['to_write'].to_csv('my.tsv', index=False, header=False)
Note:
Please note that I can't simply call to_csv on the original dataframe, as I can't have the tab separation of the last column escaped. The final result must be a tab separated file with a variable number of columns.
CodePudding user response:
You can use numpy's savetxt:
import pandas as pd
import numpy as np
data = {
'target': ['A0A089QRB9', 'A0A0C5DM37', 'A0A1B1FHP3', 'A4YGN2', 'B2HIL7'],
'query': ['sp|A0A089QRB9|MSL3_MYCTU',
'sp|A0A089QRB9|MSL3_MYCTU',
'sp|A0A089QRB9|MSL3_MYCTU',
'sp|A0A089QRB9|MSL3_MYCTU',
'sp|A0A089QRB9|MSL3_MYCTU'],
'evalue': [0.0, 4.9e-05, 0.0001, 2.5e-06, 2e-257],
'GO': ['GO:1902321',
'GO:0009820',
'GO:0005515\tGO:0005829\tGO:0009821\tGO:0042803',
'GO:0043958',
'GO:0008610\tGO:0034081\tGO:0071766']
}
df = pd.DataFrame(data)
np.savetxt('my.tsv', df.to_numpy(), delimiter='\t', fmt='%s')
Output of sed 'y/\t/■/' my.tsv:
A0A089QRB9■sp|A0A089QRB9|MSL3_MYCTU■0.0■GO:1902321
A0A0C5DM37■sp|A0A089QRB9|MSL3_MYCTU■4.9e-05■GO:0009820
A0A1B1FHP3■sp|A0A089QRB9|MSL3_MYCTU■0.0001■GO:0005515■GO:0005829■GO:0009821■GO:0042803
A4YGN2■sp|A0A089QRB9|MSL3_MYCTU■2.5e-06■GO:0043958
B2HIL7■sp|A0A089QRB9|MSL3_MYCTU■2e-257■GO:0008610■GO:0034081■GO:0071766
