I have a csv file that looks like this:
CCC1=NNC(CC)=C1CC(=O)N[C@H]1C[C@@](C)(O)C1 Z3670357625 M
COC(=O)C1(CNC(=O)NCC2CC2(C)C)CCCC1 Z1366118392 M
CC1(C)CN(CCF)CCN1S(=O)(=O)C1=CNC=N1 Z3847735140 S
CC1=CC(N2CCC(C#N)(C3CCC3)CC2)=NC=C1C#N Z4174841496 M
CN(CC1=NN=C2C=CC=CN12)C(=O)C1=CC=CC=C1Br Z979185880 S
CC1=CC(C(NCC(C)N(C)C2CC2)C2CC2)=CC=C1F Z1586839055 S
CN1N=NC2=CC(CSCC3=CC=C(Br)C=C3)=CC=C21 Z1449657345 S
CCC(C)(C)C(=O)N(C)C1=CC=CC(F)=C1OC Z4888674994 S
CCC(NC1CC2CCCCN2C1=O)C1=NOC(C)=C1 Z4196015222 M
COCC1(CO)CCN(CC2=CC=CC(C(=O)O)=C2Cl)C1 Z4251400790 M
O=C(C[C@H]1CCNC1)N1CCC(OC2=CC=C(Cl)C=C2)C1 Z4062618130 S
O=C(C[C@@H]1CCNC1)N1CCC(OC2=CC=C(Cl)C=C2)C1 Z4062618465 S
CC1(C)CN(C2CCN(C(=O)OC3COC3)C2)C1 Z4000118072 M
CCC(CC)C(C)C(=O)OCC1=CC(=O)NC=C1 Z2861190319 S
COC1=CC(Br)=C(CN[C@@H]2C[C@@H]3C[C@H](O)[C@H]2C3)C=C1OC |&1:9,13,&2:11,15| Z3808914614 S
The last line contains characters within ||, which leaves it with four columns instead of the three that the other rows have. How do I delete just |&1:9,13,&2:11,15|? I saw this post: Deleting part of a string pandas DataFrame but I need to delete just this selection, not the entire row starting from this point. I feel this would be easier than trying to just retain the three columns I need throughout the csv, but am open to other ideas. I will need to do this for a number of lines in a very large file. Thanks!
CodePudding user response:
Assuming that your content is in a csv file called csvfile.csv try the following:
- Code to replace the offending string
#read input file
fin = open("csvfile.csv", "rt")
#read file contents to string
data = fin.read()
#replace all occurrences of the required string
data = data.replace('|&1:9,13,&2:11,15|', '')
#close the input file
fin.close()
#open the input file in write mode
fin = open("csvfile.csv", "wt")
#overrite the input file with the resulting data
fin.write(data)
#close the file
fin.close()
after this you should be able to do:
- Code to simply read the csv
import pandas as pd
df = pd.read_csv("csvfile.csv")
CodePudding user response:
A much faster way is to skip python altogether. From your terminal:
sed 's/|&1:9,13,&2:11,15|//g' path_to_csv > path_to_modified_csv
This will remove exclusively the undesired part, e.g. (last line):
COC1=CC(Br)=C(CN[C@@H]2C[C@@H]3C[C@H](O)[C@H]2C3)C=C1OC Z3808914614 S
