My pandas dataframe has a column(series) with comma separated string values looking like this:
col_1,col_2,col_3
abc,123,49
bcd,234,"39,48"
I want to convert this col_3 into a list of integers. I tried
df["col_3"]=[[i] for i in df["col_3"]]
But I end up getting this result
col_1,col_2,col_3
abc,123,['49']
bcd,234,"['39,48']"
The desired output is
col_1,col_2,col_3
abc,123,[49]
bcd,234,[39,48]
I also want it to be fast because I would be using this for batches of 100k rows. Can some one suggest a solution . TIA
col_1 col_2 col_3
0 F18 9994 495
1 C35 9995 390,483
3 F60 9998 497,468
CodePudding user response:
You can use pandas.Series.str.split with a list comprehension :
df['col_3'] = [[int(e) for e in x.split(",")] for x in df['col_3']]
# Output :
print(df)
col_1 col_2 col_3
0 abc 123 [49]
1 bcd 234 [39, 48]
print(type(df.loc[0, 'col_3'][0]))
int
# Edit :
If you need to save your df to a .csv, there will be always double quotes between all the lists of integers (with a length > 1) since the elements of a list is separated by a comma and the default separator of pandas.DataFrame.to_csv is the comma as well. Double quotes are there to escape the commas inside these lists. So, to get rid of the double quotes, you have to choose another separator like the tab for example :
df.to_csv(r'path_to_your_new_csv', sep='\t')
CodePudding user response:
I am piggybacking on abokey's answer because they are already on the right track. Please let me know my dummy dataframe is not in your required format.
import pandas as pd
df = pd.DataFrame({"col1":["F18", "C35"], "col2":[9994, 9995], "col3":[43, '"39,48"']})
Using str.split would seperate your string elements into a list of substrings, based on your delimiter. he benefir of the appraoch is the whole string is put in a list, when the delimiter cannot be found as in df.loc[0, "col3"]. Afterwards, we apply list comprehension to change dtype of each element from str to int since it is your preferred format.
df["col3"] = df["col3"].apply(lambda x: [int(i) for i in str(x).strip('"').split(",")])
Edit: I updated the values based on your screenshot. use tqdm to see the progress when you are running on 100k rows.
!pip install tqdm
from tqdm import tqdm
tqdm.pandas()
df["col3"] = df["col3"].progress_apply(lambda x: [int(i) for i in str(x).strip('"').split(",")])
I can recommend pandarallel to parallel process through rows but it does not always work in my experience.
!pip install pandarallel
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True, nb_workers=4)
df["col3"] = df["col3"].parallel_apply(lambda x: [int(i) for i in str(x).strip('"').split(",")])

