I need help with combining language columns into one row, and then drop duplicate columns, just combine two different language of the same Movie, year, and Id.
There are more similar columns in the CSV, so please help me figure out a way to combine those.Here is the existing csv:
f = pd.DataFrame({'Movie': ['name1','name1','name2','name3','name4','name4'],
'Year': ['1905', '1905','1906','1907','1910','1910'],
'Id': ['tt0283985', 'tt0283985','tt0284043','tt3402904','tt3458360','tt3458360'],
'language':['Mandarian','Cantonese','Mandarian','unknown','Cantonese','Mandarian']})
Where f now looks like:
Movie Year Id language
0 name1 1905 tt0283985 Mandarian
1 name1 1905 tt0283985 Cantonese
2 name2 1906 tt0284043 Mandarian
3 name3 1907 tt3402904 unknown
4 name4 1910 tt3458360 Cantonese
5 name4 1910 tt3458360 Mandarian
And the result should be like this:
Movie Year Id language
0 name1 1905 tt0283985 Mandarian,Cantonese
1 name2 1906 tt0284043 Mandarian
2 name3 1907 tt3402904 unknown
3 name4 1910 tt3458360 Cantonese,Mandarian
So row 1 and 2 are identical except language, we just need to combine them into one row, so does row 5 and 6. Here's my try:
ff = new_f.groupby(by = ['Movie','Year','Id']).agg(','.join)
ff.to_csv("File.csv", index=False)
But the output is weird:
All other columns disappeared and language is the only column left:
language
Mandarian,Cantonese
Mandarian
unknown
Cantonese,Mandarian
...
CodePudding user response:
By default, groupby sets the grouping keys as index, and you explicitly asked to_csv not to export the index,
Use as_index=False in groupby:
ff = f.groupby(by=['Movie','Year','Id'], as_index=False).agg(','.join)
ff.to_csv("File.csv", index=False)
Or, export the index in to_csv:
ff = f.groupby(by=['Movie','Year','Id']).agg(','.join)
ff.to_csv("File.csv")
NB. if the intermediate is not useful to you, you do not need to set the ff variable, you can directly chain f.groupby(...).agg(...).to_csv(...)
CodePudding user response:
Use the following command: ff = f.groupby(['Movie','Year','Id']).agg(','.join).reset_index()
It should work.
