I want to introduce the contents of one df to another but as a list based on ID. I know to merge based on ID but I do not want duplicate rows for ID in the new dataframe. How do I get this done?
data1 = {'ID': ['AB01','AB02'],
'Name': ["toyota", "honda"],
'Age':[21,22]
}
df1 = pd.DataFrame.from_dict(data1)
data2 = {'ID': ['AB01','AB01','AB03','AB03'],
'Type': ["C",np.nan,"X","S"],
'Score':[87,98,45,82]
}
df2 = pd.DataFrame.from_dict(data2)
The result should look like this
CodePudding user response:
Try merge:
print(df1.merge(df2, on='ID', how='left').groupby(['ID', 'Name', 'Age']).apply(lambda x: a.to_dict('records') if (a:=x[['ID']].join(x.iloc[:, 3:])).dropna().any().any() else []).reset_index(name='Info'))
Output:
ID Name Age Info
0 AB01 toyota 21 [{'ID': 'AB01', 'Type': 'C', 'Score': 87.0}, {...
1 AB02 honda 22 []
CodePudding user response:
You can make dict on the rows of df2 by .apply(), then group by ID and aggregate the dict of same ID into list by .groupby() .agg().
Then, merge with df1 with .merge() by left join with ID as matching keys, as follows:
df2_info = (df2.apply(dict, axis=1)
.groupby(df2['ID'])
.agg(list)
.reset_index(name='Info')
)
df_out = df1.merge(df2_info, on='ID', how='left')
Result
print(df_out)
ID Name Age Info
0 AB01 toyota 21 [{'ID': 'AB01', 'Type': 'C', 'Score': 87}, {'ID': 'AB01', 'Type': nan, 'Score': 98}]
1 AB02 honda 22 NaN
For reference only, interim result of df2_info:
ID Info
0 AB01 [{'ID': 'AB01', 'Type': 'C', 'Score': 87}, {'ID': 'AB01', 'Type': nan, 'Score': 98}]
1 AB03 [{'ID': 'AB03', 'Type': 'X', 'Score': 45}, {'ID': 'AB03', 'Type': 'S', 'Score': 82}]


