I'm fairly new to Pandas, and I therefore don't know if there is a smart an efficient way of doing the following --- e.g., some merge-like operation that I don't know. I hope some of you can help.
Say I have two dataframes, where each row contains 1) an ID unique to that dataframe, and 2) a list of items held by that ID. E.g.:
df1 = pd.DataFrame(data={'id': ['ID1','ID2'], 'items': [['a','b'],['a','b']]})
df2 = pd.DataFrame(data={'id': ['ID1','ID3'], 'items': [['a','c'],['a','d']]})
I would like to combine these dataframes to get the combined information of df1 and df2, so obtain:
df3 = pd.DataFrame(data={'id': ['ID1','ID2','ID3'], 'items': [['a','b','c'],['a','b'],['a','d']]})
Hence, df3 is df1 but 1) where ID1 holds also the item c from df2 2) has been appended with the row containing ID3.
How may I go about that?
CodePudding user response:
First idea is flatten list before outer merge by DataFrame.explode, then reshape by DataFrame.melt with remove duplicates and missing values, last aggregate lists:
df = (df1.explode('items').merge(df2.explode('items'), on="id", how="outer")
.melt('id')
.drop_duplicates(['id','value'])
.dropna(subset=['value'])
.groupby('id')['value'].agg(list)
.reset_index(name='items')
)
print (df)
id items
0 ID1 [a, b, c]
1 ID2 [a, b]
2 ID3 [a, d]
Or is possible Series.explode after merge:
df = (df1.merge(df2, on="id", how="outer")
.set_index('id')
.stack()
.explode()
.groupby(level=0)
.agg(lambda x: list(set(x)))
.reset_index(name='items'))
print (df)
id items
0 ID1 [c, b, a]
1 ID2 [b, a]
2 ID3 [d, a]
CodePudding user response:
Probably not the most beautiful solution, but maybe it helps:
df_final = df1.merge(df2, on="id", how="outer")
df_final["items_x"] = df_final["items_x"].apply(lambda x: x if isinstance(x, list) else [])
df_final["items_y"] = df_final["items_y"].apply(lambda x: x if isinstance(x, list) else [])
df_final["items"] = (df_final["items_x"] df_final["items_y"]).apply(lambda x: list(set(x)))
df_final = df_final.drop(["items_x", "items_y"], axis=1)
df_final
Output:
id items
0 ID1 [c, a, b]
1 ID2 [a, b]
2 ID3 [d, a]
