Merge/union Python Pandas DataFrames with column of unique row IDs and column of lists to DataFrame-CodePudding

I'm fairly new to Pandas, and I therefore don't know if there is a smart an efficient way of doing the following --- e.g., some merge-like operation that I don't know. I hope some of you can help.

Say I have two dataframes, where each row contains 1) an ID unique to that dataframe, and 2) a list of items held by that ID. E.g.:

df1 = pd.DataFrame(data={'id': ['ID1','ID2'], 'items': [['a','b'],['a','b']]}) 
df2 = pd.DataFrame(data={'id': ['ID1','ID3'], 'items': [['a','c'],['a','d']]})

I would like to combine these dataframes to get the combined information of df1 and df2, so obtain:

df3 = pd.DataFrame(data={'id': ['ID1','ID2','ID3'], 'items': [['a','b','c'],['a','b'],['a','d']]})

Hence, df3 is df1 but 1) where ID1 holds also the item c from df2 2) has been appended with the row containing ID3.

How may I go about that?

CodePudding user response：

First idea is flatten list before outer merge by DataFrame.explode, then reshape by DataFrame.melt with remove duplicates and missing values, last aggregate lists:

df =  (df1.explode('items').merge(df2.explode('items'), on="id", how="outer")
           .melt('id')
           .drop_duplicates(['id','value'])
           .dropna(subset=['value'])
           .groupby('id')['value'].agg(list)
           .reset_index(name='items')
          )
print (df)
    id      items
0  ID1  [a, b, c]
1  ID2     [a, b]
2  ID3     [a, d]

Or is possible Series.explode after merge:

df =  (df1.merge(df2, on="id", how="outer")
          .set_index('id')
          .stack()
          .explode()
          .groupby(level=0)
          .agg(lambda x: list(set(x)))
          .reset_index(name='items'))
print (df)
    id      items
0  ID1  [c, b, a]
1  ID2     [b, a]
2  ID3     [d, a]

CodePudding user response：

Probably not the most beautiful solution, but maybe it helps:

df_final = df1.merge(df2, on="id", how="outer")

df_final["items_x"] = df_final["items_x"].apply(lambda x: x if isinstance(x, list) else [])
df_final["items_y"] = df_final["items_y"].apply(lambda x: x if isinstance(x, list) else [])

df_final["items"] = (df_final["items_x"]   df_final["items_y"]).apply(lambda x: list(set(x)))

df_final = df_final.drop(["items_x", "items_y"], axis=1)

df_final

Output:

    id      items
0   ID1     [c, a, b]
1   ID2     [a, b]
2   ID3     [d, a]