One-to-Many Merge on 2 Columns Pandas-CodePudding

I am trying to consolidate 2 columns to prevent duplicates. From this question, I am able to do it for one column like so:

df = df.groupby(['Date', 'Ticket ID', 'Score', 'many other Columns...'])['Work Order'].apply(', '.join).reset_index()

Which gives me a nice output of:

Date                        Ticket ID           Work Order
2018-08-30 22:52:25         1444008             119846184
2021-09-29 13:33:49         1724734             122445397, 122441551

Now, I have at least one other column that I would like to do the same too, while keeping the Work Orders correct. However, repeating the process/code for a different column either seems to drop the original or not aggregate the columns like I want. I want:

Date                        Ticket ID           Work Order Num             Work Order ID
2018-08-30 22:52:25         1444008             119846184                  A6zH
2021-09-29 13:33:49         1724734             122445397, 122441551       H43a1, JU8a

I have tried a couple different things with the reset_index and merge functions but cannot seem to get it to do what I want. This seems like it should be pretty simple - what am I missing? It is very possible that I will have to do this for other columns as well, so I would like a repeatable solution.

# tried some df1 and df2 stuff - along with different locations of reset index.
df = df.groupby(['Reported Date', 'Site','Ticket ID', 'TicketUID', 'Work Order UID', \
                         'cols...'])['Work Order'].apply(', '.join)
        
df = df.groupby(['Reported Date', 'Site','Ticket ID', 'TicketUID', 'Work Order', \
                             'cols...'])['Work Order UID'].apply(', '.join)
        
# tried on='Ticket ID' here as well.
#df = pd.merge(df1, df2)

CodePudding user response：

apply works differently on DataFrameGroupBy (multiple grouped columns) and on SeriesGroupBy (single grouped column). For single column it works on elements, for multiple columns it applies function to each column.

So one way to do what you want to do is simply do apply twice - f.e.:

data = {'id':[3763058, 3763058, 3763058, 3763077, 3763077, 3763078], 
        'id2':[3763056, 3763056, 3763056, 3763074, 3763074, 3763073], 
        'item1' : ['item1', 'item2', 'item3', 'item4', 'item5', 'item6'],
        'item2' : ['itemA', 'itemB', 'itemC', 'itemD', 'itemE', 'itemF']}

df = pd.DataFrame(data)

df.groupby(['id', 'id2'])[['item1', 'item2']]\
        .apply(lambda x: x.apply(', '.join))
        .reset_index()

Which gives us result:

    id      id2     item1               item2
0   3763058 3763056 item1, item2, item3 itemA, itemB, itemC
1   3763077 3763074 item4, item5        itemD, itemE
2   3763078 3763073 item6               itemF