Home > Blockchain >  Collapsing rows based on condition in pandas
Collapsing rows based on condition in pandas

Time:02-03

I have a dataframe like so:

Input:

Restaurant     IP_Id  IP_Time                 S_Code
McDonalds      NaN    2021-03-28 03:01:05     4
McDonalds      NaN    2021-03-28 03:01:05     5
McDonalds      NaN    2021-03-28 03:01:05     6
McDonalds      101    2021-03-28 03:01:05     4
McDonalds      101    2021-03-28 03:01:05     5
Wendys         101    2021-03-28 03:01:05     4
Wendys         101    2021-03-28 03:01:05     4

I want to transform the dataframe to the following. For every instance where the IP_Id is the same for a specific restaurant at a specific time, I want to collapse the row and include a comma separated value for the S_Code.

Output:

Restaurant     IP_Id  IP_Time                 S_Code
McDonalds      NaN    2021-03-28 03:01:05     4
McDonalds      NaN    2021-03-28 03:01:05     5
McDonalds      NaN    2021-03-28 03:01:05     6
McDonalds      101    2021-03-28 03:01:05     4, 5
Wendys         101    2021-03-28 03:01:05     4

I've tried grouping by Restaurant, IP_Id, and IP_Time, but it moves the records with Nan as the IP_Id

CodePudding user response:

It's actually quite simple. You'll need a special condition to separate the NaNs into their own groups, which you can generate with a combination of shift and cumsum, and then call agg('unique'):

g = df.groupby([df['IP_Id'].ne(df['IP_Id'].shift(1)).cumsum(), df['Restaurant'], df['IP_Time']])
df = g['S_Code'].agg('unique').str.join(', ').reset_index().assign(IP_Id=g['IP_Id'].first().reset_index(drop=True))

Output:

>>> df
   IP_Id Restaurant             IP_Time S_Code
0    NaN  McDonalds 2021-03-28 03:01:05      4
1    NaN  McDonalds 2021-03-28 03:01:05      5
2    NaN  McDonalds 2021-03-28 03:01:05      6
3  101.0  McDonalds 2021-03-28 03:01:05   4, 5
4  101.0     Wendys 2021-03-28 03:01:05      4
  •  Tags:  
  • Related