Home > Enterprise >  Pandas: Aggregating and transposing dataframe based on string
Pandas: Aggregating and transposing dataframe based on string

Time:01-29

I have a dataframe that tracks the changes of an object which is identified by an id. Instead of each row representing a change of state, I want one row for each object and all of the changes tracked in columns instead.

import pandas as pd
import numpy as np
df1=pd.DataFrame({'ID':['1','2','3','1','2','1','4'], 'Original_Status':['Admitted','Admitted','Admitted','Probation','LateAdmission','Admitted','Admitted'],'New_Status':['Probation','LateAdmission','Pass','Admitted','Pass','Pass','Fail']})

df2=pd.DataFrame({'ID':['1','2','3','4'],'Original_Status_1':['Admitted','Admitted','Admitted','Admitted'],'New_Status_1':['Probation','LateAdmission','Pass','Fail'],'Original_Status_2':['Probation','LateAdmission',np.nan,np.nan],'New_Status_2':['Admitted','Pass',np.nan,np.nan],'Original_Status_3':['Admitted',np.nan,np.nan,np.nan],'New_Status_3':['Pass',np.nan,np.nan,np.nan],})`

    ID  Original_Status     New_Status
0   1   Admitted            Probation
1   2   Admitted            LateAdmission
2   3   Admitted            Pass
3   1   Probation           Admitted
4   2   LateAdmission       Pass
5   1   Admitted            Pass
6   4   Admitted            Fail

Original Dataframe

Change to:

    ID  Original_Status_1  New_Status_1  Original_Status_2   New_Status_2  Original_Status_3  New_Status_3
0   1   Admitted           Probation     Probation           Admitted      Admitted           Pass
1   2   Admitted           LateAdmission LateAdmission       Pass          NaN                NaN
2   3   Admitted           Pass          NaN                 NaN           NaN                NaN
3   4   Admitted           Fail          NaN                 NaN           NaN                NaN

New Dataframe

I was able to achieve this outcome using a loops, but I'd prefer a more succint solution if possible.

CodePudding user response:

This adds a columns to df1 to count the occurrence of the 'ID', then uses pd.pivot to make the wide df with a multi-index columns. The steps after the pivot are to flatten the column names and to order them correctly

df1['occurrence'] = df1.groupby('ID').cumcount()
df2 = df1.pivot(
    index='ID',
    values=['Original_Status','New_Status'],
    columns='occurrence',
)

df2.columns = [s '_' str(o 1) for s,o in df2.columns]
c_order = sorted(df2.columns, key = lambda c: c[-1]) #re-order the columns
df2 = df2[c_order]
df2
  •  Tags:  
  • Related