I have a dataframe that tracks the changes of an object which is identified by an id. Instead of each row representing a change of state, I want one row for each object and all of the changes tracked in columns instead.
import pandas as pd
import numpy as np
df1=pd.DataFrame({'ID':['1','2','3','1','2','1','4'], 'Original_Status':['Admitted','Admitted','Admitted','Probation','LateAdmission','Admitted','Admitted'],'New_Status':['Probation','LateAdmission','Pass','Admitted','Pass','Pass','Fail']})
df2=pd.DataFrame({'ID':['1','2','3','4'],'Original_Status_1':['Admitted','Admitted','Admitted','Admitted'],'New_Status_1':['Probation','LateAdmission','Pass','Fail'],'Original_Status_2':['Probation','LateAdmission',np.nan,np.nan],'New_Status_2':['Admitted','Pass',np.nan,np.nan],'Original_Status_3':['Admitted',np.nan,np.nan,np.nan],'New_Status_3':['Pass',np.nan,np.nan,np.nan],})`
ID Original_Status New_Status
0 1 Admitted Probation
1 2 Admitted LateAdmission
2 3 Admitted Pass
3 1 Probation Admitted
4 2 LateAdmission Pass
5 1 Admitted Pass
6 4 Admitted Fail
Change to:
ID Original_Status_1 New_Status_1 Original_Status_2 New_Status_2 Original_Status_3 New_Status_3
0 1 Admitted Probation Probation Admitted Admitted Pass
1 2 Admitted LateAdmission LateAdmission Pass NaN NaN
2 3 Admitted Pass NaN NaN NaN NaN
3 4 Admitted Fail NaN NaN NaN NaN
I was able to achieve this outcome using a loops, but I'd prefer a more succint solution if possible.
CodePudding user response:
This adds a columns to df1 to count the occurrence of the 'ID', then uses pd.pivot to make the wide df with a multi-index columns. The steps after the pivot are to flatten the column names and to order them correctly
df1['occurrence'] = df1.groupby('ID').cumcount()
df2 = df1.pivot(
index='ID',
values=['Original_Status','New_Status'],
columns='occurrence',
)
df2.columns = [s '_' str(o 1) for s,o in df2.columns]
c_order = sorted(df2.columns, key = lambda c: c[-1]) #re-order the columns
df2 = df2[c_order]
df2
