I am trying to reformat a DataFrame into a single line item per categorical group, but my fixed format needs to retain all elements of data associated to the category as new columns.
for example I have a DataFrame:
dta = {'day':['A','A','A','A','B','C','C','C','C','C'],
'param1':[100,200,2,3,7,23,43,98,1,0],
'param2':[1,20,65,3,67,2,3,98,654,5]}
df = pd.DataFrame(dta)
I need to be able to transform/reformat the DataFrame where the data is grouped by the 'day' column (e.g. one row per day) but then has columns generated dynamically according to how many entries are within each category.
For example category C in the 'day' column has 5 entries, meaning for 'day' C you would have 5 param1 values and 5 param2 values.
The associated values for days A and B would be populated with NaN or empty where they do not have entries.
e.g.
dta2 = {'day':['A','B','C'],
'param1_1':[100,7,23],
'param1_2':[200,np.nan,43],
'param1_3':[2,np.nan,98],
'param1_4':[3,np.nan,1],
'param1_5':[np.nan,np.nan,0],
'param2_1':[1,67,2],
'param2_2':[20,np.nan,3],
'param2_3':[65,np.nan,98],
'param2_4':[3,np.nan,654],
'param2_5':[np.nan,np.nan,5]
}
df2 = pd.DataFrame(dta2)
Unfortunately this is a predefined format that I have to maintain.
I am aiming to use Pandas as efficiently as possible to minimise deconstructing and reassembling the DataFrame.
CodePudding user response:
You first need to melt, then add a helper columns to cumcount the labels per group and pivot:
df2 = (
df.melt(id_vars='day')
.assign(group=lambda d: d.groupby(['day', 'variable']).cumcount().add(1).astype(str))
.pivot(index='day', columns=['variable', 'group'], values='value')
)
df2.columns = df2.columns.map('_'.join)
df2 = df2.reset_index()
output:
day param1_1 param1_2 param1_3 param1_4 param1_5 param2_1 param2_2 param2_3 param2_4 param2_5
0 A 100.0 200.0 2.0 3.0 NaN 1.0 20.0 65.0 3.0 NaN
1 B 7.0 NaN NaN NaN NaN 67.0 NaN NaN NaN NaN
2 C 23.0 43.0 98.0 1.0 0.0 2.0 3.0 98.0 654.0 5.0
