Home > database >  Pandas DataFrame, group by column into single line items but extend columns by number of occurrences
Pandas DataFrame, group by column into single line items but extend columns by number of occurrences

Time:01-25

I am trying to reformat a DataFrame into a single line item per categorical group, but my fixed format needs to retain all elements of data associated to the category as new columns.

for example I have a DataFrame:

dta = {'day':['A','A','A','A','B','C','C','C','C','C'],
       'param1':[100,200,2,3,7,23,43,98,1,0], 
       'param2':[1,20,65,3,67,2,3,98,654,5]}

df = pd.DataFrame(dta)

I need to be able to transform/reformat the DataFrame where the data is grouped by the 'day' column (e.g. one row per day) but then has columns generated dynamically according to how many entries are within each category.

For example category C in the 'day' column has 5 entries, meaning for 'day' C you would have 5 param1 values and 5 param2 values.

The associated values for days A and B would be populated with NaN or empty where they do not have entries.

e.g.

dta2 = {'day':['A','B','C'],
   'param1_1':[100,7,23],
   'param1_2':[200,np.nan,43],
   'param1_3':[2,np.nan,98], 
   'param1_4':[3,np.nan,1],
   'param1_5':[np.nan,np.nan,0], 
   'param2_1':[1,67,2], 
   'param2_2':[20,np.nan,3],
   'param2_3':[65,np.nan,98], 
   'param2_4':[3,np.nan,654],
   'param2_5':[np.nan,np.nan,5]
  
  }

df2 = pd.DataFrame(dta2)

Unfortunately this is a predefined format that I have to maintain.

I am aiming to use Pandas as efficiently as possible to minimise deconstructing and reassembling the DataFrame.

CodePudding user response:

You first need to melt, then add a helper columns to cumcount the labels per group and pivot:

df2 = (
 df.melt(id_vars='day')
   .assign(group=lambda d: d.groupby(['day', 'variable']).cumcount().add(1).astype(str))
   .pivot(index='day', columns=['variable', 'group'], values='value')
)

df2.columns = df2.columns.map('_'.join)
df2 = df2.reset_index()

output:

  day  param1_1  param1_2  param1_3  param1_4  param1_5  param2_1  param2_2  param2_3  param2_4  param2_5
0   A     100.0     200.0       2.0       3.0       NaN       1.0      20.0      65.0       3.0       NaN
1   B       7.0       NaN       NaN       NaN       NaN      67.0       NaN       NaN       NaN       NaN
2   C      23.0      43.0      98.0       1.0       0.0       2.0       3.0      98.0     654.0       5.0
  •  Tags:  
  • Related