Home > Enterprise >  Grouping by a column and assigning different aggregation function for a certain column that is used
Grouping by a column and assigning different aggregation function for a certain column that is used

Time:02-23

I have a df:

field           category            2022-01-10      2022-01-17      2022-01-24    2022-01-31 
A               Chair               50              100             200           150
B               Chair               75              15              57            75   
C               Chair               0.57            0.64            0.4           0.47
A               Table               50              100             200           150
B               Table               75              15              57            75   
C               Table               0.57            0.64            0.4           0.47
A               Mirror              50              100             200           150
B               Mirror              75              15              57            75   
C               Mirror              0.57            0.64            0.4           0.47

I am trying to group by field or simply drop the category column and aggregate the values for each week so that the result would look like this:

field       2022-01-10      2022-01-17      2022-01-24    2022-01-31 
A           50 50 50        100 100 100     200 200 200   150 150 150
B           75 75 75        15 15 15        57 57 57      75 75 75
C           np.mean([0.57, 0.57, 0.57])     # each weeks average of field C

For fields A & B each week I would sum up the values for that week in different categories, but for field C I want the average.

I tried: df.groupby(['field']).agg({'A':sum, 'B':sum, 'C'.np.mean}) but this of course does not work because the fields arethe thing that's being grouped by and not the column.

I believe this problem can be sorted by somehow dropping an category column and then transforming the values based on field column but I am not sure how that works.

CodePudding user response:

Create a dict of functions to apply then flat your dataframe and group by field and variable (dates) columns. Use the first index of groupby to get the right function. Finally reshape your dataframe.

aggfuncs = {'A': lambda x: np.sum(x),
            'B': lambda x: np.sum(x),
            'C': lambda x: np.mean(x)}

out = df.melt(['field', 'category']).groupby(['field', 'variable'])['value'] \
        .apply(lambda x: aggfuncs[x.name[0]](x)).unstack() \
        .rename_axis(columns=None).reset_index()
print(out)

# Output
  field  2022-01-10  2022-01-17  2022-01-24  2022-01-31
0     A      150.00      300.00       600.0      450.00
1     B      225.00       45.00       171.0      225.00
2     C        0.57        0.64         0.4        0.47

CodePudding user response:

{A, B} and {C} require different handling, so we can handle them separately

pd.concat([
    df[df['field'].isin(['A', 'B'])).groupby('field').sum(), #for AB
    df[df['field']=='C'].groupby('field').mean(),  #for C
])

CodePudding user response:

Here is one way to approach the problem:

s = df.set_index('field').select_dtypes(np.number)

out = s.drop('C').groupby(level=0).sum()
out.loc['C', :] = s.loc['C'].mean()

Result:

print(out)

       2022-01-10  2022-01-17  2022-01-24  2022-01-31
field                                                
A          150.00      300.00       600.0      450.00
B          225.00       45.00       171.0      225.00
C            0.57        0.64         0.4        0.47
  • Related