Home > Net >  Aggregating data with automatic selection
Aggregating data with automatic selection

Time:02-08

I am trying to summarize my data. Below you can see my example of data

data = {'id':['1','2','3','4','5'],
        'name': ['Company1', 'Company1', 'Company3', 'Company4', 'Company5'], 
        'employee': [10.2, 3.5, 5, 1, 0], 
        'sales': [10, 10, 55, 1, 0], 
        'streets':['1a', '3b', 5, 1, 0], 
       }
df = pd.DataFrame(data, columns = ['id','name', 'employee','sales','streets'])

Now I want to summarize all numerical values from my dataset. I summarize with lines below:

df = df.groupby(['id','name',]).agg({'employee':sum,
                                    'sales':sum,
                                    }).reset_index()   

But this is only a small example, usually, I have a dataset with around 10 numerical values to summarize so can anybody help me how to aggregate all of this data in a simple way?

I try with this code below

df = df.groupby(['id','name',]).agg({df.select_dtypes(include = ['float64', 'int64'])
                                    }).reset_index()   

but I received error unhashable type: 'DataFrame'

So can anybody help me how to solve this problem?

CodePudding user response:

GroupBy.sum shall aggregate all numeric columns (and only those) by default so

df.groupby(['id', 'name'], as_index=False).sum()

  id      name  employee  sales
0  1  Company1      10.2     10
1  2  Company1       3.5     10
2  3  Company3       5.0     55
3  4  Company4       1.0      1
4  5  Company5       0.0      0

as_index=False substitutes for the subsequent reset_index().

(Your current error is because you're trying to make a set (with {...}) out of a DataFrame but its unhashability makes it fail.)

CodePudding user response:

df.select_dtypes returns a subset of the df's columns based on the column dtypes, so you can groupby the relevant columns of df (in this example, "id" and "name" columns) and apply a function to this subset:

out = df.select_dtypes(include = ['float64', 'int64']).groupby([df['id'],df['name']]).sum()

Output:

             employee  sales
id name                     
1  Company1      10.2     10
2  Company1       3.5     10
3  Company3       5.0     55
4  Company4       1.0      1
5  Company5       0.0      0

CodePudding user response:

Maybe a bit late but if you want to use agg, you can also use:

numeric_cols = df.columns[df.dtypes.isin([float, int])]
out = df.groupby(['id', 'name'], as_index=False)[numeric_cols] \
        .agg({'employee': sum, 'sales': sum})

Output:

>>> out
  id      name  employee  sales
0  1  Company1      10.2     10
1  2  Company1       3.5     10
2  3  Company3       5.0     55
3  4  Company4       1.0      1
4  5  Company5       0.0      0

>>> numeric_cols
Index(['employee', 'sales'], dtype='object')
  •  Tags:  
  • Related