I am trying to summarize my data. Below you can see my example of data
data = {'id':['1','2','3','4','5'],
'name': ['Company1', 'Company1', 'Company3', 'Company4', 'Company5'],
'employee': [10.2, 3.5, 5, 1, 0],
'sales': [10, 10, 55, 1, 0],
'streets':['1a', '3b', 5, 1, 0],
}
df = pd.DataFrame(data, columns = ['id','name', 'employee','sales','streets'])
Now I want to summarize all numerical values from my dataset. I summarize with lines below:
df = df.groupby(['id','name',]).agg({'employee':sum,
'sales':sum,
}).reset_index()
But this is only a small example, usually, I have a dataset with around 10 numerical values to summarize so can anybody help me how to aggregate all of this data in a simple way?
I try with this code below
df = df.groupby(['id','name',]).agg({df.select_dtypes(include = ['float64', 'int64'])
}).reset_index()
but I received error unhashable type: 'DataFrame'
So can anybody help me how to solve this problem?
CodePudding user response:
GroupBy.sum shall aggregate all numeric columns (and only those) by default so
df.groupby(['id', 'name'], as_index=False).sum()
id name employee sales
0 1 Company1 10.2 10
1 2 Company1 3.5 10
2 3 Company3 5.0 55
3 4 Company4 1.0 1
4 5 Company5 0.0 0
as_index=False substitutes for the subsequent reset_index().
(Your current error is because you're trying to make a set (with {...}) out of a DataFrame but its unhashability makes it fail.)
CodePudding user response:
df.select_dtypes returns a subset of the df's columns based on the column dtypes, so you can groupby the relevant columns of df (in this example, "id" and "name" columns) and apply a function to this subset:
out = df.select_dtypes(include = ['float64', 'int64']).groupby([df['id'],df['name']]).sum()
Output:
employee sales
id name
1 Company1 10.2 10
2 Company1 3.5 10
3 Company3 5.0 55
4 Company4 1.0 1
5 Company5 0.0 0
CodePudding user response:
Maybe a bit late but if you want to use agg, you can also use:
numeric_cols = df.columns[df.dtypes.isin([float, int])]
out = df.groupby(['id', 'name'], as_index=False)[numeric_cols] \
.agg({'employee': sum, 'sales': sum})
Output:
>>> out
id name employee sales
0 1 Company1 10.2 10
1 2 Company1 3.5 10
2 3 Company3 5.0 55
3 4 Company4 1.0 1
4 5 Company5 0.0 0
>>> numeric_cols
Index(['employee', 'sales'], dtype='object')
