When I want to apply the same function to multiple columns, I have to write the name of the columns and map them to the same function one by one. This can become tedious when the number of columns are a big number. In the code below I map 3 column to the same function("first").
user_id = [12, 12, 13, 13, 13]
category = ["furniture", "furniture", "electronics","electronics","electronics"]
name = ["Casey", "Casey", "Alice", "Alice", "Alice"]
payment_amount = [96, 109, 56, 0, 90]
example_df = pd.DataFrame({"user_id" : user_id, "category" : category, "name" : name, "payment_amount": payment_amount})
expected_output = example_df.groupby("user_id").agg({"user_id" : "first", "category" : "first", "name" : "first", "payment_amount": sum})
Instead, I want to do something like this and get the same output:
expected_output = example_df.groupby("user_id").agg({["user_id" , "category" , "name"]: "first", "payment_amount": sum})
But this throws an error. How can this be done?
CodePudding user response:
You can generate dict:
d = {**{"payment_amount": 'sum'},
**dict.fromkeys(["user_id" , "category" , "name"], 'first')}
print (d)
{'payment_amount': 'sum', 'user_id': 'first', 'category': 'first', 'name': 'first'}
expected_output = example_df.groupby("user_id").agg(d)
More general solution should be:
d = dict.fromkeys(example_df.columns, 'first')
d['payment_amount'] = 'sum'
print (d)
{'user_id': 'first', 'category': 'first', 'name': 'first', 'payment_amount': 'sum'}
expected_output = example_df.groupby("user_id").agg(d)
CodePudding user response:
You can use a dictionary comprehension with an explicit list of columns for sum and first as default:
expected_output = (
example_df.groupby('user_id')
.agg({c: 'mean' if c in ('payment_amount') else 'first'
for c in example_df})
)
output:
user_id category name payment_amount
user_id
12 12 furniture Casey 102.500000
13 13 electronics Alice 48.666667
I'd however suggest to also use dtypes for selecting the columns:
expected_output = (
example_df.groupby('user_id')
.agg({k: 'mean' if v in ('int64', 'float64') else 'first'
for k,v in example_df.dtypes[1:].items()})
)
