Apply Same Aggregation on Multiple Columns when Using Groupby (python)-CodePudding

When I want to apply the same function to multiple columns, I have to write the name of the columns and map them to the same function one by one. This can become tedious when the number of columns are a big number. In the code below I map 3 column to the same function("first").


user_id = [12, 12, 13, 13, 13]
category = ["furniture", "furniture", "electronics","electronics","electronics"]
name = ["Casey", "Casey", "Alice", "Alice", "Alice"]
payment_amount = [96, 109, 56, 0, 90]
example_df = pd.DataFrame({"user_id" : user_id, "category" : category, "name" : name, "payment_amount": payment_amount})

expected_output = example_df.groupby("user_id").agg({"user_id" : "first", "category" : "first", "name" : "first", "payment_amount": sum})

Instead, I want to do something like this and get the same output:

expected_output = example_df.groupby("user_id").agg({["user_id" , "category" , "name"]: "first", "payment_amount": sum})

But this throws an error. How can this be done?

CodePudding user response：

You can generate dict:

d = {**{"payment_amount": 'sum'}, 
     **dict.fromkeys(["user_id" , "category" , "name"], 'first')}

print (d)
{'payment_amount': 'sum', 'user_id': 'first', 'category': 'first', 'name': 'first'}

expected_output = example_df.groupby("user_id").agg(d)

More general solution should be:

d = dict.fromkeys(example_df.columns, 'first')
d['payment_amount'] = 'sum'
print (d)
{'user_id': 'first', 'category': 'first', 'name': 'first', 'payment_amount': 'sum'}

expected_output = example_df.groupby("user_id").agg(d)

CodePudding user response：

You can use a dictionary comprehension with an explicit list of columns for sum and first as default:

expected_output = (
 example_df.groupby('user_id')
           .agg({c: 'mean' if c in ('payment_amount') else 'first'
                for c in example_df})
)

output:

         user_id     category   name  payment_amount
user_id                                             
12            12    furniture  Casey      102.500000
13            13  electronics  Alice       48.666667

I'd however suggest to also use dtypes for selecting the columns:

expected_output = (
 example_df.groupby('user_id')
           .agg({k: 'mean' if v in ('int64', 'float64') else 'first'
                 for k,v in example_df.dtypes[1:].items()})
)