This is the second part of this question.
Suppose I have a dataframe df and I want to select x1 and x100, corresponding to the largest amount, grouped by group_id. If there are multiple rows with the largest amount, then I want to select medians of x1 and x100.
df = pd.DataFrame({'group_id' : [1,1,1,2,2,3,3,3,3],
'amount' : [2,4,5,1,2,3,5,5,5],
'x1':[2,5,8,3,6,9,3,1,0],
'x100':[1,2,3,4,8,9,9,4,5]})
group_id amount x1 x100
0 1 2 2 1
1 1 4 5 2
2 1 5 8 3
3 2 1 3 4
4 2 2 6 8
5 3 3 9 9
6 3 5 3 9
7 3 5 1 4
8 3 5 0 5
So the desired output looks like this:
median_x1 median_x100
group_id
1 8.0 3.0
2 6.0 8.0
3 1.0 5.0
For only 2 columns (x1 and x100), I can simply add 1 line to @AndrejKesely solution to the previous question, something like this:
out = df.groupby("group_id").apply(
lambda x: pd.Series(
{"median_x1": (d := x.loc[x["amount"] == x["amount"].max()])['x1'].median(),
"median_x100": d["x100"].median()}
)
)
How to do this in a neat way, which will work for 100 columns, i.e., x1, x2 up to x100? Ideally, I do not want to copypaste one line 100 times and manually changing name of a column in an editor...
CodePudding user response:
Maybe something like this?
df.groupby('group_id').apply(
lambda x: x[x['amount'] == x['amount'].max()
].drop(columns=['amount', 'group_id']).median())
You can also use names of columns instead of .drop():
df.groupby('group_id').apply(
lambda x: x.loc[x['amount'] == x['amount'].max(), ['x1', 'x100']].median())
