Given a table like this:
| A | B | C |
|---|---|---|
| t | r | 1 |
| t | r | 1 |
| n | j | 2 |
| n | j | 2 |
| n | j | 2 |
I would like to group on A and B and keep, from each group, only the number of rows specified by C.
The desired output would be:
| A | B | C |
|---|---|---|
| t | r | 1 |
| n | j | 2 |
| n | j | 2 |
I tried the following, but with no luck:
df.groupby(['A', 'B']).agg(lambda x: x.head(df.C))
CodePudding user response:
You can use groupby.cumcount and boolean indexing:
out = df[df['C'].gt(df.groupby(['A', 'B']).cumcount())]
Or with a classical groupby.apply, taking the row count from the first C value of each group:
(df.groupby(['A', 'B'], sort=False, as_index=False, group_keys=False)
   .apply(lambda g: g.head(g['C'].iloc[0]))
)
Output:
   A  B  C
0  t  r  1
2  n  j  2
3  n  j  2
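As a minimal sketch, the per-group head logic can also be written as an explicit loop over the groups. This is a swapped-in variant rather than the answer's own code; it assumes C is constant within each (A, B) group, and the names `parts` and `out_loop` are made up for illustration. Iterating over the groups avoids depending on how groupby.apply handles the grouping columns, which has changed across pandas versions.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'A': ['t', 't', 'n', 'n', 'n'],
    'B': ['r', 'r', 'j', 'j', 'j'],
    'C': [1, 1, 2, 2, 2],
})

# For each (A, B) group, keep the first C rows.
# Assumption: C holds a single value per group, so its first value is used.
parts = [
    g.head(int(g['C'].iloc[0]))
    for _, g in df.groupby(['A', 'B'], sort=False)
]

# Stitch the trimmed groups back together; original index labels are kept.
out_loop = pd.concat(parts)
print(out_loop)
```

With sort=False the groups come back in order of first appearance, so the rows keep the order of the original frame.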
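Intermediates for the groupby.cumcount approach:
|   | A | B | C | cumcount | C > cumcount |
|---|---|---|---|---|---|
| 0 | t | r | 1 | 0 | True |
| 1 | t | r | 1 | 1 | False |
| 2 | n | j | 2 | 0 | True |
| 3 | n | j | 2 | 1 | True |
| 4 | n | j | 2 | 2 | False |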
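The intermediates can be reproduced with a short, self-contained sketch. The variable names `pos`, `keep`, and `steps` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'A': ['t', 't', 'n', 'n', 'n'],
    'B': ['r', 'r', 'j', 'j', 'j'],
    'C': [1, 1, 2, 2, 2],
})

# Zero-based position of each row within its (A, B) group.
pos = df.groupby(['A', 'B']).cumcount()

# A row is kept while its position is still below the C value on that row.
keep = df['C'].gt(pos)

# Side-by-side view of the intermediate columns.
steps = df.assign(cumcount=pos, keep=keep)
print(steps)

# Final result via boolean indexing.
out = df[keep]
print(out)
```

Because the comparison is done row by row and fully vectorized, no Python-level function is called per group, which tends to matter on larger frames.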
CodePudding user response:
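The following is based on GroupBy.nth and reproduces the expected rows for this sample. Note that nth(df['C']) treats the values of C as a list of positions (here 1 and 2) to pick from every group of the reversed frame, not as a per-group row count, so it may not generalize to other data:
df[::-1].groupby(['A', 'B'], as_index=False).nth(df['C'])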
Output:
   A  B  C
3  n  j  2
2  n  j  2
0  t  r  1
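One caveat worth checking before relying on this: as far as I can tell, nth receives the C values as a list of positions that is applied to every group, not as a row count per group. The sketch below uses a hypothetical variant of the sample (`df2`, where the t/r group has three rows) and compares the result with the cumcount approach. It assumes a pandas version in which nth accepts a list-like of integers passed as a Series.

```python
import pandas as pd

# Hypothetical variant of the sample: (t, r) now has three rows, C is still 1.
df2 = pd.DataFrame({
    'A': ['t', 't', 't', 'n', 'n', 'n'],
    'B': ['r', 'r', 'r', 'j', 'j', 'j'],
    'C': [1, 1, 1, 2, 2, 2],
})

# nth-based approach: selects positions 1 and 2 within every group
# of the reversed frame, regardless of that group's own C value.
via_nth = df2[::-1].groupby(['A', 'B'], as_index=False).nth(df2['C'])

# cumcount-based approach: compares each row's position with its own C.
via_cumcount = df2[df2['C'].gt(df2.groupby(['A', 'B']).cumcount())]

print(len(via_nth), len(via_cumcount))
```

On this variant the two approaches disagree for the t/r group, which suggests the nth version matches the question's sample only because of how that particular data is shaped.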
