What would be the correct way to return n random max values from a groupby?
I have a dataframe containing audio events, with the following columns:
- audio
- start_time
- end_time
- duration
- labelling confidence (1 to 5)
- label ("Ambulance", "Engine", ...)
I have multiple events/rows for each label and I have 26 labels in total.
What I would like to achieve is to get one event per label with max confidence.
Let's say we have 7 events that have label "Ambulance" and they have the following labelling confidence: 2, 5, 5, 4, 4, 3, 5.
The max confidence is 5 in this case, which gives us 3 selectable events. I would like to get one of the three at random.
With pandas, df.groupby("label").max() gives me the max labelling confidence per label (the maximum is computed column by column), but not a randomly chosen row with that confidence. I would like the selection among the tied rows to be random.
Many thanks in advance
Cheers
Antoine
CodePudding user response:
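Edit: following a comment from the OP, the simplest solution is to shuffle the data frame before picking the max rows. A stable sort keeps the shuffled order among rows that tie on confidence, so the first row per label is a random one of the max rows: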
(
    df.sample(frac=1)  # shuffle all rows
    .sort_values(['label', 'confidence'], ascending=False, kind='stable')  # highest confidence first within each label
    .groupby('label')
    .head(1)  # keep the first row of each label
)
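For anyone who wants to try this on toy data, here is a minimal self-contained sketch of the shuffle-then-sort idea. The column names "label" and "confidence", the sample values, and the helper name pick_random_max are assumptions made for illustration only. Note that the pandas documentation describes kind as applying to single-column sorts, so treating tie order as preserved in a two-column sort is also an assumption worth checking on your pandas version.

```python
import pandas as pd

# Toy data: column names and values are assumed for illustration.
df = pd.DataFrame({
    "label": ["Ambulance"] * 7 + ["Engine"] * 3,
    "confidence": [2, 5, 5, 4, 4, 3, 5, 1, 3, 3],
})


def pick_random_max(frame, seed=None):
    """Return one row per label, chosen among the rows with max confidence.

    Hypothetical helper: shuffles the rows, sorts by label and confidence
    (descending), then keeps the first row of each label.
    """
    shuffled = frame.sample(frac=1, random_state=seed)
    ordered = shuffled.sort_values(
        ["label", "confidence"], ascending=False, kind="stable"
    )
    return ordered.groupby("label").head(1)


picked = pick_random_max(df, seed=0)
print(picked)
```

Passing a seed makes the pick reproducible; leaving it as None gives a different draw on each call.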
CodePudding user response:
Here is how I finally managed to do it:
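# Shuffle the rows so that ties on confidence end up in random order
shuffled_df = df.sample(frac=1)
# idxmax gives the first index label with max confidence per label (assumes unique index labels)
filtered_df = shuffled_df.loc[shuffled_df.groupby("label")["confidence"].idxmax()]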
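A possible alternative, not the approach used above, is to make the tie handling explicit: keep only the rows whose confidence equals the maximum of their label, then draw one row per label. This is only a sketch, assuming the columns are named "label" and "confidence" and that your pandas version provides groupby(...).sample (added in pandas 1.1); the toy data is made up.

```python
import pandas as pd

# Toy data: column names and values are assumed for illustration.
df = pd.DataFrame({
    "label": ["Ambulance"] * 7 + ["Engine"] * 3,
    "confidence": [2, 5, 5, 4, 4, 3, 5, 1, 3, 3],
})

# Boolean mask: True where a row's confidence equals the max of its label.
is_max = df["confidence"] == df.groupby("label")["confidence"].transform("max")

# Draw one row at random per label among the max-confidence rows.
# random_state is fixed here only to make the example reproducible.
picked = df[is_max].groupby("label").sample(n=1, random_state=0)
print(picked)
```

This version does not depend on the order of tied rows or on the index being unique, at the cost of one extra pass to compute the per-label maximum.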
