Sample random row from df.groupby("column1")["column2"].max() and not first one if mul

Time:01-29

What would be the correct way to return n random max values from a groupby?

I have a dataframe containing audio events, with the following columns:

  • audio
  • start_time
  • end_time
  • duration
  • labelling confidence (1 to 5)
  • label ("Ambulance", "Engine", ...)

I have multiple events/rows for each label and I have 26 labels in total.

What I would like to achieve is to get one event per label with max confidence.

Let's say we have 7 events that have label "Ambulance" and they have the following labelling confidence: 2, 5, 5, 4, 4, 3, 5.

The max confidence is 5 in this case, which gives us 3 selectable events. I would like to get one of the three at random.

Doing the following with pandas: df.groupby("label").max() will return the first row with max labelling confidence. I would like it to be a random selection.
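To make the example concrete, here is a minimal sketch of the situation described above. The column name `confidence` matches the answers below; the `audio` file names and the toy data are hypothetical. It lists the rows tied for the top confidence and shows that a row-selecting call such as `idxmax` always lands on the first of them.

```python
import pandas as pd

# Hypothetical toy data: seven "Ambulance" events with the
# confidences listed in the question (2, 5, 5, 4, 4, 3, 5).
df = pd.DataFrame({
    "audio": [f"clip_{i}.wav" for i in range(7)],  # made-up file names
    "label": ["Ambulance"] * 7,
    "confidence": [2, 5, 5, 4, 4, 3, 5],
})

# Rows whose confidence equals the maximum within their label:
# these are the candidates to choose from at random.
group_max = df.groupby("label")["confidence"].transform("max")
candidates = df[df["confidence"] == group_max]
print(candidates)

# idxmax returns the index label of the first occurrence of the
# maximum in each group, so the choice is deterministic.
first_max = df.groupby("label")["confidence"].idxmax()
print(first_max)
```

With this data the candidates are the rows at index 1, 2 and 6, and `idxmax` picks index 1 every time.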

Many thanks in advance

Cheers

Antoine

CodePudding user response:

Edit: following a comment from the OP, the simplest solution is to shuffle the data frame before picking the max rows:

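(
    # Shuffle all rows so that ties end up in random order.
    df.sample(frac=1)
      # The stable sort keeps tied rows in their shuffled order.
      .sort_values(['label', 'confidence'], ascending=False, kind='stable')
      .groupby('label')
      # The first row of each label now has the max confidence.
      .head(1)
)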
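A self-contained version of this shuffle-then-sort idea might look like the sketch below. The helper name `pick_random_max`, the `seed` parameter and the toy data are assumptions added for illustration; they are not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data with ties at the top of each label.
df = pd.DataFrame({
    "label": ["Ambulance"] * 7 + ["Engine"] * 3,
    "confidence": [2, 5, 5, 4, 4, 3, 5, 1, 3, 3],
})

def pick_random_max(frame, seed=None):
    """Return one random row per label among those with max confidence."""
    # Shuffle first; pass a seed only when reproducibility is wanted.
    shuffled = frame.sample(frac=1, random_state=seed)
    # Sort descending; tied rows keep their shuffled relative order.
    ordered = shuffled.sort_values(
        ["label", "confidence"], ascending=False, kind="stable"
    )
    # The first row of each label is now a random max-confidence row.
    return ordered.groupby("label").head(1)

picked = pick_random_max(df, seed=0)
print(picked)
```

Calling `pick_random_max(df)` without a seed gives a fresh random choice on each call, while the original index labels are preserved so the selected events can be traced back to `df`.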

CodePudding user response:

Here is how I finally managed to do it:

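# Shuffle the rows so that ties end up in random order.
shuffled_df = df.sample(frac=1)

# idxmax returns the first index label holding the max confidence in each
# group; after shuffling, that is a random one of the tied rows.
# Note: this relies on the index labels being unique.
filtered_df = shuffled_df.loc[shuffled_df.groupby("label")["confidence"].idxmax()]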
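The question also asks about returning n random max values. One possible generalisation, sketched here under assumptions, is to keep only the rows tied for the maximum, shuffle them, and take up to n per label. The helper name `sample_max_rows`, its `n` and `seed` parameters, and the toy data are hypothetical and not taken from the answers above.

```python
import pandas as pd

# Hypothetical toy data with ties at the top of each label.
df = pd.DataFrame({
    "label": ["Ambulance"] * 7 + ["Engine"] * 3,
    "confidence": [2, 5, 5, 4, 4, 3, 5, 1, 3, 3],
})

def sample_max_rows(frame, n=1, seed=None):
    """Return up to n random rows per label among the max-confidence rows."""
    # Keep only rows whose confidence equals their label's maximum.
    group_max = frame.groupby("label")["confidence"].transform("max")
    tied = frame[frame["confidence"] == group_max]
    # Shuffle the tied rows, then take the first n of each label.
    # Labels with fewer than n tied rows simply return all of them.
    return tied.sample(frac=1, random_state=seed).groupby("label").head(n)

two_each = sample_max_rows(df, n=2, seed=0)
print(two_each)
```

Because the filtering happens before the shuffle, every returned row is guaranteed to carry its label's maximum confidence, whatever value of `n` is requested.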
