Python Pandas How to get rid of groupings with only 1 row?-CodePudding

In my dataset, I am trying to get the margin between two values. The code below runs perfectly if the fourth race was not included. After grouping based on a column, it seems that sometimes, there will be only 1 value, therefore, no other value to get a margin out of. I want to ignore these groupings in that case. Here is my current code:

import pandas as pd

data = {'Name':['A', 'B', 'B', 'C', 'A', 'C', 'A'], 'RaceNumber': 
[1, 1, 2, 2, 3, 3, 4], 'PlaceWon':['First', 'Second', 'First', 'Second', 'First', 'Second', 'First'], 'TimeRanInSec':[100, 98, 66, 60, 75, 70, 75]}

df = pd.DataFrame(data)

print(df)

def winning_margin(times):
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner

winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
                .groupby('RaceNumber').agg(winning_margin)
winning_margins.columns = ['margin']

winners = df.loc[df.PlaceWon == 'First', :]
winners = winners.join(winning_margins, on='RaceNumber')

avg_margins = winners[['Name', 'margin']].groupby('Name').mean()
avg_margins

CodePudding user response：

How about returning a NaN if times does not have enough elements:

import numpy as np
def winning_margin(times):
    if len(times) <= 1: # New code
        return np.NaN   # New code
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner

your code runs with this change and seem to produce sensible results. But you can furthermore remove NaNs later if you want eg in this line

winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
                .groupby('RaceNumber').agg(winning_margin).dropna() # note the addition of .dropna()

CodePudding user response：

You could get the winner and margin in one step:

 def get_margin(x):
        if len(x) < 2:
            return np.NaN
        
        i = x['TimeRanInSec'].idxmin()
        nl = x['TimeRanInSec'].nsmallest(2)
        margin = nl.max()-nl.min()
    
        return [x['Name'].loc[i], margin]

Then:

df.groupby('RaceNumber').apply(get_margin).dropna()

RaceNumber
1    [B, 2]
2    [C, 6]
3    [C, 5]

(the data has the 'First' indicator corresponding to the slower time in the data)