I have a data frame, train_df:
data = {'driverID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
'raceID': [2, 3, 4, 5, 6, 2, 3, 4, 5, 6],
'race_date': ['2018-04-20', '2018-06-20', '2018-08-20', '2018-10-20', '2019-01-20',
'2018-04-20', '2018-06-20', '2018-08-20', '2018-10-20', '2019-01-20'],
'win': [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]}
train_df = pd.DataFrame(data)
driverID raceID race_date win
0 1 2 2018-04-20 1
1 1 3 2018-06-20 0
2 1 4 2018-08-20 0
3 1 5 2018-10-20 1
4 1 6 2019-01-20 1
5 2 2 2018-04-20 0
6 2 3 2018-06-20 1
7 2 4 2018-08-20 1
8 2 5 2018-10-20 0
9 2 6 2019-01-20 0
with two drivers, multiple races at different dates and a win column indicating if a driver won a race or not. The driverID is a unique identifier for each driver, and I am trying to count up their wins, the number of months from their first win, and the last column (World Class (WC)), i.e. to identify the world-class drivers based on my definition of world-class (which is a driver that has at least two wins in their first six months of racing). So the resulting data frame should look something like this:
driverID raceID race_date win win_total months_from_first WC
0 1 2 2018-04-20 1 1 0 n
1 1 3 2018-06-20 0 1 2 n
2 1 4 2018-08-20 0 1 4 n
3 1 5 2018-10-20 1 2 6 y
4 1 6 2019-01-20 1 3 9 y
5 2 2 2018-04-20 0 0 0 n
6 2 3 2018-06-20 1 1 0 n
7 2 4 2018-08-20 1 2 2 y
8 2 5 2018-10-20 0 2 4 y
9 2 6 2019-01-20 0 2 7 y
I'm struggling to solve my first two goals and have searched around, but I can't find any questions similar to this, plus I struggled to find similar questions. So any help is much appreciated, thanks!! I believe I can solve the last (i.e make the WC column once I get help).
CodePudding user response:
One option is to groupby "driverID" and apply a custom function that does various manipulations to get the desired data.
def get_data(x):
# find the cumulative sum of wins
x['win_total'] = x['win'].cumsum()
# get the difference in months between race date and the first win
x['months_from_first'] = (x['race_date'] - x.loc[x['win']==1, 'race_date'].iloc[0])//np.timedelta64(1, 'M')
# fill race dates before first win with 0
x.loc[x['months_from_first'] < 0, 'months_from_first'] = 0
return x
df['race_date'] = pd.to_datetime(df['race_date'])
df = df.groupby('driverID').apply(get_data)
# define WC using 2 wins as a cutoff point
df['WC'] = np.where(df['win_total'] >= 2, 'y', 'n')
Output:
driverID raceID race_date win win_total months_from_first WC
0 1 2 2018-04-20 1 1 0 n
1 1 3 2018-06-20 0 1 2 n
2 1 4 2018-08-20 0 1 4 n
3 1 5 2018-10-20 1 2 6 y
4 1 6 2019-01-20 1 3 9 y
5 2 2 2018-04-20 0 0 0 n
6 2 3 2018-06-20 1 1 0 n
7 2 4 2018-08-20 1 2 2 y
8 2 5 2018-10-20 0 2 4 y
9 2 6 2019-01-20 0 2 7 y
CodePudding user response:
To get the first two columns:
import pandas as pd
df['race_date'] = pd.to_datetime(df['race_date'])
df['total']=df.groupby('driverID')['win'].cumsum()
df['month']= df['race_date'].dt.to_period('M') - df.groupby('driverID')['race_date'].transform('min').dt.to_period('M')
df
