Using multiple conditional statements to make new columns-CodePudding

I have a data frame, train_df:

data = {'driverID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
         'raceID': [2, 3, 4, 5, 6, 2, 3, 4, 5, 6],
         'race_date': ['2018-04-20', '2018-06-20', '2018-08-20', '2018-10-20', '2019-01-20',
                       '2018-04-20', '2018-06-20', '2018-08-20', '2018-10-20', '2019-01-20'],
         'win': [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]}

train_df = pd.DataFrame(data)

   driverID  raceID    race_date       win
0   1          2       2018-04-20       1
1   1          3       2018-06-20       0
2   1          4       2018-08-20       0
3   1          5       2018-10-20       1
4   1          6       2019-01-20       1
5   2          2       2018-04-20       0
6   2          3       2018-06-20       1
7   2          4       2018-08-20       1
8   2          5       2018-10-20       0
9   2          6       2019-01-20       0

with two drivers, multiple races at different dates and a win column indicating if a driver won a race or not. The driverID is a unique identifier for each driver, and I am trying to count up their wins, the number of months from their first win, and the last column (World Class (WC)), i.e. to identify the world-class drivers based on my definition of world-class (which is a driver that has at least two wins in their first six months of racing). So the resulting data frame should look something like this:

   driverID  raceID    race_date       win      win_total     months_from_first  WC
0   1          2       2018-04-20       1          1                  0          n 
1   1          3       2018-06-20       0          1                  2          n
2   1          4       2018-08-20       0          1                  4          n 
3   1          5       2018-10-20       1          2                  6          y
4   1          6       2019-01-20       1          3                  9          y
5   2          2       2018-04-20       0          0                  0          n
6   2          3       2018-06-20       1          1                  0          n
7   2          4       2018-08-20       1          2                  2          y
8   2          5       2018-10-20       0          2                  4          y
9   2          6       2019-01-20       0          2                  7          y

I'm struggling to solve my first two goals and have searched around, but I can't find any questions similar to this, plus I struggled to find similar questions. So any help is much appreciated, thanks!! I believe I can solve the last (i.e make the WC column once I get help).

CodePudding user response：

One option is to groupby "driverID" and apply a custom function that does various manipulations to get the desired data.

def get_data(x):
    # find the cumulative sum of wins
    x['win_total'] = x['win'].cumsum()
    # get the difference in months between race date and the first win
    x['months_from_first'] = (x['race_date'] - x.loc[x['win']==1, 'race_date'].iloc[0])//np.timedelta64(1, 'M')
    # fill race dates before first win with 0
    x.loc[x['months_from_first'] < 0, 'months_from_first'] = 0 
    return x

df['race_date'] = pd.to_datetime(df['race_date'])
df = df.groupby('driverID').apply(get_data)

# define WC using 2 wins as a cutoff point
df['WC'] = np.where(df['win_total'] >= 2, 'y', 'n')

Output:

   driverID  raceID  race_date  win  win_total  months_from_first WC
0         1       2 2018-04-20    1          1                  0  n
1         1       3 2018-06-20    0          1                  2  n
2         1       4 2018-08-20    0          1                  4  n
3         1       5 2018-10-20    1          2                  6  y
4         1       6 2019-01-20    1          3                  9  y
5         2       2 2018-04-20    0          0                  0  n
6         2       3 2018-06-20    1          1                  0  n
7         2       4 2018-08-20    1          2                  2  y
8         2       5 2018-10-20    0          2                  4  y
9         2       6 2019-01-20    0          2                  7  y

CodePudding user response：

To get the first two columns:

import pandas as pd
df['race_date'] = pd.to_datetime(df['race_date'])
df['total']=df.groupby('driverID')['win'].cumsum()
df['month']= df['race_date'].dt.to_period('M') - df.groupby('driverID')['race_date'].transform('min').dt.to_period('M')
df