Home > database >  How can I fill a column with values that are computed between two dates in pandas, with a delay of o
How can I fill a column with values that are computed between two dates in pandas, with a delay of o

Time:02-07

I have this dataframe:

Date Position TrainerID
2017-09-03 4 1788
2017-09-16 5 1788
2017-10-14 1 1788
2017-10-14 3 1788

I want to compute on every row the winning percentage so far for the races in the last 1000 days, for a certain TrainerID. The result will be stored in a Win% Column. Dates need not to be unique. However, the winning % is considered to be before the race happened, so the current row is excluded. Thus, the results are delayed by one row.

So, what I am looking for is a result like this:

Date Position TrainerID Win%
2017-09-03 4 1788 0 (0 wins, 0 races)
2017-09-16 5 1788 0 (0 wins, 1 races)
2017-10-14 1 1788 0 (0 wins, 2 races)
2017-10-14 3 1788 33 (1 win, 3 races)

I have most of the problem solved. I know I should add a shift(1) to the code, but I am not sure where it should go. My last attempt was:

df['Win'] = df['Position'].eq(1)
df = (df.reset_index().groupby('TrainerID').shift(1).apply(
            lambda x: x.rolling(no_days, on='Date').agg(
                {'Win': 'mean', 'index': 'max'}).reset_index(drop=True).set_index('index').shift().mul(100).round(2)))

But I got the following error message:

invalid on specified as Date, must be a column (of DataFrame), an Index or None

So, where do I put the shift() function? How can I solve this?

CodePudding user response:

You could use rolling with the Date index a datetime:

df['Date'] = pd.to_datetime(df['Date'])
df['Win%'] = (
 df.set_index('Date')
   .rolling('1000d')  # last 1000 days
   ['Position']
   .apply(lambda s: round(s.eq(1).sum()/len(s)*100))
   .shift()
   .values
)

output:

        Date  Position  TrainerID  Win%
0 2017-09-03         4       1788   NaN
1 2017-09-16         5       1788   0.0
2 2017-10-14         1       1788   0.0
3 2017-10-14         3       1788  33.0

NB. it is not clear if there are several "TrainerID", but in this case you could also perform everything grouped by "TrainerID"

applying per group
df['Win%'] = (
 df.set_index('Date')
   .groupby('TrainerID')
   .rolling('1000d')['Position']
   .apply(lambda s: round(s.eq(1).sum()/len(s)*100))
   .groupby('TrainerID').shift()
   .values
)

output:

        Date  Position  TrainerID  Win%
0 2017-09-03         4       1788   NaN
1 2017-09-16         5       1788   0.0
2 2017-10-14         1       1788   0.0
3 2017-10-14         3       1788  33.0
4 2017-09-03         4       1789   NaN
5 2017-09-16         5       1789   0.0
6 2017-10-14         1       1789   0.0
7 2017-10-14         3       1789  33.0
  •  Tags:  
  • Related