I have this dataframe:
| Date | Position | TrainerID |
|---|---|---|
| 2017-09-03 | 4 | 1788 |
| 2017-09-16 | 5 | 1788 |
| 2017-10-14 | 1 | 1788 |
| 2017-10-14 | 3 | 1788 |
I want to compute on every row the winning percentage so far for the races in the last 1000 days, for a certain TrainerID. The result will be stored in a Win% Column. Dates need not to be unique. However, the winning % is considered to be before the race happened, so the current row is excluded. Thus, the results are delayed by one row.
So, what I am looking for is a result like this:
| Date | Position | TrainerID | Win% |
|---|---|---|---|
| 2017-09-03 | 4 | 1788 | 0 (0 wins, 0 races) |
| 2017-09-16 | 5 | 1788 | 0 (0 wins, 1 races) |
| 2017-10-14 | 1 | 1788 | 0 (0 wins, 2 races) |
| 2017-10-14 | 3 | 1788 | 33 (1 win, 3 races) |
I have most of the problem solved. I know I should add a shift(1) to the code, but I am not sure where it should go. My last attempt was:
df['Win'] = df['Position'].eq(1)
df = (df.reset_index().groupby('TrainerID').shift(1).apply(
lambda x: x.rolling(no_days, on='Date').agg(
{'Win': 'mean', 'index': 'max'}).reset_index(drop=True).set_index('index').shift().mul(100).round(2)))
But I got the following error message:
invalid on specified as Date, must be a column (of DataFrame), an Index or None
So, where do I put the shift() function? How can I solve this?
CodePudding user response:
You could use rolling with the Date index a datetime:
df['Date'] = pd.to_datetime(df['Date'])
df['Win%'] = (
df.set_index('Date')
.rolling('1000d') # last 1000 days
['Position']
.apply(lambda s: round(s.eq(1).sum()/len(s)*100))
.shift()
.values
)
output:
Date Position TrainerID Win%
0 2017-09-03 4 1788 NaN
1 2017-09-16 5 1788 0.0
2 2017-10-14 1 1788 0.0
3 2017-10-14 3 1788 33.0
NB. it is not clear if there are several "TrainerID", but in this case you could also perform everything grouped by "TrainerID"
applying per group
df['Win%'] = (
df.set_index('Date')
.groupby('TrainerID')
.rolling('1000d')['Position']
.apply(lambda s: round(s.eq(1).sum()/len(s)*100))
.groupby('TrainerID').shift()
.values
)
output:
Date Position TrainerID Win%
0 2017-09-03 4 1788 NaN
1 2017-09-16 5 1788 0.0
2 2017-10-14 1 1788 0.0
3 2017-10-14 3 1788 33.0
4 2017-09-03 4 1789 NaN
5 2017-09-16 5 1789 0.0
6 2017-10-14 1 1789 0.0
7 2017-10-14 3 1789 33.0
