Given the following DataFrame:
---- -------- ------------ ------ ---------------------
| id | player | match_date | stat | days_since_max_stat |
---- -------- ------------ ------ ---------------------
| 1 | 1 | 2022-01-01 | 1500 | NaN |
| 2 | 1 | 2022-01-03 | 1600 | 2 |
| 3 | 1 | 2022-01-10 | 2100 | 7 |
| 4 | 1 | 2022-01-11 | 1800 | 1 |
| 5 | 1 | 2022-01-18 | 1700 | 8 |
| 6 | 2 | 2022-01-01 | 1600 | NaN |
| 7 | 2 | 2022-01-03 | 1800 | 2 |
| 8 | 2 | 2022-01-10 | 1600 | 7 |
| 9 | 2 | 2022-01-11 | 1900 | 8 |
| 10 | 2 | 2022-01-18 | 1500 | 7 |
---- -------- ------------ ------ ---------------------
How would I calculate the days_since_max_stat column? The calculation of this column is exclusive of the stat in that row and per player.
For example the value for the row where id = 5 is 8 because the max stat was in the row where id = 3. The days_since_max_stat = 2022-01-18 - 2022-01-10 = 8.
Here's the base DataFrame:
import datetime as dt
import pandas as pd
dates = [
dt.datetime(2022, 1, 1),
dt.datetime(2022, 1, 3),
dt.datetime(2022, 1, 10),
dt.datetime(2022, 1, 11),
dt.datetime(2022, 1, 18),
]
df = pd.DataFrame(
{
"id": range(1, 11),
"player": [1 for i in range(5)] [2 for i in range(5)],
"match_date": dates dates,
"stat": (1500, 1600, 2100, 1800, 1700, 1600, 1800, 1600, 1900, 1500)
}
)
CodePudding user response:
You can use a double groupby. The important part is to compute a new group to put together the rows that are lower than the last max. Once you have done that this is a simple cumsum per group:
g = df.groupby(df['player'])
# date diff per group (days)
diff = g['match_date'].diff().dt.days
# group per lower than last max
group = df['stat'].ge(g['stat'].cummax()).shift().cumsum()
# days since last max
df['dsms'] = diff.groupby([df['player'], group]).cumsum()
Output:
id player match_date stat dsms
0 1 1 2022-01-01 1500 NaN
1 2 1 2022-01-03 1600 2.0
2 3 1 2022-01-10 2100 7.0
3 4 1 2022-01-11 1800 1.0
4 5 1 2022-01-18 1700 8.0
5 6 2 2022-01-01 1600 NaN
6 7 2 2022-01-03 1800 2.0
7 8 2 2022-01-10 1600 7.0
8 9 2 2022-01-11 1900 8.0
CodePudding user response:
First imagine you have only one id, then you can use expanding to find the cummulative max/idxmax. then you can subtract:
def day_since_max(data):
maxIdx = data['stat'].expanding().apply(pd.Series.idxmax)
date_at_max = data.loc[maxIdx, 'match_date'].shift()
return data['match_date'] - date_at_max.values
Now, we can use groupby().apply to apply that function for each id:
df['days_since_max'] = df.groupby('player').apply(day_since_max).reset_index(level=0, drop=True)
Output:
id player match_date stat days_since_max
0 1 1 2022-01-01 1500 NaT
1 2 1 2022-01-03 1600 2 days
2 3 1 2022-01-10 2100 7 days
3 4 1 2022-01-11 1800 1 days
4 5 1 2022-01-18 1700 8 days
5 6 2 2022-01-01 1600 NaT
6 7 2 2022-01-03 1800 2 days
7 8 2 2022-01-10 1600 7 days
8 9 2 2022-01-11 1900 8 days
9 10 2 2022-01-18 1500 7 days
