Find the number of days since a max value-CodePudding

Given the following DataFrame:

 ---- -------- ------------ ------ --------------------- 
| id | player | match_date | stat | days_since_max_stat |
 ---- -------- ------------ ------ --------------------- 
|  1 |      1 | 2022-01-01 | 1500 | NaN                 |
|  2 |      1 | 2022-01-03 | 1600 | 2                   |
|  3 |      1 | 2022-01-10 | 2100 | 7                   |
|  4 |      1 | 2022-01-11 | 1800 | 1                   |
|  5 |      1 | 2022-01-18 | 1700 | 8                   |
|  6 |      2 | 2022-01-01 | 1600 | NaN                 |
|  7 |      2 | 2022-01-03 | 1800 | 2                   |
|  8 |      2 | 2022-01-10 | 1600 | 7                   |
|  9 |      2 | 2022-01-11 | 1900 | 8                   |
| 10 |      2 | 2022-01-18 | 1500 | 7                   |
 ---- -------- ------------ ------ ---------------------

How would I calculate the days_since_max_stat column? The calculation of this column is exclusive of the stat in that row and per player.

For example the value for the row where id = 5 is 8 because the max stat was in the row where id = 3. The days_since_max_stat = 2022-01-18 - 2022-01-10 = 8.

Here's the base DataFrame:

import datetime as dt
import pandas as pd


dates = [
    dt.datetime(2022, 1, 1),
    dt.datetime(2022, 1, 3),
    dt.datetime(2022, 1, 10),
    dt.datetime(2022, 1, 11),
    dt.datetime(2022, 1, 18),
]
df = pd.DataFrame(
    {
        "id": range(1, 11),
        "player": [1 for i in range(5)]   [2 for i in range(5)],
        "match_date": dates   dates,
        "stat": (1500, 1600, 2100, 1800, 1700, 1600, 1800, 1600, 1900, 1500)
    }
)

CodePudding user response：

You can use a double groupby. The important part is to compute a new group to put together the rows that are lower than the last max. Once you have done that this is a simple cumsum per group:

g = df.groupby(df['player'])
# date diff per group (days)
diff = g['match_date'].diff().dt.days
# group per lower than last max
group = df['stat'].ge(g['stat'].cummax()).shift().cumsum()
# days since last max
df['dsms'] = diff.groupby([df['player'], group]).cumsum()

Output:

   id  player match_date  stat  dsms
0   1       1 2022-01-01  1500   NaN
1   2       1 2022-01-03  1600   2.0
2   3       1 2022-01-10  2100   7.0
3   4       1 2022-01-11  1800   1.0
4   5       1 2022-01-18  1700   8.0
5   6       2 2022-01-01  1600   NaN
6   7       2 2022-01-03  1800   2.0
7   8       2 2022-01-10  1600   7.0
8   9       2 2022-01-11  1900   8.0

CodePudding user response：

First imagine you have only one id, then you can use expanding to find the cummulative max/idxmax. then you can subtract:

def day_since_max(data):
    maxIdx = data['stat'].expanding().apply(pd.Series.idxmax)
    date_at_max = data.loc[maxIdx, 'match_date'].shift()
    return data['match_date'] - date_at_max.values

Now, we can use groupby().apply to apply that function for each id:

df['days_since_max'] = df.groupby('player').apply(day_since_max).reset_index(level=0, drop=True)

Output:

   id  player match_date  stat days_since_max
0   1       1 2022-01-01  1500            NaT
1   2       1 2022-01-03  1600         2 days
2   3       1 2022-01-10  2100         7 days
3   4       1 2022-01-11  1800         1 days
4   5       1 2022-01-18  1700         8 days
5   6       2 2022-01-01  1600            NaT
6   7       2 2022-01-03  1800         2 days
7   8       2 2022-01-10  1600         7 days
8   9       2 2022-01-11  1900         8 days
9  10       2 2022-01-18  1500         7 days