Calculate 2-month rolling average for group in Pandas-CodePudding

I'm trying to calculate a two-month rolling average of a player's 'score' using the below df:

df = pd.DataFrame({'player_id': [1098, 1098, 1098, 1098, 1116, 1116, 1116, 1116, 1116, 1116, 1116, 1116],
              'date': ['2018-06-22', '2018-06-23', '2018-07-24', '2018-07-25', 
                      '2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25',
                      '2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25'],
              'score': [-2,1,2,3,-8,3,2,-3,-2,1,2,3]})

I'd want the averages to be grouped by the 'player_ID' column so that player 1098 would have averages of -2, -0.5, 0.33, and 1 for each row, respectively. If there's just one date in 2020, it'd just be that value. If there's a date in January and February of 2020, the January row would just be that value while the Feb row would be the average of those two.

CodePudding user response：

You can group by 'player_id' and use expanding().mean() method on 'score' column:

df['rolling mean'] = df.groupby('player_id')['score'].expanding().mean().round(2).droplevel(0)

Edit:

Given the new information in the comments, perhaps you want rolling.mean method. Added some rows to the dataframe from the OP to better show what's going on. For the DataFrame df:

df = pd.DataFrame({'player_id': [1098, 1098, 1098, 1098, 1098, 1098, 1098, 1098, 
                                 1116, 1116, 1116, 1116, 1116, 1116, 1116, 1116], 
                   'date': ['2018-06-22', '2018-06-23', '2018-07-24', '2018-07-25', 
                            '2019-06-22', '2019-06-25', '2019-07-25', '2020-06-22', 
                            '2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25', 
                            '2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25'], 
                   'score': [-2, 1, 2, 3, 7, 8, 6, 5, -8, 3, 2, -3, -2, 1, 2, 3]})

Here we find the rolling mean over 60 days for each 'player_id':

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by=['player_id','date'])
df['rolling_mean'] = df.set_index('date').groupby('player_id', sort=False)['score'].rolling('60D').mean().round(2).to_numpy()

Output:

    player_id       date  score  rolling_mean
0        1098 2018-06-22     -2         -2.00
1        1098 2018-06-23      1         -0.50
2        1098 2018-07-24      2          0.33
3        1098 2018-07-25      3          1.00
6        1098 2019-06-22      7          7.00
7        1098 2019-06-25      8          7.50
5        1098 2019-07-25      6          7.00
4        1098 2020-06-22      5          5.00
8        1116 2018-07-22     -8         -8.00
9        1116 2018-07-23      3         -2.50
10       1116 2018-07-24      2         -1.00
11       1116 2018-07-25     -3         -1.50
12       1116 2018-08-22     -2         -1.60
13       1116 2018-08-23      1         -1.17
14       1116 2018-08-24      2         -0.71
15       1116 2018-08-25      3         -0.25

CodePudding user response：

Not sure if it's the most elegant, but here you go:

import pandas as pd
    
#DataFrame you provided
df = pd.DataFrame({'player_id': [1098, 1098, 1098, 1098, 1116, 1116, 1116, 1116, 1116, 1116, 1116, 1116],
                  'date': ['2018-06-22', '2018-06-23', '2018-07-24', '2018-07-25', 
                          '2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25',
                          '2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25'],
                  'score': [-2,1,2,3,-8,3,2,-3,-2,1,2,3]})
    
    #As best practice, set date strings to date type
    df['date'] = pd.to_datetime(df['date'])
    
    """
    Group by player and date, then add a rolling average with minimum level 1
    (meaning you can start rolling from the first row). The rolling function creates a series
    from your dataframe
    """
    rolling_average_series = df.groupby(by=['player_id','date']).rolling(1)['score'].mean()

#Reset series as dataframe with .to_frame then reset index so that player_id and date are still columns
df_grouped = rolling_average_series.to_frame().reset_index(level=['player_id','date'])

Apologies for the weird formatting.

CodePudding user response：

df = pd.DataFrame({'player_id': [1098, 1098, 1098, 1098, 1116, 1116, 1116, 1116, 1116, 1116, 1116, 1116],
              'date': ['2018-06-22', '2018-06-23', '2018-07-24', '2018-07-25', 
                      '2018-07-22', '2018-07-23', '2018-07-24', '2018-07-25',
                      '2018-08-22', '2018-08-23', '2018-08-24', '2018-08-25'],
              'score': [-2,1,2,3,-8,3,2,-3,-2,1,2,3]})

window=4
min_periods=3
cap=10

def get_month(date):
    return date.month

df=df.sort_values(by='date')
df['date']=pd.to_datetime(df['date'])
df['month']=df['date'].apply(get_month)
df['elapsed_days'] = (df['date']-df['date'].shift(1)).dt.days
df['accumulative_elapsed_days'] = df['elapsed_days'].cumsum()

df['rolling_mean'] = df['score'].rolling(window, min_periods).mean().round(2)

print(df[['date','score','month','elapsed_days','accum_elapsed_days','rolling_mean']]) df.set_index('date',inplace=True)

df[['score','rolling_mean']].plot()
plt.show()

output:

         date  score  month  elapsed_days  accum_elapsed_days  rolling_mean
0  2018-06-22     -2      6           NaN                 NaN           NaN
1  2018-06-23      1      6           1.0                 1.0           NaN
4  2018-07-22     -8      7          29.0                30.0         -3.00
5  2018-07-23      3      7           1.0                31.0         -1.50
2  2018-07-24      2      7           1.0                32.0         -0.50
6  2018-07-24      2      7           0.0                32.0         -0.25
3  2018-07-25      3      7           1.0                33.0          2.50
7  2018-07-25     -3      7           0.0                33.0          1.00
8  2018-08-22     -2      8          28.0                61.0          0.00
9  2018-08-23      1      8           1.0                62.0         -0.25
10 2018-08-24      2      8           1.0                63.0         -0.50
11 2018-08-25      3      8           1.0                64.0          1.00