Home > Mobile >  Create dataframe column that computes a rolling average while excluding values
Create dataframe column that computes a rolling average while excluding values

Time:01-13

I'm looking for an efficient way to create the 'average_units' column - which, for each group in Col B, goes row by row and computes the average units for previous 4 days while EXCLUDING 0 unit days. Simple example below:

Original DF

   ColA        ColB      Units
2021-01-01      A          2
2021-01-02      A          0
2021-01-03      A          4
2021-01-04      A          8
2021-01-05      A          1
2021-01-06      A          0
2021-01-07      A          0
2021-01-08      A          7
2021-01-01      B          3
2021-01-02      B          6
2021-01-03      B          9
2021-01-04      B          0
2021-01-05      B          1
2021-01-06      B          7
2021-01-07      B          0
2021-01-08      B          5

Desired DF... Example of 'Avg Units' calculation: On 2021-01-08 in group A... calculation will look at the previous 4 days of units (0,0,1,8)...and will compute the average with the 0's excluded - thus 1 8 / 2 = 4.5

   ColA        ColB      Units    Avg Units
2021-01-01      A          2         NA
2021-01-02      A          0         NA
2021-01-03      A          4         NA
2021-01-04      A          8         NA
2021-01-05      A          1         7
2021-01-06      A          0         6.5
2021-01-07      A          0         6.5
2021-01-08      A          7         4.5
2021-01-01      B          3         NA
2021-01-02      B          6         NA
2021-01-03      B          9         NA
2021-01-04      B          0         NA
2021-01-05      B          1         9
2021-01-06      B          7         8
2021-01-07      B          0         8.5
2021-01-08      B          5         4

I know this can be done with some loops but it would be super complex and time consuming because I have 1000's of groups and more dates. Any suggestions is appreciated!

CodePudding user response:

import numpy as np
def rolling_average_custom(group):
    group_parsed = group.iloc[0:4].replace(0, np.nan)
    return group_parsed.mean()

tmp.groupby('ColB').rolling(min_periods=5, window=5).apply(rolling_average_custom)

returns

        Units
ColB        
A   0   NaN
1   NaN
2   NaN
3   NaN
4   4.666667
5   4.333333
6   4.333333
7   4.500000
B   8   NaN
9   NaN
10  NaN
11  NaN
12  6.000000
13  5.333333
14  5.666667
15  4.000000

I think you have some mistakes in your example. Like the first value should not be 7, but (8 4 2)/3. If that's the case, the solution should work.

To add to the dataframe:

rolling_output = tmp.groupby('ColB').rolling(min_periods=5, window=5).apply(rolling_average_custom)
tmp = tmp.set_index(['ColB'], append=True)
tmp = tmp.swaplevel()
tmp['average'] = rolling_output

Result:

        ColA    Units   average
ColB                
A   0   2021-01-01  2   NaN
1   2021-01-02  0   NaN
2   2021-01-03  4   NaN
3   2021-01-04  8   NaN
4   2021-01-05  1   4.666667
5   2021-01-06  0   4.333333
6   2021-01-07  0   4.333333
7   2021-01-08  7   4.500000
B   8   2021-01-01  3   NaN
9   2021-01-02  6   NaN
10  2021-01-03  9   NaN
11  2021-01-04  0   NaN
12  2021-01-05  1   6.000000
13  2021-01-06  7   5.333333
14  2021-01-07  0   5.666667
15  2021-01-08  5   4.000000
  •  Tags:  
  • Related