Home > database >  Python - Calculating a rolling mode on a dataframe
Python - Calculating a rolling mode on a dataframe

Time:01-22

I have a dataset that is reporting values for a specific date, that can then be updated on subsequent dates, thus creating 2 columns, Date and Reported_Date, for each Reported_Value. There is a separate ID field that is my dataframe's index. I want to calculate the mode and max for the last 5 reported dates. I know I can use dataset['Reported_Value'].rolling(5).max() to calculate the max, but trying rolling with mode leads to an error, 'Rolling' object has no attribute 'mode'. Does anyone know how this could be achieved? Is there also a way to have it only compute across one Date? So that the first few values of 2021-12-02 weren't using 2021-12-01 values?

Example DataFrame:

    ID    Date          Reported_Date    Reported_Value    Max_Last_5_Reported_Days
     1    2021-12-01    2021-12-10                   5                 NaN
     2    2021-12-01    2021-12-11                   6                 NaN
     3    2021-12-01    2021-12-12                   5                 NaN
     4    2021-12-01    2021-12-13                   3                 NaN
     5    2021-12-01    2021-12-14                   2                 6
     6    2021-12-01    2021-12-15                   11                11
     7    2021-12-01    2021-12-16                   7                 11
     8    2021-12-01    2021-12-17                   5                 11
     9    2021-12-01    2021-12-18                   6                 11
     10   2021-12-01    2021-12-19                   7                 11
     11   2021-12-02    2021-12-10                   2                 7
     12   2021-12-02    2021-12-11                   3                 7
     13   2021-12-02    2021-12-12                   2                 7
     14   2021-12-02    2021-12-13                   4                 7
     15   2021-12-02    2021-12-14                   4                 4
     16   2021-12-02    2021-12-15                   4                 4
     17   2021-12-02    2021-12-16                   3                 4
     18   2021-12-02    2021-12-17                   4                 4
     19   2021-12-02    2021-12-18                   2                 4
     20   2021-12-02    2021-12-19                   4                 4

Desired DataFrame:

    ID    Date          Reported_Date    Reported_Value    Max_Last_5_Report_Days   Mode_L5RD
     1    2021-12-01    2021-12-10                   5                 NaN             NaN
     2    2021-12-01    2021-12-11                   6                 NaN             NaN
     3    2021-12-01    2021-12-12                   5                 NaN             NaN
     4    2021-12-01    2021-12-13                   3                 NaN             NaN
     5    2021-12-01    2021-12-14                   2                 6               5
     6    2021-12-01    2021-12-15                   11                11              NaN
     7    2021-12-01    2021-12-16                   6                 11              NaN
     8    2021-12-01    2021-12-17                   5                 11              NaN
     9    2021-12-01    2021-12-18                   6                 11              6
     10   2021-12-01    2021-12-19                   6                 11              6
     11   2021-12-02    2021-12-10                   2                 NaN             NaN
     12   2021-12-02    2021-12-11                   3                 NaN             NaN
     13   2021-12-02    2021-12-12                   2                 NaN             NaN
     14   2021-12-02    2021-12-13                   4                 NaN             NaN
     15   2021-12-02    2021-12-14                   4                 4               4
     16   2021-12-02    2021-12-15                   4                 4               4
     17   2021-12-02    2021-12-16                   3                 4               4
     18   2021-12-02    2021-12-17                   4                 4               4
     19   2021-12-02    2021-12-18                   2                 4               4
     20   2021-12-02    2021-12-19                   4                 4               4

I'm not sure how I would convey that there are multiple mode values, so they're listed as NaN in the example.

CodePudding user response:

groupby "Date" and use rolling_max for max for last 5 days; apply scipy.stats.mode for mode:

from scipy.stats import mode
rolling_obj = df.groupby('Date')['Reported_Value'].rolling(5)
df['Max_Last_5_Report_Days'] = rolling_obj.max().droplevel(0)
df['Mode_L5RD'] = rolling_obj.apply(lambda x: mode(x)[0]).droplevel(0)

Output:

    ID        Date Reported_Date  Reported_Value  Max_Last_5_Reported_Days  \
0    1  2021-12-01    2021-12-10               5                       NaN   
1    2  2021-12-01    2021-12-11               6                       NaN   
2    3  2021-12-01    2021-12-12               5                       NaN   
3    4  2021-12-01    2021-12-13               3                       NaN   
4    5  2021-12-01    2021-12-14               2                       6.0   
5    6  2021-12-01    2021-12-15              11                      11.0   
6    7  2021-12-01    2021-12-16               7                      11.0   
7    8  2021-12-01    2021-12-17               5                      11.0   
8    9  2021-12-01    2021-12-18               6                      11.0   
9   10  2021-12-01    2021-12-19               7                      11.0   
10  11  2021-12-02    2021-12-10               2                       7.0   
11  12  2021-12-02    2021-12-11               3                       7.0   
12  13  2021-12-02    2021-12-12               2                       7.0   
13  14  2021-12-02    2021-12-13               4                       7.0   
14  15  2021-12-02    2021-12-14               4                       4.0   
15  16  2021-12-02    2021-12-15               4                       4.0   
16  17  2021-12-02    2021-12-16               3                       4.0   
17  18  2021-12-02    2021-12-17               4                       4.0   
18  19  2021-12-02    2021-12-18               2                       4.0   
19  20  2021-12-02    2021-12-19               4                       4.0   

    Max_Last_5_Report_Days  Mode_L5RD  
0                      NaN        NaN  
1                      NaN        NaN  
2                      NaN        NaN  
3                      NaN        NaN  
4                      6.0        5.0  
5                     11.0        2.0  
6                     11.0        2.0  
7                     11.0        2.0  
8                     11.0        2.0  
9                     11.0        7.0  
10                     NaN        NaN  
11                     NaN        NaN  
12                     NaN        NaN  
13                     NaN        NaN  
14                     4.0        2.0  
15                     4.0        4.0  
16                     4.0        4.0  
17                     4.0        4.0  
18                     4.0        4.0  
19                     4.0        4.0  

CodePudding user response:

Mode is not a predefined function, however you can apply a custom function using rolling(5).apply(custom_function). For your case that could be

dataset['Reported_Value'].rolling(5).apply(lamba s: s.mode())
  •  Tags:  
  • Related