I have a dataset that is reporting values for a specific date, that can then be updated on subsequent dates, thus creating 2 columns, Date and Reported_Date, for each Reported_Value. There is a separate ID field that is my dataframe's index. I want to calculate the mode and max for the last 5 reported dates. I know I can use dataset['Reported_Value'].rolling(5).max() to calculate the max, but trying rolling with mode leads to an error, 'Rolling' object has no attribute 'mode'. Does anyone know how this could be achieved? Is there also a way to have it only compute across one Date? So that the first few values of 2021-12-02 weren't using 2021-12-01 values?
Example DataFrame:
ID Date Reported_Date Reported_Value Max_Last_5_Reported_Days
1 2021-12-01 2021-12-10 5 NaN
2 2021-12-01 2021-12-11 6 NaN
3 2021-12-01 2021-12-12 5 NaN
4 2021-12-01 2021-12-13 3 NaN
5 2021-12-01 2021-12-14 2 6
6 2021-12-01 2021-12-15 11 11
7 2021-12-01 2021-12-16 7 11
8 2021-12-01 2021-12-17 5 11
9 2021-12-01 2021-12-18 6 11
10 2021-12-01 2021-12-19 7 11
11 2021-12-02 2021-12-10 2 7
12 2021-12-02 2021-12-11 3 7
13 2021-12-02 2021-12-12 2 7
14 2021-12-02 2021-12-13 4 7
15 2021-12-02 2021-12-14 4 4
16 2021-12-02 2021-12-15 4 4
17 2021-12-02 2021-12-16 3 4
18 2021-12-02 2021-12-17 4 4
19 2021-12-02 2021-12-18 2 4
20 2021-12-02 2021-12-19 4 4
Desired DataFrame:
ID Date Reported_Date Reported_Value Max_Last_5_Report_Days Mode_L5RD
1 2021-12-01 2021-12-10 5 NaN NaN
2 2021-12-01 2021-12-11 6 NaN NaN
3 2021-12-01 2021-12-12 5 NaN NaN
4 2021-12-01 2021-12-13 3 NaN NaN
5 2021-12-01 2021-12-14 2 6 5
6 2021-12-01 2021-12-15 11 11 NaN
7 2021-12-01 2021-12-16 6 11 NaN
8 2021-12-01 2021-12-17 5 11 NaN
9 2021-12-01 2021-12-18 6 11 6
10 2021-12-01 2021-12-19 6 11 6
11 2021-12-02 2021-12-10 2 NaN NaN
12 2021-12-02 2021-12-11 3 NaN NaN
13 2021-12-02 2021-12-12 2 NaN NaN
14 2021-12-02 2021-12-13 4 NaN NaN
15 2021-12-02 2021-12-14 4 4 4
16 2021-12-02 2021-12-15 4 4 4
17 2021-12-02 2021-12-16 3 4 4
18 2021-12-02 2021-12-17 4 4 4
19 2021-12-02 2021-12-18 2 4 4
20 2021-12-02 2021-12-19 4 4 4
I'm not sure how I would convey that there are multiple mode values, so they're listed as NaN in the example.
CodePudding user response:
groupby "Date" and use rolling_max for max for last 5 days; apply scipy.stats.mode for mode:
from scipy.stats import mode
rolling_obj = df.groupby('Date')['Reported_Value'].rolling(5)
df['Max_Last_5_Report_Days'] = rolling_obj.max().droplevel(0)
df['Mode_L5RD'] = rolling_obj.apply(lambda x: mode(x)[0]).droplevel(0)
Output:
ID Date Reported_Date Reported_Value Max_Last_5_Reported_Days \
0 1 2021-12-01 2021-12-10 5 NaN
1 2 2021-12-01 2021-12-11 6 NaN
2 3 2021-12-01 2021-12-12 5 NaN
3 4 2021-12-01 2021-12-13 3 NaN
4 5 2021-12-01 2021-12-14 2 6.0
5 6 2021-12-01 2021-12-15 11 11.0
6 7 2021-12-01 2021-12-16 7 11.0
7 8 2021-12-01 2021-12-17 5 11.0
8 9 2021-12-01 2021-12-18 6 11.0
9 10 2021-12-01 2021-12-19 7 11.0
10 11 2021-12-02 2021-12-10 2 7.0
11 12 2021-12-02 2021-12-11 3 7.0
12 13 2021-12-02 2021-12-12 2 7.0
13 14 2021-12-02 2021-12-13 4 7.0
14 15 2021-12-02 2021-12-14 4 4.0
15 16 2021-12-02 2021-12-15 4 4.0
16 17 2021-12-02 2021-12-16 3 4.0
17 18 2021-12-02 2021-12-17 4 4.0
18 19 2021-12-02 2021-12-18 2 4.0
19 20 2021-12-02 2021-12-19 4 4.0
Max_Last_5_Report_Days Mode_L5RD
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 6.0 5.0
5 11.0 2.0
6 11.0 2.0
7 11.0 2.0
8 11.0 2.0
9 11.0 7.0
10 NaN NaN
11 NaN NaN
12 NaN NaN
13 NaN NaN
14 4.0 2.0
15 4.0 4.0
16 4.0 4.0
17 4.0 4.0
18 4.0 4.0
19 4.0 4.0
CodePudding user response:
Mode is not a predefined function, however you can apply a custom function using rolling(5).apply(custom_function). For your case that could be
dataset['Reported_Value'].rolling(5).apply(lamba s: s.mode())
