Let's say I have a dataframe with 3 columns: dt, unit, and sold. I would like to create a new column, say prior_3_avg, that is, as the name suggests, the average of sold per unit over the past three dates that fall on the same day of the week as dt. E.g., for unit "1" on May 5th 2020, what's the average it sold on April 28th, 21st, and 14th, which are the last three Tuesdays?
Toy sample data:
import pandas as pd
df = pd.DataFrame({'dt':['2020-5-1','2020-5-2','2020-5-3','2020-5-4','2020-5-5','2020-5-6','2020-5-7','2020-5-8','2020-5-9','2020-5-10','2020-5-11','2020-5-12','2020-5-13','2020-5-14','2020-5-15','2020-5-16','2020-5-17','2020-5-18','2020-5-19','2020-5-20','2020-5-21','2020-5-22','2020-5-23','2020-5-24','2020-5-25','2020-5-26','2020-5-27','2020-5-28','2020-5-1','2020-5-2','2020-5-3','2020-5-4','2020-5-5','2020-5-6','2020-5-7','2020-5-8','2020-5-9','2020-5-10','2020-5-11','2020-5-12','2020-5-13','2020-5-14','2020-5-15','2020-5-16','2020-5-17','2020-5-18','2020-5-19','2020-5-20','2020-5-21','2020-5-22','2020-5-23','2020-5-24','2020-5-25','2020-5-26','2020-5-27','2020-5-28',],'unit':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'sold':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28]})
df['dt'] = pd.to_datetime(df['dt'])
          dt  unit  sold
0 2020-05-01     1     1
1 2020-05-02     1     2
2 2020-05-03     1     3
3 2020-05-04     1     4
4 2020-05-05     1     5
5 2020-05-06     1     6
...
How would I go about this? I've seen: Pandas new column from groupby averages
That explains how to just do a group by on the columns. I figure I could do a "day of week" column, but then I still have the same problem of wanting to limit to the past 3 matching day of week values instead of just all of the results.
It could possibly have something to do with this, but this looks more like it's useful for one-off analysis than making a new column: limit amount of rows as result of groupby Pandas
CodePudding user response:
This should work:
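# Day-of-week number (Monday=0), used to match rows on the same weekday
df['dayofweek'] = df['dt'].dt.dayofweek
# Mean of the last three earlier rows with the same unit and weekday (assumes rows are in date order within each unit)
df['prior_3_avg'] = df.apply(lambda x: df['sold'][(df['unit'] == x['unit']) & (df['dt'] < x['dt']) & (df['dayofweek'] == x['dayofweek'])].tail(3).mean(), axis=1)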
CodePudding user response:
First, create a new column with the day of the week:
import pandas as pd
date = pd.date_range('2018-12-30', '2019-01-07',
                     freq='D').to_series()
date.dt.dayofweek
That will give you the day-of-week number for each date (Monday is 0). After that, you just need to filter by month and sort the values.
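One way to flesh out this idea, as a sketch rather than the answerer's own code: add the day-of-week number as a column, sort by date, and within each unit and weekday average the three previous rows. The column name dow and the programmatically built toy frame are assumptions made for illustration.

```python
import pandas as pd

# Toy frame (assumed for illustration): 28 days for each of two units.
dates = pd.date_range('2020-05-01', '2020-05-28', freq='D')
df = pd.DataFrame({
    'dt': list(dates) * 2,
    'unit': [1] * 28 + [2] * 28,
    'sold': list(range(1, 29)) * 2,
})

# Day-of-week number: Monday=0 ... Sunday=6.
df['dow'] = df['dt'].dt.dayofweek

# Sort by date so "previous rows" means "earlier dates" within each group.
df = df.sort_values(['unit', 'dt'])

# shift(1) excludes the current day; rolling(3) then averages the three
# previous same-weekday values (NaN until three of them exist).
df['prior_3_avg'] = (
    df.groupby(['unit', 'dow'])['sold']
      .transform(lambda s: s.shift(1).rolling(3).mean())
)

print(df[df['dt'] == '2020-05-22'])
```

Because transform keeps the original row index, the new column lines up with df directly and no join is needed.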
CodePudding user response:
Here is one idea: First group by unit, then group each unit by weekdays and get the rolling average for n weeks (with closed='left', the last n weeks excluding the current one are used for calculation, which seems to be what you want)...
n = 3
result = (df.groupby('unit')
            .apply(lambda f: (f['sold']
                              .groupby(f.dt.dt.day_name())
                              .rolling(n, closed='left')
                              .mean()
                              )
                   )
          )
...which results in this series:
unit  dt
1     Friday  0      NaN
              7      NaN
              14     NaN
              21     8.0
      Monday  3      NaN
              10     NaN
              17     NaN
              24    11.0
...
2     Friday  28     NaN
              35     NaN
              42     NaN
              49     8.0
      Monday  31     NaN
              38     NaN
              45     NaN
              52    11.0
...
Name: sold, dtype: float64
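To see the effect of closed='left' in isolation, here is a small sketch using the four Friday values of one unit from the toy data. The variable names are made up for illustration, and it assumes a pandas version that accepts closed with a fixed-size window (1.2 or later, as far as I know).

```python
import pandas as pd

# Four consecutive Friday values for one unit, taken from the toy data.
fridays = pd.Series([1, 8, 15, 22])

# Default window: the current row is included in its own average.
including = fridays.rolling(3).mean()

# closed='left': the current row is excluded, so row 3 averages rows 0-2.
excluding = fridays.rolling(3, closed='left').mean()

print(including.tolist())  # [nan, nan, 8.0, 15.0]
print(excluding.tolist())  # [nan, nan, nan, 8.0]
```

The second result is why the first three occurrences of each weekday are NaN: three earlier values are needed before an average is produced.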
Next, get rid of the unit and weekday (dt) index levels; we don't need them. That leaves only the original row index.
Also, rename the series for easier joining.
result = result.reset_index(level=[0, 1], drop=True)
result = result.rename('prior_3_avg')
Back to the mothership...
df2 = df.join(result)
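The join works because, once the two outer index levels are dropped, the series is indexed by the original row numbers. A minimal sketch of that alignment on a hand-built stand-in (frame, idx, and avg are hypothetical names, not part of the answer above):

```python
import pandas as pd

# Small stand-in frame with the default 0..3 row index.
frame = pd.DataFrame({'sold': [1, 8, 15, 22]})

# A series shaped like the groupby result: (unit, weekday, original row index).
idx = pd.MultiIndex.from_tuples(
    [(1, 'Friday', 0), (1, 'Friday', 1), (1, 'Friday', 2), (1, 'Friday', 3)],
    names=['unit', 'dt', None],
)
avg = pd.Series([float('nan'), float('nan'), float('nan'), 8.0],
                index=idx, name='sold')

# Dropping the first two levels leaves only the original row index...
avg = avg.reset_index(level=[0, 1], drop=True).rename('prior_3_avg')

# ...so join can line the values up with the frame row by row.
joined = frame.join(avg)
print(joined)
```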
Part of the final result in df2:
           dt  unit  sold  prior_3_avg
...  # first 21 are NaN
21 2020-05-22     1    22          8.0
22 2020-05-23     1    23          9.0
23 2020-05-24     1    24         10.0
24 2020-05-25     1    25         11.0
25 2020-05-26     1    26         12.0
26 2020-05-27     1    27         13.0
27 2020-05-28     1    28         14.0
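As a sanity check on individual values, a single row can be recomputed by brute force from calendar dates. This is only a sketch: prior_avg is a hypothetical helper, and the toy frame is rebuilt programmatically here. It looks up the dates exactly 1 to n weeks earlier, which matches the rolling result only when no dates are missing from the data.

```python
import pandas as pd

# Toy frame (assumed for illustration): 28 days for each of two units.
dates = pd.date_range('2020-05-01', '2020-05-28', freq='D')
df = pd.DataFrame({
    'dt': list(dates) * 2,
    'unit': [1] * 28 + [2] * 28,
    'sold': list(range(1, 29)) * 2,
})

def prior_avg(frame, unit, day, n=3):
    """Mean of `sold` on the n dates exactly 1..n weeks before `day`."""
    day = pd.Timestamp(day)
    wanted = [day - pd.Timedelta(weeks=k) for k in range(1, n + 1)]
    rows = frame[(frame['unit'] == unit) & frame['dt'].isin(wanted)]
    # Mirror the rolling behaviour: no value unless all n weeks are present.
    return rows['sold'].mean() if len(rows) == n else float('nan')

print(prior_avg(df, 1, '2020-05-22'))  # mean of 1, 8, 15 -> 8.0
print(prior_avg(df, 1, '2020-05-28'))  # mean of 7, 14, 21 -> 14.0
```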
