This is a mock of my actual use-case, suppose I have the following:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.arange(1, 32),
columns=['a'],
index=pd.date_range('2021-01-01', '2021-01-31'))
I'd like to pre-filter resampling groups to a minimum number of X rows before applying a function. I could find the number of rows per resampling group as follows:
df.resample('W').apply(lambda x: x.shape[0])
> a
> 2021-01-03 3
> 2021-01-10 7
> 2021-01-17 7
> 2021-01-24 7
> 2021-01-31 7
In that example above I would like to run my resampling excluding or filtering out the subgroup corresponding to date 2021-01-03 because it only has 3 rows, periods or days.
The desired output is, for example, taking mean, it should exclude the case of 2021-01-03 BUT before applying the actual function, not after:
df.resample('W').filter_here_to_min_7_rows().apply(lambda x: x.mean())
> a
> 2021-01-10 7
> 2021-01-17 14
> 2021-01-24 21
> 2021-01-31 28
UPDATE More context, why do I need this? in my real use-case the apply function is much more complex and if the minimum number of rows is not satisfied, I need to handle a border case providing an empty result which impacts the aggregated resampling type result. Therefore I'd prefer to have the apply function getting correct input groups.
CodePudding user response:
You can use loc with a callable to filter the aggregated rows:
df.resample('W')['a'].agg(['count', 'mean']).loc[lambda df: df['count'] >= 7, 'mean']
CodePudding user response:
You can also reuse Resampler/GroupBy objects:
g = df.resample('W')['a']
g.mean()[g.count().ge(7)]
output:
2021-01-10 7.0
2021-01-17 14.0
2021-01-24 21.0
2021-01-31 28.0
Freq: W-SUN, Name: a, dtype: float64
only computing the aggregation function when the group matches a criterion
If your goal is to avoid computation of the mean (or any other potentially computation expensive aggregation) in case the group does not match a condition, you could use apply:
df.resample('W')['a'].apply(lambda x: x.mean() if x.count()>=7 else None).dropna()
NB. In case of mean on this dataset this will likely be slower, especially since here you expect small groups (max. 7 days), but this might be better for other aggregation functions or very large groups.
CodePudding user response:
Another option, using pipe:
df.resample(on='W').a.pipe(lambda g: g.mean()[g.size() >= 7])
2021-01-10 7.0
2021-01-17 14.0
2021-01-24 21.0
2021-01-31 28.0
Freq: W-SUN, Name: a, dtype: float64
