How to pre-filter to a minimum number of rows per group before resampling without double-resampling?-CodePudding

This is a mock of my actual use-case, suppose I have the following:

import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.arange(1, 32), 
                  columns=['a'], 
                  index=pd.date_range('2021-01-01', '2021-01-31'))

I'd like to pre-filter resampling groups to a minimum number of X rows before applying a function. I could find the number of rows per resampling group as follows:

df.resample('W').apply(lambda x: x.shape[0])
>              a
> 2021-01-03   3
> 2021-01-10   7
> 2021-01-17   7
> 2021-01-24   7
> 2021-01-31   7

In that example above I would like to run my resampling excluding or filtering out the subgroup corresponding to date 2021-01-03 because it only has 3 rows, periods or days.

The desired output is, for example, taking mean, it should exclude the case of 2021-01-03 BUT before applying the actual function, not after:

df.resample('W').filter_here_to_min_7_rows().apply(lambda x: x.mean())
>                 a
> 2021-01-10      7
> 2021-01-17      14
> 2021-01-24      21
> 2021-01-31      28

UPDATE More context, why do I need this? in my real use-case the apply function is much more complex and if the minimum number of rows is not satisfied, I need to handle a border case providing an empty result which impacts the aggregated resampling type result. Therefore I'd prefer to have the apply function getting correct input groups.

CodePudding user response：

You can use loc with a callable to filter the aggregated rows:

df.resample('W')['a'].agg(['count', 'mean']).loc[lambda df: df['count'] >= 7, 'mean']

CodePudding user response：

You can also reuse Resampler/GroupBy objects:

g = df.resample('W')['a']
g.mean()[g.count().ge(7)]

output:

2021-01-10     7.0
2021-01-17    14.0
2021-01-24    21.0
2021-01-31    28.0
Freq: W-SUN, Name: a, dtype: float64

only computing the aggregation function when the group matches a criterion

If your goal is to avoid computation of the mean (or any other potentially computation expensive aggregation) in case the group does not match a condition, you could use apply:

df.resample('W')['a'].apply(lambda x: x.mean() if x.count()>=7 else None).dropna()

NB. In case of mean on this dataset this will likely be slower, especially since here you expect small groups (max. 7 days), but this might be better for other aggregation functions or very large groups.

CodePudding user response：

Another option, using pipe:

df.resample(on='W').a.pipe(lambda g: g.mean()[g.size() >= 7])
 
2021-01-10     7.0
2021-01-17    14.0
2021-01-24    21.0
2021-01-31    28.0
Freq: W-SUN, Name: a, dtype: float64