Home > Mobile >  How to pre-filter to a minimum number of rows per group before resampling without double-resampling?
How to pre-filter to a minimum number of rows per group before resampling without double-resampling?

Time:01-13

This is a mock of my actual use-case, suppose I have the following:

import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.arange(1, 32), 
                  columns=['a'], 
                  index=pd.date_range('2021-01-01', '2021-01-31'))
  

I'd like to pre-filter resampling groups to a minimum number of X rows before applying a function. I could find the number of rows per resampling group as follows:

df.resample('W').apply(lambda x: x.shape[0])
>              a
> 2021-01-03   3
> 2021-01-10   7
> 2021-01-17   7
> 2021-01-24   7
> 2021-01-31   7

In that example above I would like to run my resampling excluding or filtering out the subgroup corresponding to date 2021-01-03 because it only has 3 rows, periods or days.

The desired output is, for example, taking mean, it should exclude the case of 2021-01-03 BUT before applying the actual function, not after:

df.resample('W').filter_here_to_min_7_rows().apply(lambda x: x.mean())
>                 a
> 2021-01-10      7
> 2021-01-17      14
> 2021-01-24      21
> 2021-01-31      28

UPDATE More context, why do I need this? in my real use-case the apply function is much more complex and if the minimum number of rows is not satisfied, I need to handle a border case providing an empty result which impacts the aggregated resampling type result. Therefore I'd prefer to have the apply function getting correct input groups.

CodePudding user response:

You can use loc with a callable to filter the aggregated rows:

df.resample('W')['a'].agg(['count', 'mean']).loc[lambda df: df['count'] >= 7, 'mean']

CodePudding user response:

You can also reuse Resampler/GroupBy objects:

g = df.resample('W')['a']
g.mean()[g.count().ge(7)]

output:

2021-01-10     7.0
2021-01-17    14.0
2021-01-24    21.0
2021-01-31    28.0
Freq: W-SUN, Name: a, dtype: float64
only computing the aggregation function when the group matches a criterion

If your goal is to avoid computation of the mean (or any other potentially computation expensive aggregation) in case the group does not match a condition, you could use apply:

df.resample('W')['a'].apply(lambda x: x.mean() if x.count()>=7 else None).dropna()

NB. In case of mean on this dataset this will likely be slower, especially since here you expect small groups (max. 7 days), but this might be better for other aggregation functions or very large groups.

CodePudding user response:

Another option, using pipe:

df.resample(on='W').a.pipe(lambda g: g.mean()[g.size() >= 7])
 
2021-01-10     7.0
2021-01-17    14.0
2021-01-24    21.0
2021-01-31    28.0
Freq: W-SUN, Name: a, dtype: float64
  •  Tags:  
  • Related