Output missing dates by group of columns-CodePudding

I have time series of y per store and product stored in the following dataframe:

            ds store  product y
0   2016-01-01    a    salt   2
1   2016-01-02    a    salt   5
2   2016-01-04    a    salt   3
3   2016-01-05    a    salt   3
4   2016-01-06    a    salt   4
5   2016-01-07    a    salt   3
6   2016-01-01    b    pepper 2
7   2016-01-02    b    pepper 2
8   2016-01-03    b    pepper 1
9   2016-01-04    b    pepper 2
10  2016-01-06    b    pepper 4
11  2016-01-07    b    pepper 2

I would like to output all the missing dates per store, product and return the following result:

    ds store  product
0   2016-01-03    a    salt
1   2016-01-05    b    pepper

CodePudding user response：

Use groupby_resample:

# Assuming ds is datetime64 else use:
# df['ds'] = pd.to_datetime(df['ds'])
out = df.groupby(['store', 'product']).resample('D', on='ds')['y'] \
        .first().loc[lambda x: x.isna()].index.to_frame(index=False)
print(out)

# Output
  store product         ds
0     a    salt 2016-01-03
1     b  pepper 2016-01-05

Details:

>>> df.groupby(['store', 'product']).resample('D', on='ds')['y'].first()
store  product  ds        
a      salt     2016-01-01    2.0
                2016-01-02    5.0
                2016-01-03    NaN  # <- missing value == missing date
                2016-01-04    3.0
                2016-01-05    3.0
                2016-01-06    4.0
                2016-01-07    3.0
b      pepper   2016-01-01    2.0
                2016-01-02    2.0
                2016-01-03    1.0
                2016-01-04    2.0
                2016-01-05    NaN  # <- missing value == missing date
                2016-01-06    4.0
                2016-01-07    2.0
Name: y, dtype: float64

Update: If you have a date in the ds column without a value in the y column, just use fillna({'y': 0}) before groupby_resample