Home > database >  Pandas: Calculate diff column grouped by date and additional column
Pandas: Calculate diff column grouped by date and additional column

Time:01-12

I have Pandas DataFrame with 3 columns:

df = pd.DataFrame({'product__sku': [1, 1, 1, 1, 2, 2],
                   'date': ['2021-10-01 20:48:12 00:00','2021-10-31 20:48:26 00:00',
                            '2021-09-01 20:48:12 00:00','2021-09-30 20:48:26 00:00',
                            '2021-10-01 12:23:17 00:00','2021-10-31 12:23:17 00:00'],
                   'qty': [100, 84, 5, 10, 15, 48]})

which looks like:

|product__sku | date                      |  qty |
|1            | 2021-10-01 20:48:12 00:00 |  100 |
|1            | 2021-10-31 20:48:26 00:00 |  84  |
|1            | 2021-09-01 20:48:12 00:00 |  5   |
|1            | 2021-09-30 20:48:26 00:00 |  10  |
|2            | 2021-10-01 12:23:17 00:00 |  15  |
|2            | 2021-10-31 12:23:17 00:00 |  48  |

I need to group by two columns date (month period) and product__sku. During group_by I column 'qty' should be subtracted(diff) by formula max_date qty - min_date qty

In result I expect to see

|product__sku | date                      | diff |
|1            | 2021-09-30 20:48:12 00:00 |  5   |
|1            | 2021-10-31 20:48:12 00:00 |  -16 |
|2            | 2021-10-31 20:48:26 00:00 |  33  |

I tried to use grouper

        dg = df.groupby([ pd.Grouper('product__sku'), pd.Grouper(key='date', freq='1M')])['qty'].diff().fillna(0)

But got different result:

|0     0.0
| 1   -16.0
| 2     0.0
Name: qty, dtype: float64

CodePudding user response:

First groupby product__sku and month. Then define a custom function that finds the qty difference between the max and min dates in each group and apply it to each group:

def func(x):
    dates = x['date'].sort_values()
    diff = x.loc[dates.index[-1], 'qty'] - x.loc[dates.index[0], 'qty']
    x = x[x['date']==dates.iloc[-1]]
    x['diff'] = diff
    return x[['product__sku','date','diff']]
    

df['date'] = pd.to_datetime(df['date'])
df = df.assign(month=df['date'].dt.month).groupby(['product__sku','month']).apply(func).reset_index(drop=True)

Output:

   product__sku                      date  diff
0             1 2021-09-30 20:48:26 00:00     5
1             1 2021-10-31 20:48:26 00:00   -16
2             2 2021-10-31 12:23:17 00:00    33

CodePudding user response:

Use GroupBy.agg with first and last in sorted DataFrame, so get values for minimal and maximal dates, last subtract values with DataFrame.pop for remove columns first, last:

If need last dates per groups use named aggregation also for date column:

df['date'] = pd.to_datetime(df['date'])

dg = (df.sort_values(['product__sku','date'])
        .groupby(['product__sku', pd.Grouper(key='date', freq='1M')])
        .agg(first=('qty','first'),last=('qty','last'), date=('date', 'first'))
        .reset_index(level=-1, drop=True)
        .reset_index()
        )
dg['diff'] = dg.pop('last').sub(dg.pop('first'))
print (dg)
   product__sku                      date  diff
0             1 2021-09-01 20:48:12 00:00     5
1             1 2021-10-01 20:48:12 00:00   -16
2             2 2021-10-01 12:23:17 00:00    33
  •  Tags:  
  • Related