I have Pandas DataFrame with 3 columns:
df = pd.DataFrame({'product__sku': [1, 1, 1, 1, 2, 2],
'date': ['2021-10-01 20:48:12 00:00','2021-10-31 20:48:26 00:00',
'2021-09-01 20:48:12 00:00','2021-09-30 20:48:26 00:00',
'2021-10-01 12:23:17 00:00','2021-10-31 12:23:17 00:00'],
'qty': [100, 84, 5, 10, 15, 48]})
which looks like:
|product__sku | date | qty |
|1 | 2021-10-01 20:48:12 00:00 | 100 |
|1 | 2021-10-31 20:48:26 00:00 | 84 |
|1 | 2021-09-01 20:48:12 00:00 | 5 |
|1 | 2021-09-30 20:48:26 00:00 | 10 |
|2 | 2021-10-01 12:23:17 00:00 | 15 |
|2 | 2021-10-31 12:23:17 00:00 | 48 |
I need to group by two columns date (month period) and product__sku. During group_by I column 'qty' should be subtracted(diff) by formula max_date qty - min_date qty
In result I expect to see
|product__sku | date | diff |
|1 | 2021-09-30 20:48:12 00:00 | 5 |
|1 | 2021-10-31 20:48:12 00:00 | -16 |
|2 | 2021-10-31 20:48:26 00:00 | 33 |
I tried to use grouper
dg = df.groupby([ pd.Grouper('product__sku'), pd.Grouper(key='date', freq='1M')])['qty'].diff().fillna(0)
But got different result:
|0 0.0
| 1 -16.0
| 2 0.0
Name: qty, dtype: float64
CodePudding user response:
First groupby product__sku and month. Then define a custom function that finds the qty difference between the max and min dates in each group and apply it to each group:
def func(x):
dates = x['date'].sort_values()
diff = x.loc[dates.index[-1], 'qty'] - x.loc[dates.index[0], 'qty']
x = x[x['date']==dates.iloc[-1]]
x['diff'] = diff
return x[['product__sku','date','diff']]
df['date'] = pd.to_datetime(df['date'])
df = df.assign(month=df['date'].dt.month).groupby(['product__sku','month']).apply(func).reset_index(drop=True)
Output:
product__sku date diff
0 1 2021-09-30 20:48:26 00:00 5
1 1 2021-10-31 20:48:26 00:00 -16
2 2 2021-10-31 12:23:17 00:00 33
CodePudding user response:
Use GroupBy.agg with first and last in sorted DataFrame, so get values for minimal and maximal dates, last subtract values with DataFrame.pop for remove columns first, last:
If need last dates per groups use named aggregation also for date column:
df['date'] = pd.to_datetime(df['date'])
dg = (df.sort_values(['product__sku','date'])
.groupby(['product__sku', pd.Grouper(key='date', freq='1M')])
.agg(first=('qty','first'),last=('qty','last'), date=('date', 'first'))
.reset_index(level=-1, drop=True)
.reset_index()
)
dg['diff'] = dg.pop('last').sub(dg.pop('first'))
print (dg)
product__sku date diff
0 1 2021-09-01 20:48:12 00:00 5
1 1 2021-10-01 20:48:12 00:00 -16
2 2 2021-10-01 12:23:17 00:00 33
