I have a df like this:
| date | name | amount |
|---|---|---|
| 2021-07-01 | 'Chlorox' | 1 |
| 2021-07-14 | 'Chlorox' | 20 |
| 2021-07-29 | 'Chlorox' | 700 |
| 2021-08-11 | 'Chlorox' | 6000 |
| 2021-08-12 | 'Suriname' | 3 |
| 2021-08-19 | 'Suriname' | 10 |
and I'd like the sum of amounts within a one month period, plus all the dates that fall within that month range. So something like these results:
| name | sum | dates |
|---|---|---|
| 'Chlorox' | 721 | ['2021-07-01', '2021-07-14', '2021-07-29'] |
| 'Chlorox' | 6720 | '2021-07-14', '2021-07-29', '2021-08-11' |
| 'Suriname' | 13 | ['2021-08-12', '2021-08-19'] |
I've been tinkering around with rolling() and groupby, but I've been struggling and unable to get overlapping dates!
CodePudding user response:
Use pd.to_datetime, Series.dt.to_period with Groupby.agg:
In [874]: df['date'] = pd.to_datetime(df['date']) # Convert date column to pandas datetime
In [923]: res = df.groupby(['name', df['date'].dt.to_period('M')], as_index=False).agg({'amount': sum, 'date': lambda x: list(x.dt.date)})
In [924]: res
Out[924]:
name amount date
0 'Chlorox' 721 [2021-07-01, 2021-07-14, 2021-07-29]
1 'Chlorox' 6000 [2021-08-11]
2 'Suriname' 13 [2021-08-12, 2021-08-19]
CodePudding user response:
I think the objective is to group by the columns based on the month and name. So the result dataframe will look like this -
date amount
name Month
Chlorox 7 2021-07-01, 2021-07-14, 2021-07-29 721
8 2021-08-11 6000
Suriname 8 2021-08-12, 2021-08-19 13
If that is the case - then here is a solution.
>>> import numpy as np
>>> import pandas as pd
>>> d = {
'date' : ['2021-07-01', '2021-07-14', '2021-07-29', '2021-08-11', '2021-08-12', '2021-08-19', '2020-08-11', '2020-08-12', '2020-08-19'],
'name' : ['Chlorox', 'Chlorox', 'Chlorox', 'Chlorox', 'Suriname', 'Suriname','Chlorox', 'Suriname', 'Suriname'],
'amount': [1,20,700,6000,3,10,6500,300,150]
}
>>> df = pd.DataFrame(d)
We will add a "Datetime" (dtype=datetime) column and a month (dtype=int) and a year column
>>> df['Datetime'] = pd.to_datetime( df.date)
>>> df['year'] = pd.DatetimeIndex(df['Datetime']).year
>>> df['month'] = pd.DatetimeIndex(df['Datetime']).month
>>> print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 9 non-null object
1 name 9 non-null object
2 amount 9 non-null int64
3 Datetime 9 non-null datetime64[ns]
4 month 9 non-null int64
5 year 9 non-null int64
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 560.0 bytes
None
>>> print(df)
date name amount Datetime month year
0 2021-07-01 Chlorox 1 2021-07-01 7 2021
1 2021-07-14 Chlorox 20 2021-07-14 7 2021
2 2021-07-29 Chlorox 700 2021-07-29 7 2021
3 2021-08-11 Chlorox 6000 2021-08-11 8 2021
4 2021-08-12 Suriname 3 2021-08-12 8 2021
5 2021-08-19 Suriname 10 2021-08-19 8 2021
6 2020-08-11 Chlorox 6500 2020-08-11 8 2020
7 2020-08-12 Suriname 300 2020-08-12 8 2020
8 2020-08-19 Suriname 150 2020-08-19 8 2020
>>> # we will have this aggregate function to deal with concatenating string values together, and, for numerical additions we will use numpy.sum()
>>> def f1(s):
return ', '.join(s)
>>> df_result = df.groupby(['name', 'year', 'month']).agg({'date': f1, 'amount': np.sum})
>>> print(df_result)
date amount
name year month
Chlorox 2020 8 2020-08-11 6500
2021 7 2021-07-01, 2021-07-14, 2021-07-29 721
8 2021-08-11 6000
Suriname 2020 8 2020-08-12, 2020-08-19 450
2021 8 2021-08-12, 2021-08-19 13
See if this helps or leave a comment
