I have the pandas dataframe below:
| groupId | date | value |
|---|---|---|
| 1 | 2023-01-01 | A |
| 1 | 2023-01-05 | B |
| 1 | 2023-01-17 | C |
| 2 | 2023-01-01 | A |
| 2 | 2023-01-20 | B |
| 3 | 2023-01-01 | A |
| 3 | 2023-01-10 | B |
| 3 | 2023-01-12 | C |
I would like to do a groupby and count the number of unique values for each groupId but only looking at the last n=14 days, relatively to the date of the row.
What I would like as a result is something like this:
| groupId | date | value | newColumn |
|---|---|---|---|
| 1 | 2023-01-01 | A | 1 |
| 1 | 2023-01-05 | B | 2 |
| 1 | 2023-01-17 | C | 2 |
| 2 | 2023-01-01 | A | 1 |
| 2 | 2023-01-20 | B | 1 |
| 3 | 2023-01-01 | A | 1 |
| 3 | 2023-01-10 | B | 2 |
| 3 | 2023-01-12 | C | 3 |
I did try using a groupby(...).rolling('14d').nunique() and while the rolling function works on numeric fields to count and compute the mean, etc ... it doesn't work when used with nunique on string fields to count the number of unique string/object values.
You can use the code below to generate the dataframe.
pd.DataFrame(
{
'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
'date': ['2023-01-01', '2023-01-05', '2023-01-17', '2023-01-01', '2023-01-20', '2023-01-01', '2023-01-10', '2023-01-12'], #YYYY-MM-DD
'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
'newColumn': [1, 2, 2, 1, 1, 1, 2, 3]
}
)
Do you have an idea on how to solve this, even if not using the rolling function? That'd be much appreciated!
CodePudding user response:
Instead of nunique, you can also use count:
>>> (df.groupby('groupId').rolling('14D', on='date')['value'].count()
.astype(int).rename('newColumn').reset_index())
groupId date newColumn
0 1 2023-01-01 1
1 1 2023-01-05 2
2 1 2023-01-17 2
3 2 2023-01-01 1
4 2 2023-01-20 1
5 3 2023-01-01 1
6 3 2023-01-10 2
7 3 2023-01-12 3
Caveats: it can be complicated to merge this output with your original dataframe except if (groupId, date) is a unique combination.
Update
If your index is numeric (or create a dummy column monotonic increasing), you can use this trick:
sr = (df.reset_index().groupby('groupId').rolling('14D', on='date')
.agg({'value': 'count', 'index': 'max'}).astype(int)
.set_index('index')['value'])
df['newColumn'] = sr
print(df)
# Output
groupId date value newColumn
0 1 2023-01-01 A 1
1 1 2023-01-05 B 2
2 1 2023-01-17 C 2
3 2 2023-01-01 A 1
4 2 2023-01-20 B 1
5 3 2023-01-01 A 1
6 3 2023-01-10 B 2
7 3 2023-01-12 C 3
Update 2
You can use pd.factorize to convert value column as numeric column:
>>> (df.assign(value=pd.factorize(df['value'])[0])
.groupby('groupId').rolling('14D', on='date')['value']
.apply(lambda x: x.nunique())
.astype(int).rename('newColumn').reset_index())
groupId date newColumn
0 1 2023-01-01 1
1 1 2023-01-05 2
2 1 2023-01-17 2
3 2 2023-01-01 1
4 2 2023-01-20 1
5 3 2023-01-01 1
6 3 2023-01-10 2
7 3 2023-01-12 3
CodePudding user response:
Another possible solution, which does not use rolling:
df['date'] = pd.to_datetime(df['date'])
df['new2'] = df.groupby('groupId')['date'].transform(
lambda x: x.diff().dt.days.cumsum().le(14).mul(~x.duplicated()).cumsum() 1)
Output:
groupId date value new2
0 1 2023-01-01 A 1
1 1 2023-01-05 B 2
2 1 2023-01-17 C 2
3 2 2023-01-01 A 1
4 2 2023-01-20 B 1
5 3 2023-01-01 A 1
6 3 2023-01-10 B 2
7 3 2023-01-12 C 3
