How to count the number of unique values per group over the last n days-CodePudding

I have the pandas dataframe below:

groupId	date	value
1	2023-01-01	A
1	2023-01-05	B
1	2023-01-17	C
2	2023-01-01	A
2	2023-01-20	B
3	2023-01-01	A
3	2023-01-10	B
3	2023-01-12	C

I would like to do a groupby and count the number of unique values for each groupId but only looking at the last n=14 days, relatively to the date of the row.

What I would like as a result is something like this:

groupId	date	value	newColumn
1	2023-01-01	A	1
1	2023-01-05	B	2
1	2023-01-17	C	2
2	2023-01-01	A	1
2	2023-01-20	B	1
3	2023-01-01	A	1
3	2023-01-10	B	2
3	2023-01-12	C	3

I did try using a groupby(...).rolling('14d').nunique() and while the rolling function works on numeric fields to count and compute the mean, etc ... it doesn't work when used with nunique on string fields to count the number of unique string/object values.

You can use the code below to generate the dataframe.

pd.DataFrame(
{
 'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
 'date': ['2023-01-01', '2023-01-05', '2023-01-17', '2023-01-01', '2023-01-20', '2023-01-01', '2023-01-10', '2023-01-12'], #YYYY-MM-DD
 'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
 'newColumn': [1, 2, 2, 1, 1, 1, 2, 3]
}

)

Do you have an idea on how to solve this, even if not using the rolling function? That'd be much appreciated!

CodePudding user response：

Instead of nunique, you can also use count:

>>> (df.groupby('groupId').rolling('14D', on='date')['value'].count()
       .astype(int).rename('newColumn').reset_index())

   groupId       date  newColumn
0        1 2023-01-01          1
1        1 2023-01-05          2
2        1 2023-01-17          2
3        2 2023-01-01          1
4        2 2023-01-20          1
5        3 2023-01-01          1
6        3 2023-01-10          2
7        3 2023-01-12          3

Caveats: it can be complicated to merge this output with your original dataframe except if (groupId, date) is a unique combination.

Update

If your index is numeric (or create a dummy column monotonic increasing), you can use this trick:

sr = (df.reset_index().groupby('groupId').rolling('14D', on='date')
        .agg({'value': 'count', 'index': 'max'}).astype(int)
        .set_index('index')['value'])
df['newColumn'] = sr
print(df)

# Output
   groupId       date value  newColumn
0        1 2023-01-01     A          1
1        1 2023-01-05     B          2
2        1 2023-01-17     C          2
3        2 2023-01-01     A          1
4        2 2023-01-20     B          1
5        3 2023-01-01     A          1
6        3 2023-01-10     B          2
7        3 2023-01-12     C          3

Update 2

You can use pd.factorize to convert value column as numeric column:

>>> (df.assign(value=pd.factorize(df['value'])[0])
       .groupby('groupId').rolling('14D', on='date')['value']
       .apply(lambda x: x.nunique())
       .astype(int).rename('newColumn').reset_index())

   groupId       date  newColumn
0        1 2023-01-01          1
1        1 2023-01-05          2
2        1 2023-01-17          2
3        2 2023-01-01          1
4        2 2023-01-20          1
5        3 2023-01-01          1
6        3 2023-01-10          2
7        3 2023-01-12          3

CodePudding user response：

Another possible solution, which does not use rolling:

df['date'] = pd.to_datetime(df['date'])
df['new2'] = df.groupby('groupId')['date'].transform(
    lambda x: x.diff().dt.days.cumsum().le(14).mul(~x.duplicated()).cumsum() 1)

Output:

   groupId       date value  new2
0        1 2023-01-01     A     1
1        1 2023-01-05     B     2
2        1 2023-01-17     C     2
3        2 2023-01-01     A     1
4        2 2023-01-20     B     1
5        3 2023-01-01     A     1
6        3 2023-01-10     B     2
7        3 2023-01-12     C     3