I have a Pandas DataFrame that looks something like this:
activity_id start end type ... site heart_rate grp_idx date
user_id ...
72xyy89c74cc57178e02f103187ad579 dcb12345678b5c8e84cf2931b1a553cb 2015-09-12 00:40:33.171000 00:00 2015-09-12 00:53:33.171000 00:00 run ... data\dcb12345678b5c8e84cf2931b1a553cb.json NaN 2015-37 2015-09-12
The dtypes are:
activity_id object
start datetime64[ns, UTC]
end datetime64[ns, UTC]
type object
distance float64
steps float64
speed float64
pace float64
calories float64
ascent float64
descent float64
site object
heart_rate float64
grp_idx object
date object
I need to determine if there are x (e.g. 4) consecutive days in a row and find the number of times that has occurred. For example:
2015-09-12
2015-09-13
2015-09-14
2015-09-15
2015-09-16
2015-09-17
2015-09-18
2015-09-19
Would count as two.
I have tried using groupby, e.g.:
s = repeat_runner_df.groupby('user_id').start.diff().dt.days.fillna(1).cumsum()
repeat_runner_df.groupby(['user_id', s]).filter(lambda x: len(x) < 3)
print(repeat_runner_df)
but that has gotten me nowhere and neither has my Google skills. Any help would be greatly appreciated.
CodePudding user response:
Try this:
count = df.groupby(df.start.diff().ne(pd.Timedelta(days=1)).cumsum()).apply(len).ge(4).sum()
Output:
>>> count
3
CodePudding user response:
I think this is what youre looking for:
df.loc[df.groupby(['user_id',df.groupby('user_id')['start'].transform(lambda x: x.diff().dt.days.ne(1).cumsum())])['user_id'].transform('count').ge(4)]
