I have the following DataFrame:
Worker  dt_diff          same_employer  same_role
1754    0 days 00:00:00  False          False
2951    0 days 00:00:00  False          False
2951    1 days 00:00:00  True           True
2951    1 days 01:00:00  True           True
3368    0 days 00:00:00  False          False
3368    7 days 00:00:00  True           True
3368    7 days 00:00:00  True           True
3368    7 days 00:00:00  True           True
3368    7 days 00:00:00  True           True
3368    7 days 00:00:00  True           True
3539    0 days 00:00:00  False          False
3539    1 days 00:00:00  True           True
3539    1 days 00:00:00  True           True
3539    3 days 00:30:00  False          False
3539    1 days 00:00:00  True           True
3539    2 days 06:00:00  False          True
I would like to create a new column containing a continuity counter grouped by worker. The counter should reset under the following condition:
if (dt_diff > 6 days) or (same_employer == False) or (same_role == False) then reset the counter
So for the above DataFrame I would expect the result below:
Worker Counter
1754 1
2951 3
3368 1
3539 3
CodePudding user response:
Your description is not fully explicit, but IIUC, you want the length of the last continuous run for each worker.
For this you can use a boolean mask and groupby. Apply cummin to the reversed boolean series so that only the rows after the last False stay True, sum them, and add 1 to count the row that started the run.
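# True where a row continues the current run
s = df['dt_diff'].lt('6d') & (df['same_employer'] | df['same_role'])
# per worker: count trailing True rows, plus 1 for the row that started the run
out = s.groupby(df['Worker']).apply(lambda x: x[::-1].cummin().sum() + 1)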
Output:
Worker
1754 1
2951 3
3368 1
3539 3
dtype: int64
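For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame by hand from the question's table (the construction code and the helper name trailing_streak are mine, not the asker's) and applies the same mask. As a conservative choice I compare against pd.Timedelta(days=6) instead of the '6d' string and cast the mask to int before cummin; neither should change the result on this data.

```python
import pandas as pd

# Sample data copied by hand from the table in the question.
rows = [
    (1754, "0 days 00:00:00", False, False),
    (2951, "0 days 00:00:00", False, False),
    (2951, "1 days 00:00:00", True, True),
    (2951, "1 days 01:00:00", True, True),
    (3368, "0 days 00:00:00", False, False),
    (3368, "7 days 00:00:00", True, True),
    (3368, "7 days 00:00:00", True, True),
    (3368, "7 days 00:00:00", True, True),
    (3368, "7 days 00:00:00", True, True),
    (3368, "7 days 00:00:00", True, True),
    (3539, "0 days 00:00:00", False, False),
    (3539, "1 days 00:00:00", True, True),
    (3539, "1 days 00:00:00", True, True),
    (3539, "3 days 00:30:00", False, False),
    (3539, "1 days 00:00:00", True, True),
    (3539, "2 days 06:00:00", False, True),
]
df = pd.DataFrame(rows, columns=["Worker", "dt_diff", "same_employer", "same_role"])
df["dt_diff"] = pd.to_timedelta(df["dt_diff"])

# True where a row continues the current run (same logic as the mask above).
s = df["dt_diff"].lt(pd.Timedelta(days=6)) & (df["same_employer"] | df["same_role"])


def trailing_streak(x):
    # Reverse the group, then cummin keeps 1s only until the first 0 is hit.
    # The +1 counts the row that started the run.
    return int(x[::-1].astype(int).cummin().sum()) + 1


out = s.groupby(df["Worker"]).apply(trailing_streak)
print(out)
```

Note that the mask joins same_employer and same_role with |, so a row only breaks the run when both are False. That is what makes worker 3539 come out as 3.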
CodePudding user response:
I would expect the counter for worker 3539 to be 1, because its last row (same_employer == False) should have reset it.
Your condition:
s = ~((df['dt_diff'].dt.days > 6) | (df['same_employer'] == False) | (df['same_role'] == False))
The key is to count backwards from the last row until the first row that fails your condition. We can build a mask for that with:
y = s[::-1].groupby(df['Worker']).cumprod()
Then we sum over the mask and add 1 at the end:
print(y.groupby(df['Worker']).sum() + 1)
Worker
1754 1
2951 3
3368 1
3539 1
dtype: int64
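To make it concrete why worker 3539 can come out as either 3 or 1, here is a small sketch using only that worker's rows (copied by hand from the question). The names within, loose, strict and trailing_streak are my own, for illustration. The only difference between the two masks is whether same_employer and same_role are combined with | or with &. One thing to be aware of: .dt.days truncates to whole days, so 6 days 12:00:00 would not count as more than 6 days in the condition above. The sketch compares full timedeltas instead, which is an assumption about what the asker intends.

```python
import pandas as pd

# Rows for worker 3539 only, copied by hand from the question.
df = pd.DataFrame({
    "dt_diff": pd.to_timedelta([
        "0 days", "1 days", "1 days",
        "3 days 00:30:00", "1 days", "2 days 06:00:00",
    ]),
    "same_employer": [False, True, True, False, True, False],
    "same_role": [False, True, True, False, True, True],
})

# Assumption: a gap of more than 6 days (full timedelta) resets the counter.
within = df["dt_diff"] <= pd.Timedelta(days=6)

# Loose reading: a row breaks the run only if BOTH flags are False.
loose = within & (df["same_employer"] | df["same_role"])
# Strict reading (the condition as written): EITHER flag being False resets.
strict = within & df["same_employer"] & df["same_role"]


def trailing_streak(mask):
    # Count consecutive True values from the end, plus 1 for the reset row.
    return int(mask[::-1].astype(int).cumprod().sum()) + 1


print(trailing_streak(loose))   # 3
print(trailing_streak(strict))  # 1
```

So which answer is right depends on whether the last row of 3539 (same_employer False, same_role True) is meant to reset the counter.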
