Let's say we have the following pandas DataFrame in Python:
| worker | shift_start | shift_end | function |
|---|---|---|---|
| Alice | 2022-01-15 11:30:00 | 2022-01-15 15:30:00 | A |
| Alice | 2022-01-15 17:30:00 | 2022-01-15 18:29:59 | A |
| Alice | 2022-01-15 18:30:00 | 2022-01-15 20:30:00 | B |
| Alice | 2022-01-16 10:30:00 | 2022-01-16 11:29:59 | B |
| Alice | 2022-01-16 12:30:00 | 2022-01-16 15:30:00 | A |
| Bob | 2022-01-15 10:30:00 | 2022-01-15 12:29:59 | B |
| Bob | 2022-01-15 12:30:00 | 2022-01-15 14:30:00 | A |
| Bob | 2022-01-15 15:30:00 | 2022-01-15 18:30:00 | A |
| Bob | 2022-01-17 10:30:00 | 2022-01-17 15:30:00 | A |
| Bob | 2022-01-17 16:30:00 | 2022-01-17 18:30:00 | B |
Here `shift_start` and `shift_end` are datetime columns. Each worker has their own schedule, and the function may vary (it is irrelevant here). I would like to compute the break time between days, that is, the time difference between the end of the last shift on one working day and the start of the first shift on the next day they work:
| worker | shift_start | shift_end | function | break_time |
|---|---|---|---|---|
| Alice | 2022-01-15 11:30:00 | 2022-01-15 15:30:00 | A | NaN |
| Alice | 2022-01-15 17:30:00 | 2022-01-15 18:29:59 | A | NaN |
| Alice | 2022-01-15 18:30:00 | 2022-01-15 20:30:00 | B | NaN |
| Alice | 2022-01-16 10:30:00 | 2022-01-16 11:29:59 | B | 0 days 14 hours |
| Alice | 2022-01-16 12:30:00 | 2022-01-16 15:30:00 | A | 0 days 14 hours |
| Bob | 2022-01-15 10:30:00 | 2022-01-15 12:29:59 | B | NaN |
| Bob | 2022-01-15 12:30:00 | 2022-01-15 14:30:00 | A | NaN |
| Bob | 2022-01-15 15:30:00 | 2022-01-15 18:30:00 | A | NaN |
| Bob | 2022-01-17 10:30:00 | 2022-01-17 15:30:00 | A | 1 days 16 hours |
| Bob | 2022-01-17 16:30:00 | 2022-01-17 18:30:00 | B | 1 days 16 hours |
In both cases, January 15th has "NaN" because the previous shift is unknown. For Alice, 14 hours pass between the end of her last shift at 20:30 and the start of her next one at 10:30 the following day. For Bob, however, there is a day off in between, so the difference is 1 day and 16 hours. The break_time value should be repeated for all the entries on a given day. I have been unable to find a way to group by worker and then compute the last and first shift for each day. Any help would be much appreciated.
CodePudding user response:
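I'm not completely sure about the structure of your data, but you could try the following (with `df` being your dataframe):

    # Previous shift's end per worker (NaT for each worker's first shift)
    shifted_end = df.groupby("worker")["shift_end"].shift()
    # Mask: the shift starts on a different date than the previous shift ended
    m = df["shift_start"].dt.date != shifted_end.dt.date
    # Keep the gap at day breaks only, then forward-fill within each worker
    df["break_time"] = (
        (df["shift_start"] - shifted_end)[m].reindex_like(df)
        .groupby(df["worker"]).ffill()
    )
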
- First build a series `shifted_end` with shifted values from the column `shift_end`. This has to be done grouped by `worker` to avoid mixing data from different workers.
- Now build a mask `m` to identify the day breaks by comparing the `.dt.date` versions of `shift_start` and `shifted_end`.
- Then subtract the shifted `shift_end` column from the `shift_start` column and extract only the results at the day breaks `m`. Use `.reindex_like(df)` to get the full index back, with `NaN`s at the recovered indices. Then fill the `NaN`s forward to propagate the day break values. This, again, has to be done grouped by `worker`, otherwise the start of each worker group would get the last value from the worker before.

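To make the three steps easier to follow, here is a self-contained sketch that rebuilds the sample from the question and keeps each intermediate result in its own variable. The `function` column is left out because it plays no role, and the name `gaps` is only an illustrative helper, not something from the original code.

```python
import pandas as pd

# Sample data copied from the question (without the irrelevant `function` column).
df = pd.DataFrame(
    {
        "worker": ["Alice"] * 5 + ["Bob"] * 5,
        "shift_start": pd.to_datetime(
            [
                "2022-01-15 11:30:00", "2022-01-15 17:30:00", "2022-01-15 18:30:00",
                "2022-01-16 10:30:00", "2022-01-16 12:30:00",
                "2022-01-15 10:30:00", "2022-01-15 12:30:00", "2022-01-15 15:30:00",
                "2022-01-17 10:30:00", "2022-01-17 16:30:00",
            ]
        ),
        "shift_end": pd.to_datetime(
            [
                "2022-01-15 15:30:00", "2022-01-15 18:29:59", "2022-01-15 20:30:00",
                "2022-01-16 11:29:59", "2022-01-16 15:30:00",
                "2022-01-15 12:29:59", "2022-01-15 14:30:00", "2022-01-15 18:30:00",
                "2022-01-17 15:30:00", "2022-01-17 18:30:00",
            ]
        ),
    }
)

# Step 1: end of the previous shift of the same worker (NaT for a worker's first shift).
shifted_end = df.groupby("worker")["shift_end"].shift()

# Step 2: mask of shifts that start on a different date than the previous shift ended.
m = df["shift_start"].dt.date != shifted_end.dt.date

# Step 3: keep the gap only where the mask holds, restore the full index
# (missing values elsewhere), then forward-fill within each worker.
gaps = (df["shift_start"] - shifted_end)[m].reindex_like(df)
df["break_time"] = gaps.groupby(df["worker"]).ffill()

print(df)
```

Inspecting `shifted_end`, `m` and `gaps` one by one on your own data is a quick way to check that the day breaks are detected where you expect them.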
Result for your sample is:

       worker          shift_start            shift_end  function       break_time
    0   Alice  2022-01-15 11:30:00  2022-01-15 15:30:00         A              NaT
    1   Alice  2022-01-15 17:30:00  2022-01-15 18:29:59         A              NaT
    2   Alice  2022-01-15 18:30:00  2022-01-15 20:30:00         B              NaT
    3   Alice  2022-01-16 10:30:00  2022-01-16 11:29:59         B  0 days 14:00:00
    4   Alice  2022-01-16 12:30:00  2022-01-16 15:30:00         A  0 days 14:00:00
    5     Bob  2022-01-15 10:30:00  2022-01-15 12:29:59         B              NaT
    6     Bob  2022-01-15 12:30:00  2022-01-15 14:30:00         A              NaT
    7     Bob  2022-01-15 15:30:00  2022-01-15 18:30:00         A              NaT
    8     Bob  2022-01-17 10:30:00  2022-01-17 15:30:00         A  1 days 16:00:00
    9     Bob  2022-01-17 16:30:00  2022-01-17 18:30:00         B  1 days 16:00:00
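
Since the question mentions grouping by worker and taking the first and last shift of each day, here is a sketch of that alternative route. It is a different technique from the one above (a per-day aggregation followed by a merge), and it assumes that no shift crosses midnight, so that a shift's start date identifies its working day. The names `with_day`, `daily`, `first_start`, `last_end` and `result` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question (without the irrelevant `function` column).
df = pd.DataFrame(
    {
        "worker": ["Alice"] * 5 + ["Bob"] * 5,
        "shift_start": pd.to_datetime(
            [
                "2022-01-15 11:30:00", "2022-01-15 17:30:00", "2022-01-15 18:30:00",
                "2022-01-16 10:30:00", "2022-01-16 12:30:00",
                "2022-01-15 10:30:00", "2022-01-15 12:30:00", "2022-01-15 15:30:00",
                "2022-01-17 10:30:00", "2022-01-17 16:30:00",
            ]
        ),
        "shift_end": pd.to_datetime(
            [
                "2022-01-15 15:30:00", "2022-01-15 18:29:59", "2022-01-15 20:30:00",
                "2022-01-16 11:29:59", "2022-01-16 15:30:00",
                "2022-01-15 12:29:59", "2022-01-15 14:30:00", "2022-01-15 18:30:00",
                "2022-01-17 15:30:00", "2022-01-17 18:30:00",
            ]
        ),
    }
)

# Assumption: shifts do not cross midnight, so the start date is the working day.
with_day = df.assign(day=df["shift_start"].dt.normalize())

# One row per worker and day: earliest start and latest end of that day.
daily = with_day.groupby(["worker", "day"], as_index=False).agg(
    first_start=("shift_start", "min"),
    last_end=("shift_end", "max"),
)

# Break = first start of the day minus the last end of the worker's previous working day.
daily["break_time"] = daily["first_start"] - daily.groupby("worker")["last_end"].shift()

# Broadcast the per-day value back onto every shift of that day.
result = with_day.merge(
    daily[["worker", "day", "break_time"]], on=["worker", "day"], how="left"
).drop(columns="day")

print(result)
```

The intermediate `daily` frame can be handy if you also need per-day statistics, while the mask-based version avoids the extra merge.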
