I have a data frame where I need to pair consecutive events within the same day and subtract the later from the former. Each has a timestamp and a date.
time date event score
0 2022-03-07 06:45:00 00:00 2022-03-07 light 80.066667
1 2022-03-07 18:12:00 00:00 2022-03-07 dark 79.857667
2 2022-03-30 06:25:00 00:00 2022-03-30 light 107.060833
3 2022-03-30 13:38:00 00:00 2022-03-30 dark 105.324000
4 2022-03-30 13:40:00 00:00 2022-03-30 dark 105.239750
5 2022-03-30 15:47:00 00:00 2022-03-30 light 106.863143
6 2022-04-01 06:25:00 00:00 2022-04-01 light 101.271867
I have tried spreading the data frame using
df = df.pivot(index='time', columns='event', values='score')
event light dark
time
2022-03-07 06:45:00 00:00 80.066667 NaN
2022-03-07 18:12:00 00:00 NaN 79.857667
2022-03-30 06:25:00 00:00 107.060833 NaN
2022-03-30 13:38:00 00:00 NaN 105.324000
2022-03-30 13:40:00 00:00 NaN 105.239750
2022-03-30 15:47:00 00:00 106.863143 NaN
2022-04-01 06:25:00 00:00 101.271867 NaN
however because the events happen at different times the spread data frame has NaNs. I would ideally end up with this, where I keep the time of the first occurrence in the pair (light or dark), align the events (note: corresponding dark to match light as not yet occurred for 2022-04-01) and when light comes first I subtract the later value from the former and when dark comes first I subtract the former value from the later.
event light dark diff
time
2022-03-07 06:45:00 00:00 80.066667 79.857667 -0.208999
2022-03-30 06:25:00 00:00 107.060833 105.324000 -1.7368
2022-03-30 13:40:00 00:00 106.863143 105.239750 -1.6233
2022-04-01 06:25:00 00:00 101.271867 NaN NaN
CodePudding user response:
Here's one way using groupby cumcount to create groups, then use that groups in groupby first to get the first time each event happens each day. Then pivot.
Finally, use diff to get the difference between "light" and "dark" and assign the differences to column "diff" in df:
out = (df.assign(time=df.groupby(df.groupby('event').cumcount())['time'].transform('first'))
.pivot('time', 'event', 'score').reset_index().rename_axis([None], axis=1)
.assign(diff=lambda x: x['dark']-x['light']))
Output:
time dark light diff
0 2022-03-07 06:45:00 00:00 79.857667 80.066667 -0.209000
1 2022-03-30 06:25:00 00:00 105.324000 107.060833 -1.736833
2 2022-03-30 13:40:00 00:00 105.239750 106.863143 -1.623393
3 2022-04-01 06:25:00 00:00 NaN 101.271867 NaN
