I have two time series contained in a dataframe, similar to this :
events = {'date':['2021-12-31','2021-11-30','2021-10-31','2021-09-30','2021-08-31','2021-07-31','2021-06-30'],
'serie_a':[0,0,1,0,0,0,1], 'serie_b':[0,0,0,1,0,1,0]}
df = pd.DataFrame.from_dict(events)
date serie_a serie_b
2021-12-31 0 0
2021-11-30 0 0
2021-10-31 1 0
2021-09-30 0 1
2021-08-31 0 0
2021-07-31 0 1
2021-06-30 1 0
Series a & b represent the occurence of the same event, but from two different sources. I would like to quantify the lag between the two sources (basically to find if one is faster than the other). Now, finding the elapsed time for one series would be easy, but I can't seem to find the trick to compare both.
I've tried creating a dummy variable which would indicate :
{100:'No events', 110:'Event in serie_a', 111:'Event in serie a & b', ..}
date serie_a serie_b indicator
2021-12-31 0 0 100
2021-11-30 0 0 100
2021-10-31 1 0 110
2021-09-30 0 1 101
2021-08-31 0 0 100
2021-07-31 0 1 101
2021-06-30 1 0 110
And then create a function which would calculated the elapsed time between indicator = 110 and 101. Is this the right path?
- serie_a or serie_b only contain binary values.
- Events can happen in both series at the same time (indicator = 111).
- Goal would be to have the elapsed time in days.
- Both series are independant - they don't necessarily give out the same number of events.
Thanks!
Edit : Added bullet point infos
CodePudding user response:
If I understand correctly, you want to match events between the two columns and determine the difference in days to calculate if one column catches the events before the other.
You could try the following
Use datetime type, set date as index, and sort by date:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').sort_index()
Rank the events and mask the 0s:
df = df.cumsum().where(df.eq(1))
event_a event_b
date
2021-06-30 1.0 NaN
2021-07-31 NaN 1.0
2021-08-31 NaN NaN
2021-09-30 NaN 2.0
2021-10-31 2.0 NaN
2021-11-30 NaN NaN
2021-12-31 NaN NaN
merge the events on same rank and compute the date difference:
s1 = df['event_a'].rename('event').dropna().reset_index()
s2 = df['event_b'].rename('event').dropna().reset_index()
df2 = (
pd.merge(s1, s2, on='event', suffixes=('_a', '_b'))
.assign(diff=lambda r: r['date_a']-r['date_b'])
)
Now you get this kind of data:
date_a event date_b diff
0 2021-06-30 1.0 2021-07-31 -31 days
1 2021-10-31 2.0 2021-09-30 31 days
The "diff" column gives you, for each event, a negative delta if a is ahead, and positive if b is ahead.
Finally if you take the sum of diff, the sign of the result will give you an indication of which column is ahead overall (here equal):
df2['diff'].sum()
# Timedelta('0 days 00:00:00')
