Pandas time series elapsed time between events-CodePudding

I have two time series contained in a dataframe, similar to this :

events = {'date':['2021-12-31','2021-11-30','2021-10-31','2021-09-30','2021-08-31','2021-07-31','2021-06-30'], 
'serie_a':[0,0,1,0,0,0,1], 'serie_b':[0,0,0,1,0,1,0]}
 df = pd.DataFrame.from_dict(events)

date        serie_a  serie_b
2021-12-31        0        0
2021-11-30        0        0
2021-10-31        1        0
2021-09-30        0        1
2021-08-31        0        0
2021-07-31        0        1
2021-06-30        1        0

Series a & b represent the occurence of the same event, but from two different sources. I would like to quantify the lag between the two sources (basically to find if one is faster than the other). Now, finding the elapsed time for one series would be easy, but I can't seem to find the trick to compare both.

I've tried creating a dummy variable which would indicate :

{100:'No events', 110:'Event in serie_a', 111:'Event in serie a & b', ..}

date        serie_a serie_b  indicator
2021-12-31        0       0        100
2021-11-30        0       0        100
2021-10-31        1       0        110
2021-09-30        0       1        101
2021-08-31        0       0        100
2021-07-31        0       1        101
2021-06-30        1       0        110

And then create a function which would calculated the elapsed time between indicator = 110 and 101. Is this the right path?

serie_a or serie_b only contain binary values.
Events can happen in both series at the same time (indicator = 111).
Goal would be to have the elapsed time in days.
Both series are independant - they don't necessarily give out the same number of events.

Thanks!

Edit : Added bullet point infos

CodePudding user response：

If I understand correctly, you want to match events between the two columns and determine the difference in days to calculate if one column catches the events before the other.

You could try the following

Use datetime type, set date as index, and sort by date:

df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').sort_index()

Rank the events and mask the 0s:

df = df.cumsum().where(df.eq(1))

            event_a  event_b
date                        
2021-06-30      1.0      NaN
2021-07-31      NaN      1.0
2021-08-31      NaN      NaN
2021-09-30      NaN      2.0
2021-10-31      2.0      NaN
2021-11-30      NaN      NaN
2021-12-31      NaN      NaN

merge the events on same rank and compute the date difference:

s1 = df['event_a'].rename('event').dropna().reset_index()
s2 = df['event_b'].rename('event').dropna().reset_index()

df2 = (
 pd.merge(s1, s2, on='event', suffixes=('_a', '_b'))
   .assign(diff=lambda r: r['date_a']-r['date_b'])
 )

Now you get this kind of data:

      date_a  event     date_b     diff
0 2021-06-30    1.0 2021-07-31 -31 days
1 2021-10-31    2.0 2021-09-30  31 days

The "diff" column gives you, for each event, a negative delta if a is ahead, and positive if b is ahead.

Finally if you take the sum of diff, the sign of the result will give you an indication of which column is ahead overall (here equal):

df2['diff'].sum()
# Timedelta('0 days 00:00:00')