I have two datasets allocated to pandas dataframes DF1,DF2. One with historical customer transactions (DF2) and another with other customer activities (DF1), I need to first keep only first occurances in DF2 to avoid any mistakes, then add a column ['existed_before'] to DF1 with input as True/False after comparing ID's and dates in both dataframes. The logic I have in mind is something like the following: if DF1['id'] == to DF2['ID'] and DF1['date'] > DF2['date'] then True, else it should be False or Nan in case the ID didn't exist in DF2 at all.
Below is a demonstration of the desired output, *Notes: code should consider both id and full date (including hours/mins/sec). Also, an ID might repeat many times in both dataframes.
df1:
id date
1 19-01-2021 00:13:48
2 04-03-2021 06:23:21
3 06-05-2021 08:44:00
3 23-11-2021 23:18:18
4 15-12-2021 03:20:13
df2:
id date
1 19-01-2021 00:13:48
2 02-01-2021 09:27:23
3 06-05-2021 08:44:00
df1:
id date existed_before
1 19-01-2021 00:13:48 False
2 04-03-2021 06:23:21 True
3 06-05-2021 08:44:00 False
3 23-11-2021 23:18:18 True
4 15-12-2021 03:20:13 NaN
CodePudding user response:
I got different ouput in merge_asof with helper column:
df1['date'] = pd.to_datetime(df1['date'], dayfirst=True)
df2['date'] = pd.to_datetime(df2['date'], dayfirst=True)
df1 = df1.sort_values('date')
df2 = df2.sort_values('date')
df = pd.merge_asof(df1,
df2.assign(first_time=True),
on='date',
by='id',
allow_exact_matches=False)
df['first_time'] = (df['first_time'].fillna(False)
.astype('boolean')
.where(df['id'].isin(df2['id'])))
print (df)
id date first_time
0 1 2021-01-19 00:13:48 False
1 2 2021-03-04 06:23:21 True
2 3 2021-05-06 08:44:00 False
3 3 2021-11-23 23:18:18 True
4 4 2021-12-15 03:20:13 <NA>
