Home > Software engineering >  Left join two dateframes with date columns on a range of dates?
Left join two dateframes with date columns on a range of dates?

Time:02-10

Here's an example that fulfills the criteria. Dataframe df1 and df2 both have id columns and date columns. In df1 the id column is unique while in df2 it is non-unique. I'd like to create a new dataframe where the join happens if df1.id == df2.id and some pair of dates in df2 is within 1 week of the date in df1

df1

| customer_id (unique) | 'purchase_date'|
| --------             | -------------- |
| 1                    | 2021-05-14     |
| 2                    | 2021-09-16     |

df2

| customer_id | 'visit_dates'  |
| --------    | -------------- |
| 1           | 2021-05-11     |
| 1           | 2021-05-16     |
| 1           | 2021-05-21     |
| 2           | 2021-07-14     |
| 2           | 2021-09-17     |


# New Df will only have 1 row.
# For customer 1 there is a date within the range(05-07 -> 05-14) 
# and within the range(05-14 -> 05-21) with matching id. 
# For customer 2, there are no dates within (09-09 -> 09-16) so it should be filtered

newdf

| customer_id (unique) | 'purchase_date'| begin_date_range | end_date_range
| --------             | -------------- | ---------------- | -------------
| 1                    | 2021-05-14     | 2021-05-11       | 2021-05-16

I understand how to do this in SQL, but I don't know what functions allow similar date predicate filtering in Pandas.

CodePudding user response:

Filter the merge result using the week condition.

df = df1.merge(df2, on='customer_id', how='left')
df[(df['purchase_date'] - df['visit_dates']).dt.days.between(0, 6)]
df['visit_dates 1week'] = df['visit_dates']   pd.Timedelta(days=6)

CodePudding user response:

Building on @Raymond Kwok's excellent answer: We could merge twice, once to left merge df1 to df2 and then merge the "begin_date_range" part with the "end_date_range" part

merged = df1.merge(df2, on='customer_id', how='left')
merged['purchase_date'] = pd.to_datetime(merged['purchase_date'])
merged['visit_dates'] = pd.to_datetime(merged['visit_dates'])

day_diff = merged['purchase_date'].sub(merged['visit_dates']).dt.days
out = (merged[day_diff.between(0,6)]
       .merge(merged[day_diff.between(-6,0)], on=['customer_id','purchase_date'])
       .rename(columns={'visit_dates_x': 'begin_date_range', 'visit_dates_y': 'end_date_range'}))

Output:

   customer_id purchase_date begin_date_range end_date_range
0            1    2021-05-14       2021-05-11     2021-05-16
  •  Tags:  
  • Related