Here's an example that fulfills the criteria. Dataframe df1 and df2 both have id columns and date columns. In df1 the id column is unique while in df2 it is non-unique. I'd like to create a new dataframe where the join happens if df1.id == df2.id and some pair of dates in df2 is within 1 week of the date in df1
df1
| customer_id (unique) | 'purchase_date'|
| -------- | -------------- |
| 1 | 2021-05-14 |
| 2 | 2021-09-16 |
df2
| customer_id | 'visit_dates' |
| -------- | -------------- |
| 1 | 2021-05-11 |
| 1 | 2021-05-16 |
| 1 | 2021-05-21 |
| 2 | 2021-07-14 |
| 2 | 2021-09-17 |
# New Df will only have 1 row.
# For customer 1 there is a date within the range(05-07 -> 05-14)
# and within the range(05-14 -> 05-21) with matching id.
# For customer 2, there are no dates within (09-09 -> 09-16) so it should be filtered
newdf
| customer_id (unique) | 'purchase_date'| begin_date_range | end_date_range
| -------- | -------------- | ---------------- | -------------
| 1 | 2021-05-14 | 2021-05-11 | 2021-05-16
I understand how to do this in SQL, but I don't know what functions allow similar date predicate filtering in Pandas.
CodePudding user response:
Filter the merge result using the week condition.
df = df1.merge(df2, on='customer_id', how='left')
df[(df['purchase_date'] - df['visit_dates']).dt.days.between(0, 6)]
df['visit_dates 1week'] = df['visit_dates'] pd.Timedelta(days=6)
CodePudding user response:
Building on @Raymond Kwok's excellent answer: We could merge twice, once to left merge df1 to df2 and then merge the "begin_date_range" part with the "end_date_range" part
merged = df1.merge(df2, on='customer_id', how='left')
merged['purchase_date'] = pd.to_datetime(merged['purchase_date'])
merged['visit_dates'] = pd.to_datetime(merged['visit_dates'])
day_diff = merged['purchase_date'].sub(merged['visit_dates']).dt.days
out = (merged[day_diff.between(0,6)]
.merge(merged[day_diff.between(-6,0)], on=['customer_id','purchase_date'])
.rename(columns={'visit_dates_x': 'begin_date_range', 'visit_dates_y': 'end_date_range'}))
Output:
customer_id purchase_date begin_date_range end_date_range
0 1 2021-05-14 2021-05-11 2021-05-16
