Hi I have two data frames.
Both with ID variable and date variable. I want to join by ID and date, but dates can be joined by plus or minus 3 days each way. The order of first data frame need to remain. The two data frames have different lengths and not all dates or IDs will line up. IDs can also have 2 or more entries as different collections on different dates.
Hope that makes sense.
1st Data frame -
structure(list(ID = c(1, 1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 6,
6, 6, 7), `Date Collected` = structure(c(18903, 18936, 18976,
18539, 18508, 18581, 18655, 17688, 17975, 18037, 18297, 18081,
18242, 18338, 18721, 18128), class = "Date")), row.names = c(NA,
-16L), class = c("tbl_df", "tbl", "data.frame"))
2nd Data frame -
structure(list(ID = c(1, 1, 1, 1, 3, 3, 3, 4, 4, 4, 5), `Date Relapse` = structure(c(18900,
18938, 18973, 19004, 18511, 18578, 18657, 17322, 17685, 18129,
18300), class = "Date")), row.names = c(NA, -11L), class = c("tbl_df",
"tbl", "data.frame"))
CodePudding user response:
Here's a tidyverse solution.
group_modify takes a grouped tibble and applies a function to each of the groups. The function should take two arguments. The first supplies the rows in the current group (excluding the columns that define the group, the second is a single column tibble with one column for each column that defines the group. The values in the one row of this tibble contain the values that define the current group.
As you haven't defined how to resolve ties, I've simply taken the row in df2 that is closest to the Date Collected of df1. If two rows in df2 are equidistant to Date Collected, the earlier is taken.
# Modify to allow validation
df2 <- df2 %>%
mutate(Row2=row_number())
df1 %>%
group_by(ID, `Date Collected`) %>%
group_modify(
function(.x, .y) {
t <- df2 %>%
filter(ID == .y$ID) %>%
add_column(Delta=abs(as.numeric(.y$`Date Collected` - .$`Date Relapse`))) %>%
arrange(Delta) %>%
filter(Delta <= 3) %>%
head(1) %>%
select(-ID)
if (t %>% nrow() > 0) {
.x %>% bind_cols(t)
} else {
.x
}
}
)
# A tibble: 16 × 5
# Groups: ID, Date Collected [16]
ID `Date Collected` `Date Relapse` Row2 Delta
<dbl> <date> <date> <int> <dbl>
1 1 2021-10-03 2021-09-30 1 3
2 1 2021-11-05 2021-11-07 2 2
3 1 2021-12-15 2021-12-12 3 3
4 2 2020-10-04 NA NA NA
5 3 2020-09-03 2020-09-06 5 3
6 3 2020-11-15 2020-11-12 6 3
7 3 2021-01-28 2021-01-30 7 2
8 4 2018-06-06 2018-06-03 9 3
9 5 2019-03-20 NA NA NA
10 5 2019-05-21 NA NA NA
11 5 2020-02-05 2020-02-08 11 3
12 6 2019-07-04 NA NA NA
13 6 2019-12-12 NA NA NA
14 6 2020-03-17 NA NA NA
15 6 2021-04-04 NA NA NA
16 7 2019-08-20 NA NA NA
CodePudding user response:
There is a simpler way - use non-equi joins
library(data.table)
setDT(df2)[,`:=`(d_lower = `Date Relapse`-3, d_upper = `Date Relapse` 3)]
df2[
i=setDT(df1),
on = .(ID, d_lower<=`Date Collected`, d_upper>=`Date Collected`),
j=.(ID, "Date Collected"=d_lower, `Date Relapse`)
]
If you prefer dplyr, just join on ID, and then filter as needed
left_join(df1,inner_join(df1,df2, by="ID") %>%
filter(abs(`Date Collected`-`Date Relapse`)<=3),
by=c("ID", "Date Collected")
)
Either way, output:
ID Date Collected Date Relapse
1: 1 2021-10-03 2021-09-30
2: 1 2021-11-05 2021-11-07
3: 1 2021-12-15 2021-12-12
4: 2 2020-10-04 <NA>
5: 3 2020-09-03 2020-09-06
6: 3 2020-11-15 2020-11-12
7: 3 2021-01-28 2021-01-30
8: 4 2018-06-06 2018-06-03
9: 5 2019-03-20 <NA>
10: 5 2019-05-21 <NA>
11: 5 2020-02-05 2020-02-08
12: 6 2019-07-04 <NA>
13: 6 2019-12-12 <NA>
14: 6 2020-03-17 <NA>
15: 6 2021-04-04 <NA>
16: 7 2019-08-20 <NA>
