I have a dataset that looks like below. dates1 and dates2 correspond to a start date and end date of a program.
id <- c(1, 2, 3, 3, 4, 4, 5)
dates1 <- as.Date(c("2020-01-01", "2020-03-01", "2020-01-01", "2020-02-01", "2020-01-15", "2020-03-01", "2020-03-01"))
dates2 <- as.Date(c("2020-06-15", "2020-07-17", "2020-04-05","2020-05-06", "2020-02-25","2020-05-31", "2020-03-17"))
dfx <- data.frame(id, dates1, dates2)
I am trying to identify any duplicate ids in the dataset that also has overlapping date. So, from the above dataset, I want to pull out only the id 3 data since it's the only one with duplicate ids and overlapping dates.
I want the output to be something like this:
id_dup <- c(3,3)
dates1_dup <- as.Date(c("2020-01-01", "2020-02-01"))
dates2_dup <- as.Date(c("2020-04-05","2020-05-06"))
dfx_dup <- data.frame(id_dup, dates1_dup, dates2_dup)
Appreciate any help. Thanks!
CodePudding user response:
base R
dfx[ave(dfx$id, dfx$id,
FUN = function(id) {
any(with(dfx[dfx$id == id[1],],
mapply(function(d1, d2) any(d1 > dates1 & d1 < dates2), dates1, dates2)))
}) > 0,]
# id dates1 dates2
# 3 3 2020-01-01 2020-04-05
# 4 3 2020-02-01 2020-05-06
dplyr
library(dplyr)
dfx %>%
group_by(id) %>%
filter(
any(mapply(function(d1, d2) any(d1 > dates1 & d1 < dates2), dates1, dates2))
) %>%
ungroup()
# # A tibble: 2 x 3
# id dates1 dates2
# <dbl> <date> <date>
# 1 3 2020-01-01 2020-04-05
# 2 3 2020-02-01 2020-05-06
CodePudding user response:
Another dplyr solution
dfx %>%
group_by(id) %>%
filter(n() > 1 & all(dates2 >= lead(dates1), na.rm = T))
id dates1 dates2
<dbl> <date> <date>
1 3 2020-01-01 2020-04-05
2 3 2020-02-01 2020-05-06
