Home > OS >  Identifying duplicates with overlapping dates
Identifying duplicates with overlapping dates

Time:01-26

I have a dataset that looks like below. dates1 and dates2 correspond to a start date and end date of a program.

id <- c(1, 2, 3, 3, 4, 4, 5)
dates1 <- as.Date(c("2020-01-01", "2020-03-01", "2020-01-01", "2020-02-01", "2020-01-15", "2020-03-01", "2020-03-01"))
dates2 <- as.Date(c("2020-06-15", "2020-07-17", "2020-04-05","2020-05-06", "2020-02-25","2020-05-31", "2020-03-17"))

dfx <- data.frame(id, dates1, dates2)

I am trying to identify any duplicate ids in the dataset that also has overlapping date. So, from the above dataset, I want to pull out only the id 3 data since it's the only one with duplicate ids and overlapping dates.

I want the output to be something like this:

id_dup <- c(3,3)
dates1_dup <- as.Date(c("2020-01-01", "2020-02-01"))
dates2_dup <- as.Date(c("2020-04-05","2020-05-06"))

dfx_dup <- data.frame(id_dup, dates1_dup, dates2_dup)

Appreciate any help. Thanks!

CodePudding user response:

base R

dfx[ave(dfx$id, dfx$id,
    FUN = function(id) {
      any(with(dfx[dfx$id == id[1],],
               mapply(function(d1, d2) any(d1 > dates1 & d1 < dates2), dates1, dates2)))
    }) > 0,]
#   id     dates1     dates2
# 3  3 2020-01-01 2020-04-05
# 4  3 2020-02-01 2020-05-06

dplyr

library(dplyr)
dfx %>%
  group_by(id) %>%
  filter(
    any(mapply(function(d1, d2) any(d1 > dates1 & d1 < dates2), dates1, dates2))
  ) %>%
  ungroup()
# # A tibble: 2 x 3
#      id dates1     dates2    
#   <dbl> <date>     <date>    
# 1     3 2020-01-01 2020-04-05
# 2     3 2020-02-01 2020-05-06

CodePudding user response:

Another dplyr solution

dfx %>% 
  group_by(id) %>% 
  filter(n() > 1 & all(dates2 >= lead(dates1), na.rm = T))

     id dates1     dates2    
  <dbl> <date>     <date>    
1     3 2020-01-01 2020-04-05
2     3 2020-02-01 2020-05-06
  •  Tags:  
  • Related