I have two dataframes that I'd like to conditionally merge.
df1:
Location `Sub Location` Date n
<chr> <chr> <chr> <int>
1 AREA 1 Bore Drain 2014-04-21 15
2 AREA 1 Bore Drain 2014-04-23 2
3 AREA 1 Bore Drain 2014-04-24 6
4 AREA 1 Bore Drain 2015-04-04 6
5 AREA 1 Bore Drain 2015-04-08 8
6 AREA 1 Bore Drain 2015-04-09 9
7 AREA 1 Bore Drain 2016-03-25 31
8 AREA 1 Large Dam 2016-03-26 7
9 AREA 1 Bore Drain 2016-04-01 2
10 AREA 1 Bore Drain 2016-04-02 6
and df2:
Location `Sub Location` StartDate EndDate Totals
<chr> <chr> <chr> <chr> <dbl>
1 AREA 1 Homestead 2013-03-29 2013-03-30 0
2 AREA 1 Bore Drain 2014-04-21 2014-04-21 0
3 AREA 1 Homestead 2014-04-17 2014-04-18 0
4 AREA 1 Cottage 2014-04-21 2014-04-22 0
5 AREA 1 Bore Drain 2014-04-23 2014-04-24 0
6 AREA 1 Bore Drain 2015-04-03 2015-04-04 0
7 AREA 1 Homestead 2015-04-03 2015-04-04 0
8 AREA 1 Bore Drain 2015-04-08 2015-04-09 0
9 AREA 1 Cottage 2015-04-08 2015-04-09 0
10 AREA 1 Homestead 2016-03-25 2016-03-25 0
What I'd like to do is check for each entry in df1, if Date matches either StartDate OR EndDate, AND the location and Sub Location are the same. If this is the case, I'd like n in df1 to be added to Totals in df2
I've tried using ifelse() or going through every entry in a for loop, but I haven't been able to get it working. Any suggestions are appreciated :)
CodePudding user response:
Merge twice, first matching Date to StartDate, then to EndDate. For cases where StartDate == EndDate, it will merge in the same row from df1 twice, so delete the second n value. Then add the two merged-in n columns to Totals.
library(dplyr)
merged <- df2 %>%
left_join(df1, by = c("StartDate" = "Date", "Location", "SubLocation")) %>%
left_join(df1, by = c("EndDate" = "Date", "Location", "SubLocation")) %>%
mutate(
n.y = ifelse(StartDate == EndDate, NA, n.y),
Totals = rowSums(across(c(Totals, n.x, n.y)), na.rm = TRUE),
n.x = NULL,
n.y = NULL
)
merged
Output:
# A tibble: 10 × 5
Location SubLocation StartDate EndDate Totals
<chr> <chr> <date> <date> <dbl>
1 AREA 1 Homestead 2013-03-29 2013-03-30 0
2 AREA 1 BoreDrain 2014-04-21 2014-04-21 15
3 AREA 1 Homestead 2014-04-17 2014-04-18 0
4 AREA 1 Cottage 2014-04-21 2014-04-22 0
5 AREA 1 BoreDrain 2014-04-23 2014-04-24 8
6 AREA 1 BoreDrain 2015-04-03 2015-04-04 6
7 AREA 1 Homestead 2015-04-03 2015-04-04 0
8 AREA 1 BoreDrain 2015-04-08 2015-04-09 17
9 AREA 1 Cottage 2015-04-08 2015-04-09 0
10 AREA 1 Homestead 2016-03-25 2016-03-25 0
CodePudding user response:
Something like this
df2$Totals <- rowSums(t(apply(df2, 1, function(x)
ifelse(x["Location"] == df1$Location &
x["`Sub Location`"] == df1$"`Sub Location`" &
(x["StartDate"] == df1$Date | x["EndDate"] == df1$Date),
df1$n, 0))))
df2
Location `Sub Location` StartDate EndDate Totals
1 AREA 1 Homestead 2013-03-29 2013-03-30 0
2 AREA 1 Bore Drain 2014-04-21 2014-04-21 15
3 AREA 1 Homestead 2014-04-17 2014-04-18 0
4 AREA 1 Cottage 2014-04-21 2014-04-22 0
5 AREA 1 Bore Drain 2014-04-23 2014-04-24 8
6 AREA 1 Bore Drain 2015-04-03 2015-04-04 6
7 AREA 1 Homestead 2015-04-03 2015-04-04 0
8 AREA 1 Bore Drain 2015-04-08 2015-04-09 17
9 AREA 1 Cottage 2015-04-08 2015-04-09 0
10 AREA 1 Homestead 2016-03-25 2016-03-25 0
Data
df1 <- structure(list(Location = c("AREA 1", "AREA 1", "AREA 1", "AREA 1",
"AREA 1", "AREA 1", "AREA 1", "AREA 1", "AREA 1", "AREA 1"),
``Sub Location`` = c("Bore Drain", "Bore Drain", "Bore Drain",
"Bore Drain", "Bore Drain", "Bore Drain", "Bore Drain", "Large Dam",
"Bore Drain", "Bore Drain"), Date = c("2014-04-21", "2014-04-23",
"2014-04-24", "2015-04-04", "2015-04-08", "2015-04-09", "2016-03-25",
"2016-03-26", "2016-04-01", "2016-04-02"), n = c(15L, 2L,
6L, 6L, 8L, 9L, 31L, 7L, 2L, 6L)), class = "data.frame", row.names = c(NA,
-10L))
df2 <- structure(list(Location = c("AREA 1", "AREA 1", "AREA 1", "AREA 1",
"AREA 1", "AREA 1", "AREA 1", "AREA 1", "AREA 1", "AREA 1"),
``Sub Location`` = c("Homestead", "Bore Drain", "Homestead",
"Cottage", "Bore Drain", "Bore Drain", "Homestead", "Bore Drain",
"Cottage", "Homestead"), StartDate = c("2013-03-29", "2014-04-21",
"2014-04-17", "2014-04-21", "2014-04-23", "2015-04-03", "2015-04-03",
"2015-04-08", "2015-04-08", "2016-03-25"), EndDate = c("2013-03-30",
"2014-04-21", "2014-04-18", "2014-04-22", "2014-04-24", "2015-04-04",
"2015-04-04", "2015-04-09", "2015-04-09", "2016-03-25"),
Totals = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
-10L), class = "data.frame")
CodePudding user response:
Personally I prefer case_when. To me it is easier to understand. First full_join the 2 dataframes, then use case_when to the conditions you specified. Then distinct to remove duplicates
df1 %>%
full_join(df2) %>%
mutate(n = as.numeric(n),
Totals = as.numeric(Totals)) %>%
mutate(Totals = case_when(
Date == StartDate | Date == EndDate ~ Totals n,
TRUE ~ Totals
)) %>%
select(-n, -Date) %>%
drop_na() %>%
distinct(Location,Sub.Location,StartDate,EndDate, .keep_all = TRUE)
