R: Populate a column based on matching row values in two different data frames

I have two data frames, 'df1' and 'df2', that share six column names. I want to check every row of df1 against df2: if an identical row exists in df2, enter a 1 in the 'detect' column of df1, and if not, enter a 0. Currently all values of 'detect' in df1 are 0, and I want them to change to 1 wherever there is an exact match between the two data frames. The data look like this:

df1

site	ddate	ssegment	spp	vtype	tperiod
BMA	6/1/2021	1	AMRO	Song	1
BMC	6/15/2021	1	WISN	Drum	1
BMA	6/15/2021	1	NOFL	Song	2
BMC	6/29/2021	2	AMRO	Call	1
BMA	6/29/2021	2	WISN	Call	2

df2

site	ddate	ssegment	spp	vtype	tperiod
BMA	6/1/2021	1	AMRO	Call	1
BMC	6/15/2021	1	WISN	Drum	1
BMA	6/15/2021	1	NOFL	Song	2
BMC	6/29/2021	2	AMRO	Drum	1
BMA	6/29/2021	2	WISN	Call	2

After scanning these, df1 would now look like:

df1

site	ddate	ssegment	spp	vtype	tperiod	detect
BMA	6/1/2021	1	AMRO	Song	1	0
BMC	6/15/2021	1	WISN	Drum	1	1
BMA	6/15/2021	1	NOFL	Song	2	1
BMC	6/29/2021	2	AMRO	Call	1	0
BMA	6/29/2021	2	WISN	Call	2	1

I was thinking that the base R function 'merge' might be useful, but I can't quite figure out how to apply it. Thank you for your help!

CodePudding user response：

Start with the detect column only in df2, then merge:

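df1$detect = NULL   # drop the all-zero placeholder so it is not used as a join key
df2$detect = 1      # every row of df2 counts as a detection
# all.x = TRUE keeps the df1 rows that have no match in df2 (detect is NA for them)
result = merge(df1, unique(df2), all.x = TRUE)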

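This will create the detect column with 1 where a row of df1 has an exact match in df2 and NA where it does not. If you want, you can change the NAs to 0s. Calling unique() on df2 keeps duplicated df2 rows from duplicating rows in the result. Note that merge() sorts the result on the key columns by default, so the row order may differ from df1.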
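The NA-to-0 step is not shown in the answer. A minimal base R sketch of the whole approach, using the sample data from the question, might look like this. The data frame construction and the final replacement line are my own additions, not part of the original answer.

```r
# Sample data copied from the question; detect starts as all 0s in df1
df1 <- data.frame(
  site     = c("BMA", "BMC", "BMA", "BMC", "BMA"),
  ddate    = c("6/1/2021", "6/15/2021", "6/15/2021", "6/29/2021", "6/29/2021"),
  ssegment = c(1L, 1L, 1L, 2L, 2L),
  spp      = c("AMRO", "WISN", "NOFL", "AMRO", "WISN"),
  vtype    = c("Song", "Drum", "Song", "Call", "Call"),
  tperiod  = c(1L, 1L, 2L, 1L, 2L),
  detect   = 0
)
df2 <- data.frame(
  site     = c("BMA", "BMC", "BMA", "BMC", "BMA"),
  ddate    = c("6/1/2021", "6/15/2021", "6/15/2021", "6/29/2021", "6/29/2021"),
  ssegment = c(1L, 1L, 1L, 2L, 2L),
  spp      = c("AMRO", "WISN", "NOFL", "AMRO", "WISN"),
  vtype    = c("Call", "Drum", "Song", "Drum", "Call"),
  tperiod  = c(1L, 1L, 2L, 1L, 2L)
)

# Same steps as the answer: detect lives only in df2, then merge on the shared columns
df1$detect <- NULL
df2$detect <- 1
result <- merge(df1, unique(df2), all.x = TRUE)

# Added step: rows of df1 without a match came back as NA; turn those into 0
result$detect[is.na(result$detect)] <- 0

print(result)
```

If the original row order of df1 matters, one option is to add a row index column to df1 before merging and sort on it afterwards.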

The same method can work with dplyr:

library(dplyr)
df1 %>% 
  select(-detect) %>%
  left_join(
    df2 %>% mutate(detect = 1) %>% unique()
  )
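As with merge, the left join leaves NA in detect for unmatched rows. Here is a self-contained sketch of the dplyr version that also fills those NAs, assuming the sample data from the question. The explicit `by` vector, the use of `distinct()` in place of `unique`, and the `coalesce()` step are my additions rather than part of the original answer.

```r
library(dplyr)

# The six shared columns, listed explicitly so the join prints no message
keys <- c("site", "ddate", "ssegment", "spp", "vtype", "tperiod")

df1 <- data.frame(
  site     = c("BMA", "BMC", "BMA", "BMC", "BMA"),
  ddate    = c("6/1/2021", "6/15/2021", "6/15/2021", "6/29/2021", "6/29/2021"),
  ssegment = c(1L, 1L, 1L, 2L, 2L),
  spp      = c("AMRO", "WISN", "NOFL", "AMRO", "WISN"),
  vtype    = c("Song", "Drum", "Song", "Call", "Call"),
  tperiod  = c(1L, 1L, 2L, 1L, 2L),
  detect   = 0
)
df2 <- data.frame(
  site     = c("BMA", "BMC", "BMA", "BMC", "BMA"),
  ddate    = c("6/1/2021", "6/15/2021", "6/15/2021", "6/29/2021", "6/29/2021"),
  ssegment = c(1L, 1L, 1L, 2L, 2L),
  spp      = c("AMRO", "WISN", "NOFL", "AMRO", "WISN"),
  vtype    = c("Call", "Drum", "Song", "Drum", "Call"),
  tperiod  = c(1L, 1L, 2L, 1L, 2L)
)

result <- df1 %>%
  select(-detect) %>%                       # drop the all-zero placeholder
  left_join(
    df2 %>% distinct() %>% mutate(detect = 1),
    by = keys
  ) %>%
  mutate(detect = coalesce(detect, 0))      # unmatched rows: NA becomes 0

print(result)
```

Unlike merge, left_join keeps the rows in the order they appear in df1, so the result lines up with the table in the question.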

CodePudding user response：

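dplyr also has anti_join and semi_join, filtering joins that keep the rows of df1 without a match in df2 (anti_join) or with one (semi_join). Binding the two pieces back together puts the unmatched rows first, so the row order differs from df1: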

library(tidyverse)

df1 <- tribble(
  ~site,      ~ddate, ~ssegment,   ~spp, ~vtype, ~tperiod, ~detect,
  "BMA",  "6/1/2021",        1L, "AMRO", "Song",       1L,      0L,
  "BMC", "6/15/2021",        1L, "WISN", "Drum",       1L,      0L,
  "BMA", "6/15/2021",        1L, "NOFL", "Song",       2L,      0L,
  "BMC", "6/29/2021",        2L, "AMRO", "Call",       1L,      0L,
  "BMA", "6/29/2021",        2L, "WISN", "Call",       2L,      0L
  )

df2 <- tibble::tribble(
~site,      ~ddate, ~ssegment,   ~spp, ~vtype, ~tperiod,
"BMA",  "6/1/2021",        1L, "AMRO", "Call",       1L,
"BMC", "6/15/2021",        1L, "WISN", "Drum",       1L,
"BMA", "6/15/2021",        1L, "NOFL", "Song",       2L,
"BMC", "6/29/2021",        2L, "AMRO", "Drum",       1L,
"BMA", "6/29/2021",        2L, "WISN", "Call",       2L
)


bind_rows(
  df1 %>% select(-detect) %>% anti_join(df2) %>% mutate(detect = 0),
  df1 %>% select(-detect) %>% semi_join(df2) %>% mutate(detect = 1)
)
#> Joining, by = c("site", "ddate", "ssegment", "spp", "vtype", "tperiod")
#> Joining, by = c("site", "ddate", "ssegment", "spp", "vtype", "tperiod")
#> # A tibble: 5 x 7
#>   site  ddate     ssegment spp   vtype tperiod detect
#>   <chr> <chr>        <int> <chr> <chr>   <int>  <dbl>
#> 1 BMA   6/1/2021         1 AMRO  Song        1      0
#> 2 BMC   6/29/2021        2 AMRO  Call        1      0
#> 3 BMC   6/15/2021        1 WISN  Drum        1      1
#> 4 BMA   6/15/2021        1 NOFL  Song        2      1
#> 5 BMA   6/29/2021        2 WISN  Call        2      1

Created on 2021-12-08 by the reprex package (v2.0.1)

CodePudding user response：

Here is one simple solution using the data.table package. It updates df1 in place and assumes df1 already has a detect column filled with 0s:

Reprex

Code

library(data.table)

setDT(df1)
setDT(df2)

# Update join: set detect to 1 in the rows of df1 that have an exact match in df2
df1[df2, on = .(site, ddate, ssegment, spp, vtype, tperiod), detect := 1L][]

Output


#>    site     ddate ssegment  spp vtype tperiod detect
#> 1:  BMA  6/1/2021        1 AMRO  Song       1      0
#> 2:  BMC 6/15/2021        1 WISN  Drum       1      1
#> 3:  BMA 6/15/2021        1 NOFL  Song       2      1
#> 4:  BMC 6/29/2021        2 AMRO  Call       1      0
#> 5:  BMA 6/29/2021        2 WISN  Call       2      1

Created on 2021-12-08 by the reprex package (v2.0.1)