I'm familiar with %in% generally, and I'm looking for a base R solution, if one exists.
Suppose I want to know whether a particular combination of values from multiple fields in a data frame exists in another data frame. As a work-around, sometimes I concatenate all these values into a single field and match on the custom concatenation, but I'm wondering if there's a way to pass the value combinations to %in% directly.
I'm imagining syntax similar to deduplicating on unique combinations of values across multiple columns, whose syntax works like this, by way of a generic example:
df[!duplicated(df[,c("col1","col2","col3")]),]
I was sort of expecting something like this to work, but I see why it doesn't:
df1[df1[,c("col1","col2")] %in% df2[,c("col1","col2")],]
... above, I'm attempting to ask which value pairs in df1 also exist as value pairs in df2.
CodePudding user response:
You can use mapply to create a logical matrix of matches and then use it to subset df1.
Test data.
set.seed(2022)
df1 <- data.frame(col1 = letters[1:10], col2 = 1:10, col3 = 11:20)
df2 <- data.frame(col1 = sample(letters[1:10], 4),
col2 = sample(1:10, 4), col3 = 11:14)
Here I start by putting the columns in a vector, it simplifies the code.
cols <- c("col1", "col2")
(i <- mapply(\(x, y) x %in% y, df1[cols], df2[cols]))
# col1 col2
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] TRUE FALSE
# [4,] TRUE TRUE
# [5,] FALSE FALSE
# [6,] TRUE TRUE
# [7,] TRUE TRUE
# [8,] FALSE FALSE
# [9,] FALSE TRUE
#[10,] FALSE FALSE
Now subset. The question is not very clear on which of the following is asked for.
# at least one column match
j <- rowSums(i) > 0L
df1[j, ]
# col1 col2 col3
#3 c 3 13
#4 d 4 14
#6 f 6 16
#7 g 7 17
#9 i 9 19
# all columns match
k <- rowSums(i) == length(cols)
df1[k, ]
# col1 col2 col3
#4 d 4 14
#6 f 6 16
#7 g 7 17
CodePudding user response:
Would just doing a merge() by the two columns of interest get you what you need?
# Sim data borrowed from Rui Barradas' answer
set.seed(2022)
df1 <- data.frame(col1 = letters[1:10], col2 = 1:10, col3 = 11:20)
df2 <- data.frame(col1 = sample(letters[1:10], 4),
col2 = sample(1:10, 4), col3 = 11:14)
# just merge based on the two columns of interest
p <- merge(df1, df2, by=c("col1", "col2"))
# if you want to drop the .y column(s)
p[,-grep(".y", names(p), fixed = T)]
#> col1 col2 col3.x
#> 1 d 4 14
Created on 2022-01-06 by the reprex package (v2.0.1)
