Home > Mobile >  Can %in% be used in base R to match value pairs?
Can %in% be used in base R to match value pairs?

Time:01-07

I'm familiar with %in% generally, and I'm looking for a base R solution, if one exists.

Suppose I want to know whether a particular combination of values from multiple fields in a data frame exists in another data frame. As a work-around, sometimes I concatenate all these values into a single field and match on the custom concatenation, but I'm wondering if there's a way to pass the value combinations to %in% directly.

I'm imagining syntax similar to deduplicating on unique combinations of values across multiple columns, whose syntax works like this, by way of a generic example:

df[!duplicated(df[,c("col1","col2","col3")]),]

I was sort of expecting something like this to work, but I see why it doesn't:

df1[df1[,c("col1","col2")] %in% df2[,c("col1","col2")],]

... above, I'm attempting to ask which value pairs in df1 also exist as value pairs in df2.

CodePudding user response:

You can use mapply to create a logical matrix of matches and then use it to subset df1.

Test data.

set.seed(2022)
df1 <- data.frame(col1 = letters[1:10], col2 = 1:10, col3 = 11:20)
df2 <- data.frame(col1 = sample(letters[1:10], 4), 
                  col2 = sample(1:10, 4), col3 = 11:14)

Here I start by putting the columns in a vector, it simplifies the code.

cols <- c("col1", "col2")
(i <- mapply(\(x, y) x %in% y, df1[cols], df2[cols]))
#       col1  col2
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,]  TRUE FALSE
# [4,]  TRUE  TRUE
# [5,] FALSE FALSE
# [6,]  TRUE  TRUE
# [7,]  TRUE  TRUE
# [8,] FALSE FALSE
# [9,] FALSE  TRUE
#[10,] FALSE FALSE

Now subset. The question is not very clear on which of the following is asked for.

# at least one column match
j <- rowSums(i) > 0L
df1[j, ]
#  col1 col2 col3
#3    c    3   13
#4    d    4   14
#6    f    6   16
#7    g    7   17
#9    i    9   19

# all columns match
k <- rowSums(i) == length(cols)
df1[k, ]
#  col1 col2 col3
#4    d    4   14
#6    f    6   16
#7    g    7   17

CodePudding user response:

Would just doing a merge() by the two columns of interest get you what you need?

# Sim data borrowed from Rui Barradas' answer
set.seed(2022)
df1 <- data.frame(col1 = letters[1:10], col2 = 1:10, col3 = 11:20)
df2 <- data.frame(col1 = sample(letters[1:10], 4), 
                  col2 = sample(1:10, 4), col3 = 11:14)

# just merge based on the two columns of interest
p <- merge(df1, df2, by=c("col1", "col2"))

# if you want to drop the .y column(s)  
p[,-grep(".y", names(p), fixed = T)]
#>   col1 col2 col3.x
#> 1    d    4     14

Created on 2022-01-06 by the reprex package (v2.0.1)

  •  Tags:  
  • Related