I have two data sources / columns A & B, both having company names, possibly in a slightly different format. The data sources are of different sizes. For example:
A:
| Company |
|---|
| Factory |
| Shop |
| Restaurant |
B:
| Company |
|---|
| Factory Inc |
| Shop Ltd |
| Service |
| Laundry |
How could I look up row-wise, if A company exists, perhaps as a substring in a longer column in the data set B?
CodePudding user response:
One approach is to use sapply and str_detect
sapply(df1$Company, function(x) str_detect(df2$Company, x))
Factory Shop Restaurant
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
CodePudding user response:
If you want to know which element of A is contained within which element of B, you can use the base function grepl with option fixed = TRUE. This option avoids unexpected behaviour if your text contains characters that are reserved in regular expressions, such as ., ?, ( or similar.
A <- c("Factory", "Shop", "Restaurant")
B <- c("Factory Inc", "Shop Ltd", "Service", "Laundry")
sapply(A, \(string) grepl(string, B, fixed = TRUE))
Factory Shop Restaurant
[1,] TRUE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
However, if you only want to know if a row in A is present anywhere in B, you can add any to the expression...
sapply(A, \(string) any(grepl(string, B, fixed = TRUE)))
Factory Shop Restaurant
TRUE TRUE FALSE
...or you can "cheat" by turning B into one big string (potentially using unique to account for duplication). Use a separator that would not be present elsewhere to avoid creating new "words" at the boundaries.
Bcheck <- paste(unique(B), collapse="|")
Bcheck
[1] "Factory Inc|Shop Ltd|Service|Laundry"
sapply(A, \(string) grepl(string, Bcheck, fixed = TRUE))
Factory Shop Restaurant
TRUE TRUE FALSE
The latter is a bit faster according to microbenchmark.
