Checking in R whether a string is a substring in another column of longer length-CodePudding

I have two data sources / columns A & B, both having company names, possibly in a slightly different format. The data sources are of different sizes. For example:

Company
Factory
Shop
Restaurant

Company
Factory Inc
Shop Ltd
Service
Laundry

How could I look up row-wise, if A company exists, perhaps as a substring in a longer column in the data set B?

CodePudding user response：

One approach is to use sapply and str_detect

sapply(df1$Company, function(x) str_detect(df2$Company, x))

     Factory  Shop Restaurant
[1,]    TRUE FALSE      FALSE
[2,]   FALSE  TRUE      FALSE
[3,]   FALSE FALSE      FALSE
[4,]   FALSE FALSE      FALSE

CodePudding user response：

If you want to know which element of A is contained within which element of B, you can use the base function grepl with option fixed = TRUE. This option avoids unexpected behaviour if your text contains characters that are reserved in regular expressions, such as ., ?, ( or similar.

A <- c("Factory", "Shop", "Restaurant")
B <- c("Factory Inc", "Shop Ltd", "Service", "Laundry")
sapply(A, \(string) grepl(string, B, fixed = TRUE))

     Factory  Shop Restaurant
[1,]    TRUE FALSE      FALSE
[2,]   FALSE  TRUE      FALSE
[3,]   FALSE FALSE      FALSE
[4,]   FALSE FALSE      FALSE

However, if you only want to know if a row in A is present anywhere in B, you can add any to the expression...

sapply(A, \(string) any(grepl(string, B, fixed = TRUE)))
   Factory       Shop Restaurant 
      TRUE       TRUE      FALSE

...or you can "cheat" by turning B into one big string (potentially using unique to account for duplication). Use a separator that would not be present elsewhere to avoid creating new "words" at the boundaries.

Bcheck <- paste(unique(B), collapse="|")
Bcheck
[1] "Factory Inc|Shop Ltd|Service|Laundry"

sapply(A, \(string) grepl(string, Bcheck, fixed = TRUE))
   Factory       Shop Restaurant 
      TRUE       TRUE      FALSE

The latter is a bit faster according to microbenchmark.