library(data.table)
dat1 <- data.table(id1 = c(1, 1, 2),
pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.table(id2 = c(1174, 1231),
description = c("apple is sweet", "apple is a computer"),
memo = c("bananas, sweet yes", "bananas, sweetyes"))
> dat1
id1 pattern
1: 1 apple
2: 1 applejack
3: 2 bananas, sweet
> dat2
id2 description memo
1: 1174 apple is sweet bananas, sweet yes
2: 1231 apple is a computer bananas, sweetyes
I have two data.tables, dat1 and dat2. I want to search for each pattern in dat2 against the description and memo columns in dat2 and store the corresponding id2s.
The final output table should look something like this:
id1 pattern description_match memo_match
1: 1 apple 1174,1231 <NA>
2: 1 applejack <NA> <NA>
3: 2 bananas, sweet <NA> 1174
The regular expression I want to use is \\b[pattern]\\b. Below is my attempt:
dat1[, description_match := dat2[grepl(paste0("\\b", dat1$pattern, "\\b"), dat2$description), .(id2 = paste(id2, collapse = ","))]]
dat1[, memo_match := dat2[grepl(paste0("\\b", dat1$pattern, "\\b"), dat2$memo), .(id2 = paste(id2, collapse = ","))]]
However, both give me the error that grepl can only use the first pattern.
CodePudding user response:
We group by row sequence, create the match columns from 'dat2', by pasteing the 'id2' extracted from the logical output from grepl
library(data.table)
dat1[, c("description_match", "memo_match") := {
pat <- sprintf('\\b(%s)\\b', paste(pattern, collapse = "|"))
.(toString(dat2$id2[grepl(pat, dat2$description)]),
toString(dat2$id2[grepl(pat, dat2$memo)]))
}, seq_along(id1)]
dplyr::na_if(dat1, "")
id1 pattern description_match memo_match
<num> <char> <char> <char>
1: 1 apple 1174, 1231 <NA>
2: 1 applejack <NA> <NA>
3: 2 bananas, sweet <NA> 1174
According to ?sprintf
The string fmt contains normal characters, which are passed through to the output string, and also conversion specifications which operate on the arguments provided through .... The allowed conversion specifications start with a % and end with one of the letters in the set aAdifeEgGosxX%
s - Character string. Character NAs are converted to "NA".
Or use a for loop
for(nm in names(dat2)[-1])
dat1[, paste0(nm, "_match") :=
toString(dat2$id2[grepl(paste0("\\b", pattern, "\\b"),
dat2[[nm]])]), seq_along(id1)][]
