How can I dynamically get words surrounding a keyword?

Time:01-29

I have a sentence that may contain keywords. I search for them and, if one is found, I want the word before and the word after the keyword.

cont <- c("could not","would not","does not","will not","do not","were not","was not","did not")
text <- "this failed to increase incomes and production did not improve"
str_extract(text,"([^\\s]+\\s+){1}names(which(sapply(cont,grepl,text)))(\\s+[^\\s]+){1}")

This fails when I build the search dynamically with the names function, but if I type the keyword in directly:

str_extract(text,"([^\\s]+\\s+){1}did not(\\s+[^\\s]+){1}")

it correctly returns: production did not improve.
How can I get this to work without typing in the keywords directly? Final note: I do not completely understand the syntax used to get the surrounding words. Basic R books have not covered this. Can someone explain please?

CodePudding user response:

You could use your cont vector to create a vector of regex strings:

targets <- paste0("([^\\s]+\\s+){1}", cont, "(\\s+[^\\s]+){1}")

Which you can feed into str_extract_all and then unlist:

unlist(stringr::str_extract_all(text, targets))
#> [1] "production did not improve"

If this is something you need to do quite frequently, you could wrap it in a function:

get_surrounding <- function(string, keywords) {
  targets <- paste0("([^\\s]+\\s+){1}", keywords, "(\\s+[^\\s]+){1}")
  unlist(stringr::str_extract_all(string, targets))
}

With which you can easily run the query on new strings:

new_text <- "The production did not increase because the manager would not allow it."

get_surrounding(new_text, cont)
#> [1] "manager would not allow"     "production did not increase"
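
As for why the original attempt fails: everything between the quotes is literal pattern text, so names(which(...)) is never evaluated as R code. Below is a minimal base R sketch of the same paste-the-keyword-in idea, assuming the keywords contain no regex metacharacters. The names `found` and `pattern` are only illustrative, and `\\S+` is used as shorthand for `[^\\s]+`.

```r
cont <- c("could not", "would not", "does not", "will not",
          "do not", "were not", "was not", "did not")
text <- "this failed to increase incomes and production did not improve"

# Keywords that actually occur in the text (fixed = TRUE: plain substring search).
found <- cont[vapply(cont, grepl, logical(1), x = text, fixed = TRUE)]

# Paste each found keyword between "one word + whitespace" on either side.
pattern <- paste0("\\S+\\s+", found, "\\s+\\S+")

# One pattern per found keyword; take the first match of each.
result <- unlist(lapply(pattern, function(p) {
  regmatches(text, regexpr(p, text, perl = TRUE))
}))
print(result)
```

Here `found` is just "did not", so `result` holds the single string "production did not improve".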

CodePudding user response:

Perhaps we can try this base R approach, which joins the keywords into a single alternation:

> regmatches(text, gregexpr(sprintf("\\w+\\s(%s)\\s\\w+", paste0(cont, collapse = "|")), text))[[1]]
[1] "production did not improve"
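
If it helps, the one-liner can be wrapped in a small function. This is only a sketch: `surrounding_base` is a made-up name, and it assumes the keywords contain no regex metacharacters. Because all keywords go into a single alternation, the matches come back in the order they appear in the string.

```r
# Hypothetical helper; the pattern mirrors the one-liner above.
surrounding_base <- function(string, keywords) {
  # "could not|would not|..." wrapped in a group, with one word on each side.
  pattern <- sprintf("\\w+\\s(%s)\\s\\w+", paste0(keywords, collapse = "|"))
  regmatches(string, gregexpr(pattern, string))[[1]]
}

cont <- c("could not", "would not", "does not", "will not",
          "do not", "were not", "was not", "did not")
new_text <- "The production did not increase because the manager would not allow it."

matches <- surrounding_base(new_text, cont)
print(matches)
```

For `new_text` this gives "production did not increase" followed by "manager would not allow".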

CodePudding user response:

Each match of the following regular expression will save the preceding and following words in capture groups 1 and 2, respectively.

\\b([a-z]+) +(?:could|would|does|will|do|were|was|did) +not +([a-z]+)\\b

You will of course have to form this expression programmatically, but that should be straightforward.

Hover the cursor over each element of the expression at this demo to obtain an explanation of its function.

For the string

"she could not believe that production did not improve"

there are two matches. For the first ("she could not believe") "she" and "believe" are saved to capture groups 1 and 2, respectively. For the second ("production did not improve") "production" and "improve" are saved to capture groups 1 and 2, respectively.
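
One possible way to form the expression programmatically and read the capture groups in base R is sketched below. It assumes every keyword ends in " not", which holds for the `cont` vector in the question. The names `verbs`, `full`, `groups`, `before` and `after` are illustrative, not part of any library.

```r
cont <- c("could not", "would not", "does not", "will not",
          "do not", "were not", "was not", "did not")

# Assumption: each keyword is "<verb> not", so strip the trailing " not".
verbs <- sub(" not$", "", cont)

# Rebuild the expression shown above from the verb list.
pattern <- paste0("\\b([a-z]+) +(?:", paste(verbs, collapse = "|"),
                  ") +not +([a-z]+)\\b")

x <- "she could not believe that production did not improve"

# Step 1: collect every full match in the string.
full <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Step 2: re-run the pattern on each match to pull out the two capture groups.
groups <- regmatches(full, regexec(pattern, full, perl = TRUE))
before <- vapply(groups, `[`, character(1), 2)  # capture group 1
after  <- vapply(groups, `[`, character(1), 3)  # capture group 2

print(data.frame(match = full, before = before, after = after))
```

Note that `[a-z]` only matches lowercase letters, so a capitalised neighbouring word would need `[A-Za-z]` or a case-insensitive match.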
