regex: extract segments of a string containing a word, between symbols

Hello, I have a data frame that looks something like this:

library(tibble)  # data_frame() is deprecated; tibble() is the current constructor
dataframe <- tibble(text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12',
                             'WUFF;other stuff to keep;WIFF2;yes yes IGWIFF'))

print(dataframe)
# A tibble: 2 × 1
  text                                                   
  <chr>                                                  
1 WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12
2 WUFF;other stuff to keep;WIFF2;yes yes IGWIFF

I want to extract the segment of each string that contains the word "keep". Note that these segments can be separated from the other parts by different symbols, for example "," and ";".

The final dataset should look something like this:

final_dataframe <- tibble(text = c('some words to keep',
                                   'other stuff to keep'))

print(final_dataframe)
# A tibble: 2 × 1
  text               
  <chr>              
1 some words to keep 
2 other stuff to keep

Does anyone know how I could do this?

CodePudding user response：

With stringr ...


library(stringr)
library(dplyr)

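dataframe %>% 
   # (?<=[,;]) starts the match right after the first "," or ";";
   # .*keep then runs greedily up to the last "keep" in the string
   mutate(text = trimws(str_extract(text, "(?<=[,;]).*keep")))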
# A tibble: 2 × 1
  text               
  <chr>              
1 some words to keep 
2 other stuff to keep

Created on 2022-02-01 by the reprex package (v2.0.1)
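
The pattern above works for these two rows because the "keep" segment is always the second one. Here is a minimal sketch of a variant that does not rely on position. It assumes segments are delimited only by "," and ";" and that the search word contains no regex metacharacters. extract_segment is a hypothetical helper name, and base R is used so the snippet has no package dependencies.

```r
# Hypothetical helper: return the first ","/";"-delimited segment containing `word`.
# Assumes `word` contains no regex metacharacters.
extract_segment <- function(x, word = "keep") {
  # [^,;]* matches a run of non-delimiter characters,
  # so the match can never cross a "," or ";"
  pattern <- paste0("[^,;]*", word, "[^,;]*")
  m <- regexpr(pattern, x)
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))  # NA where no segment matches
  out[hit] <- regmatches(x, m)
  trimws(out)
}

x <- c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12',
       'WUFF;other stuff to keep;WIFF2;yes yes IGWIFF')
extract_segment(x)
# [1] "some words to keep"  "other stuff to keep"
```

In the pipeline above this would presumably slot in as mutate(text = extract_segment(text)). The same pattern should also work with stringr, as in str_extract(text, "[^,;]*keep[^,;]*").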

CodePudding user response：

I've made great use of the positive lookbehind and positive lookahead group constructs -- check this out: https://regex101.com/r/Sc7h8O/1

If you want to assert that the text you're looking for comes after a character or group (in your first case, the apostrophe), use the positive lookbehind (?<=').

If you want to do the same but assert that something comes before ', use the positive lookahead (?='). You also want to match zero or more characters on either side of "keep", so use .* on both sides, and you wind up with (?<=').*keep.*(?=')

In my test I did find that a string like text =' c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12', also matches the c(, which I didn't intend. But I assume your strings are all enclosed in pairs of apostrophes.
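
To see what the lookaround pattern does in R, here is a small sketch using base R's regexpr() with perl = TRUE, which is needed for lookbehind. The first call applies the apostrophe version to a made-up string (quoted is an illustration, not data from the question). The second swaps the apostrophes for the "," and ";" delimiters from the question, on the assumption that the "keep" segment has a delimiter on both sides.

```r
# Made-up input: a string that literally contains apostrophes
quoted <- "text = 'some words to keep, ciao'"

# Lookbehind/lookahead on apostrophes: returns everything between the quotes
whole <- regmatches(quoted, regexpr("(?<=').*keep.*(?=')", quoted, perl = TRUE))
whole
# [1] "some words to keep, ciao"

# Same idea with the question's delimiters; [^,;]* keeps the match inside one segment.
# Assumes the "keep" segment is neither first nor last in the string.
x <- c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12',
       'WUFF;other stuff to keep;WIFF2;yes yes IGWIFF')
pattern <- "(?<=[,;])[^,;]*keep[^,;]*(?=[,;])"
segments <- trimws(regmatches(x, regexpr(pattern, x, perl = TRUE)))
segments
# [1] "some words to keep"  "other stuff to keep"
```

The apostrophe version returns the whole quoted string, so isolating a single segment needs the delimiters themselves inside the lookarounds.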