Hello I have a data frame that looks something like this
dataframe <- data_frame(text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12',
'WUFF;other stuff to keep;WIFF2;yes yes IGWIFF'))
print(dataframe)
# A tibble: 2 × 1
text
<chr>
1 WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12
2 WUFF;other stuff to keep;WIFF2;yes yes IGWIFF
I want to extract the segment of the strings containing the word "keep". Note that these segments can be separated from other parts by different symbols for example , and ;.
the final dataset should look something like this.
final_dataframe <- data_frame(text = c('some words to keep',
'other stuff to keep'))
print(final_dataframe)
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Does anyone know how I could do this?
CodePudding user response:
With stringr ...
library(stringr)
library(dplyr)
dataframe %>%
mutate(text = trimws(str_extract(text, "(?<=[,;]).*keep")))
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Created on 2022-02-01 by the reprex package (v2.0.1)
CodePudding user response:
I've made great use of the positive lookbehind and positive lookahead group constructs -- check this out: https://regex101.com/r/Sc7h8O/1
If you want to assert that the text you're looking for comes after a character/group -- in your first case the apostrophe, use (?<=').
If you want to do the same but match something before ' then use (?=')
And you want to match between 0 and unlimited characters surrounding "keep" so use .* on either side, and you wind up with (?<=').*keep.*(?=')
I did find in my test that a string like text =' c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12', will also match the c(, which I didn't intend. But I assume your strings are all captured by pairs of apostrophes.
