I have a series of strings like "the appointment of XX as head", "appoints YY as head" (included in a data frame labelled "df" in a column labelled "title")
I want to extract the names XX, YY enclosed between the two different expressions.
I'm currently using the following:
df$name <- df$title %>%
str_extract(regex(pattern = "(?<=Appointment of).*(?= as)", ignore_case=TRUE))
However, that works with only one of the two possible patterns. I also tried:
df$name <- df$title %>%
str_extract(regex(pattern = "(?<=Appointment of).*(?= as)"|"(?<=joins).*(?= as)", ignore_case=TRUE))
which of course does not work. How can I create multiple patterns to feed into str_extract?
Happy to provide further details if needed!
Thanks a lot
CodePudding user response:
You can use
df$name <- df$title %>%
str_extract(regex(pattern = "(?<=\\bAppointment of\\s|\\bjoins\\s).*?(?=\\s+as\\b)", ignore_case=TRUE))
Details:
(?<= - start of a positive lookbehind
\bAppointment of\s - a word boundary (\b), Appointment of, and then a whitespace char (\s)
| - or
\bjoins\s - a whole word joins and a whitespace
) - end of the lookbehind
.*? - any zero or more chars other than line break chars, as few as possible
(?=\s+as\b) - a positive lookahead that requires one or more whitespaces, as and a word boundary immediately to the right of the current location.
Note that in stringr, which uses the ICU regex engine, lookbehind patterns do not have to be strictly fixed-width as long as their length is bounded, so you can use
"(?<=\\bAppointment of\\s{1,100}|\\bjoins\\s{1,100}).*?(?=\\s+as\\b)"
where \s{1,100} can match one to a hundred whitespace chars.
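To try the pattern outside the data frame, here is a minimal sketch. The titles vector is made-up sample data (the third entry deliberately matches neither phrase), and the str_match() variant with a capture group is an alternative not taken from the answer above, shown only for comparison.

```r
library(stringr)

# Hypothetical sample data; the third title matches neither phrase
titles <- c("The appointment of XX as head",
            "Firm joins YY as head",
            "Quarterly results published")

# Both prefixes sit inside a single lookbehind, separated by |
pattern <- regex("(?<=\\bappointment of\\s|\\bjoins\\s).*?(?=\\s+as\\b)",
                 ignore_case = TRUE)
extracted <- str_extract(titles, pattern)
extracted
## [1] "XX" "YY" NA

# Alternative without lookarounds: capture the name and keep column 2,
# which holds the first capture group returned by str_match()
pattern_capture <- regex("\\b(?:appointment of|joins)\\s+(.*?)\\s+as\\b",
                         ignore_case = TRUE)
captured <- str_match(titles, pattern_capture)[, 2]
captured
## [1] "XX" "YY" NA
```

Titles that match neither phrase come back as NA in both variants, so the result can be assigned straight to df$name.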
CodePudding user response:
strapply can do it without using zero width constructs. Only the second capture group is returned.
library(gsubfn)
x <- c("the appointment of XX as head", "appoints YY as head") # input
strapply(x, "(appointment of|appoints) (.*?) as head", ~ ..2, simplify = TRUE)
## [1] "XX" "YY"
or use (?:...) to specify that the first parenthesized portion is not to be a capture group:
strapply(x, "(?:appointment of|appoints) (.*?) as head", simplify = TRUE)
## [1] "XX" "YY"
Base R
In base R it could be done with sub if every component of x matches
sub(".*(appointment of|appoints) (.*?) as head.*", "\\2", x)
## [1] "XX" "YY"
or strcapture if not
proto <- data.frame(dummy = character(0), value = character(0))
strcapture("(appointment of|appoints) (.*?) as head", x, proto)[, 2]
## [1] "XX" "YY"
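The difference between the two base R options shows up once an element does not match. The sketch below adds a made-up third string to x to illustrate it; the patterns are the same as above.

```r
# Hypothetical input: the third element contains neither phrase
x <- c("the appointment of XX as head", "appoints YY as head", "no name here")
pattern <- "(appointment of|appoints) (.*?) as head"

# sub() leaves a non-matching element unchanged, so the whole string comes back
via_sub <- sub(paste0(".*", pattern, ".*"), "\\2", x)
via_sub
## [1] "XX"           "YY"           "no name here"

# strcapture() returns NA for a non-matching element instead
proto <- data.frame(dummy = character(0), value = character(0))
via_strcapture <- strcapture(pattern, x, proto)[, 2]
via_strcapture
## [1] "XX" "YY" NA
```

So sub is only safe when every element is known to match, while strcapture marks the failures explicitly.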
