I have a list of country names and a dataframe containing one column of text and one column of binary indicators.
MWE:
rm(list=ls())
library(countrycode)
country_list <- countrycode::codelist$country.name.en
Text <- c("This is","a test to", "find country", "names like Algeria", "Albania and Afghanistan","in the data","and return only the","first match in each","string, Algeria and Albania", "not Afghanistan")
df <- as.data.frame(Text)
df$ofInterest <- c(0,0,0,1,1,1,0,0,1,0)
I want to return the first word (and only the first word) in df$Text that matches any element in country_list. In other words, I'm only interested in the very first country name that gets mentioned.
The operation should create a new column in df indicating the matched country name, or NA if no matches from country_list were found, for each row.
To make things faster, I also want to restrict the search to rows where df$ofInterest==1.
The result should look like this:
Text                         ofInterest  Match
This is                      0           NA
a test to                    0           NA
find country                 0           NA
names like Algeria           1           Algeria
Albania and Afghanistan      1           Albania
in the data                  1           NA
and return only the          0           NA
first match in each          0           NA
string, Algeria and Albania  1           Algeria
not Afghanistan              0           NA
My problem is that I don't know how to use regex while also pattern matching from a list. How can I do this in R?
This is as far as I got. The "xxxxx" is presumably where the names from country_list should go.
df$Match <- ifelse(str_extract(df$Text, "(?<=^| )xxxxx.*?(?=$| )") %in% country_list, str_extract(df$Text, "(?<=^| )xxxxx.*?(?=$| )"), NA)
This is probably a simple problem, but I couldn't find the solution. Thank you for any help!
CodePudding user response:
You can use
library(stringr)  # provides str_extract()
df$Match <- str_extract(df$Text, paste0("(?i)\\b(", paste(country_list, collapse="|"), ")\\b"))
df <- within(df, Match[ofInterest == '0'] <- NA)
# > df
#                           Text ofInterest   Match
# 1                      This is          0    <NA>
# 2                    a test to          0    <NA>
# 3                 find country          0    <NA>
# 4           names like Algeria          1 Algeria
# 5      Albania and Afghanistan          1 Albania
# 6                  in the data          1    <NA>
# 7          and return only the          0    <NA>
# 8          first match in each          0    <NA>
# 9  string, Algeria and Albania          1 Algeria
# 10             not Afghanistan          0    <NA>
Here, paste0("(?i)\\b(", paste(country_list, collapse="|"), ")\\b") creates a pattern made up of:
(?i) - case-insensitive matching
\b - a word boundary
( - start of a capturing group
paste(country_list, collapse="|") - produces a |-separated list of country names, like Albania|Poland|France, etc.
) - end of the group
\b - a word boundary.
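To see what the assembled pattern looks like, here is a minimal sketch with a made-up three-name vector (`countries` is a hypothetical stand-in for country_list, and the test strings are invented for illustration). It assumes the stringr package is installed.

```r
library(stringr)

# Hypothetical three-name list standing in for country_list
countries <- c("Albania", "Poland", "France")

# Join the names with "|" and wrap them in a case-insensitive, word-bounded group
pattern <- paste0("(?i)\\b(", paste(countries, collapse = "|"), ")\\b")
print(pattern)

# str_extract() returns only the first (leftmost) match in each string,
# or NA when nothing matches
result <- str_extract(c("from Poland to France", "no match here", "ALBANIA"), pattern)
print(result)
# "Poland" NA "ALBANIA"
```

Because of (?i), the match is returned as it is written in the text ("ALBANIA"), not as it is spelled in the list.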
The df <- within(df, Match[ofInterest == '0'] <- NA) line then sets Match back to NA in every row where the ofInterest column is 0.
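Since the question asks to skip rows where ofInterest is 0 for speed, a possible variant runs the search only on the flagged rows instead of blanking them afterwards. This is a sketch under assumptions, not part of the answer above: the `countries` vector and the small data frame are hypothetical stand-ins, names are sorted longest first so that "Guinea-Bissau" is tried before "Guinea", and each name is wrapped in \Q...\E so that characters such as "." or "(" in real country names are treated literally.

```r
library(stringr)

# Hypothetical stand-in for country_list; the real list comes from countrycode
countries <- c("Afghanistan", "Albania", "Algeria", "Guinea", "Guinea-Bissau")

df <- data.frame(
  Text = c("names like Algeria", "not Afghanistan",
           "visit Guinea-Bissau soon", "in the data"),
  ofInterest = c(1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Put longer names first so "Guinea-Bissau" is tried before "Guinea",
# and quote each name with \Q...\E so regex metacharacters stay literal
ordered <- countries[order(nchar(countries), decreasing = TRUE)]
pattern <- paste0("(?i)\\b(", paste0("\\Q", ordered, "\\E", collapse = "|"), ")\\b")

# Search only the rows flagged as of interest; every other row stays NA
df$Match <- NA_character_
rows <- df$ofInterest == 1
df$Match[rows] <- str_extract(df$Text[rows], pattern)

print(df)
```

With this layout the regex never touches the unflagged rows, which matters more as the data frame grows.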
