Extract string up to a different word in each row

I have a data frame with a column of words. I also have a separate vector of strings (not the same length as the data frame). I'd like to add a new column that, for each word, finds a string containing that word and keeps only the part of the string up to and including it.

For example, I have this table:

words
apple
plant
banana
animal
fly
ecoli

and these strings of words:

stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana", "eukaryote;animal;dog", "eukaryote;plant;orange", "eukaryote;animal;cat", "eukaryote;insect;fly", "prokaryote;bacterium;ecoli")

and I'd like to get this:

words	new_words
apple	eukaryote;plant;apple
plant	eukaryote;plant
banana	eukaryote;plant;banana
animal	eukaryote;animal
fly	eukaryote;insect;fly
ecoli	prokaryote;bacterium;ecoli

I've tried something along the lines of:

df$words <- c("apple", "plant", "banana", "animal", "fly", "ecoli")
df$new_words<- sub(df$words, "", stringlist)

Answer:

Loop over the 'words' column. For each word, get the first matching 'stringlist' value with grep, then use sub to capture everything up to and including the word and replace the whole string with the backreference (\\1) to that captured group.

# For each word: take the first string containing it, keep the text up to and including the word
df$new_words <- sapply(df$words, function(x)
    sub(sprintf("(.*%s).*", x), "\\1",
        grep(x, stringlist, value = TRUE)[1]))

-output

> df
   words                  new_words
1  apple      eukaryote;plant;apple
2  plant            eukaryote;plant
3 banana     eukaryote;plant;banana
4 animal           eukaryote;animal
5    fly       eukaryote;insect;fly
6  ecoli prokaryote;bacterium;ecoli
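
The grep/sub approach above treats each word as a regular expression and matches it anywhere in the string, so a word like "fly" would also match inside a longer token. If that matters, one possible variant is to split each string on ";" and compare whole tokens instead. This is only a sketch under that assumption, not the method used above: prefix_up_to is a hypothetical helper name, and the example data is repeated so the snippet runs on its own.

```r
# Self-contained copy of the example data
df <- data.frame(words = c("apple", "plant", "banana", "animal", "fly", "ecoli"),
                 stringsAsFactors = FALSE)
stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
                "eukaryote;animal;dog", "eukaryote;plant;orange",
                "eukaryote;animal;cat", "eukaryote;insect;fly",
                "prokaryote;bacterium;ecoli")

# Hypothetical helper (not part of base R): return the prefix of the first
# string whose ";"-separated tokens contain `word` exactly, or NA if none do.
prefix_up_to <- function(word, strings, sep = ";") {
  parts <- strsplit(strings, sep, fixed = TRUE)
  for (p in parts) {
    pos <- match(word, p)  # position of the first exact token match, or NA
    if (!is.na(pos)) {
      return(paste(p[seq_len(pos)], collapse = sep))
    }
  }
  NA_character_
}

df$new_words <- vapply(df$words, prefix_up_to, character(1),
                       strings = stringlist, USE.NAMES = FALSE)
```

With the example data this should give the same new_words column as shown above. Words that appear in no string come back as NA.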

data

df <- structure(list(words = c("apple", "plant", "banana", "animal", 
"fly", "ecoli")), class = "data.frame", row.names = c(NA, -6L
))

stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana", 
"eukaryote;animal;dog", 
"eukaryote;plant;orange", "eukaryote;animal;cat", "eukaryote;insect;fly", 
"prokaryote;bacterium;ecoli")