I have a dataframe with a column containing various words. I also have a separate list of strings (not the same length as the df), and I'd like to create a new column in the dataframe which matches the strings to the words in the column, but only keep the part of the string up to that word.
For example, I have this table:
| words | |
|---|---|
| apple | |
| plant | |
| banana | |
| animal | |
| fly | |
| ecoli | |
and these strings of words:
stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana", "eukaryote;animal;dog", "eukaryote;plant;orange", "eukaryote;animal;cat", "eukaryote;insect;fly", "prokaryote;bacterium;ecoli")
and I'd like to get this:
| words | new_words |
|---|---|
| apple | eukaryote;plant;apple |
| plant | eukaryote;plant |
| banana | eukaryote;plant;banana |
| animal | eukaryote;animal |
| fly | eukaryote;insect;fly |
| ecoli | prokaryote;bacterium;ecoli |
I've tried something along the lines of:
df$words <- c("apple", "plant", "banana", "animal", "fly", "ecoli")
df$new_words<- sub(df$words, "", stringlist)
Answer:
Loop over the 'words' column and, for each word, get the first matching 'stringlist' value with grep (value = TRUE, then [1]). Then use sub to capture everything up to and including the word, and replace the whole string with the backreference (\\1) to that captured group:
df$new_words <- sapply(df$words, function(x)
  sub(sprintf("(.*%s).*", x), "\\1",
      grep(x, stringlist, value = TRUE)[1]))
-output
> df
   words                  new_words
1  apple      eukaryote;plant;apple
2  plant            eukaryote;plant
3 banana     eukaryote;plant;banana
4 animal           eukaryote;animal
5    fly       eukaryote;insect;fly
6  ecoli prokaryote;bacterium;ecoli
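To see what each piece of the one-liner does, the same steps can be run separately for a single word. This is only an illustrative breakdown, not a different method; the variable names `word`, `hit`, `pattern` and `result` are chosen here for the example.

```r
stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
                "eukaryote;animal;dog", "eukaryote;plant;orange",
                "eukaryote;animal;cat", "eukaryote;insect;fly",
                "prokaryote;bacterium;ecoli")

word <- "plant"

# grep(value = TRUE) returns every string containing the word;
# [1] keeps only the first of them
hit <- grep(word, stringlist, value = TRUE)[1]

# Build the regex "(.*plant).*": the group captures everything up to and
# including the word, and the trailing .* consumes the rest of the string
pattern <- sprintf("(.*%s).*", word)

# Replace the whole string with just the captured group
result <- sub(pattern, "\\1", hit)
print(result)
```

Because only the first match is used, the result depends on the order of 'stringlist' whenever a word appears in several strings with different prefixes.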
data
df <- structure(list(words = c("apple", "plant", "banana", "animal",
"fly", "ecoli")), class = "data.frame", row.names = c(NA, -6L
))
stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
"eukaryote;animal;dog",
"eukaryote;plant;orange", "eukaryote;animal;cat", "eukaryote;insect;fly",
"prokaryote;bacterium;ecoli")
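One caveat that the question does not raise but that may matter for other data: the approach above treats each word as a regular expression, so a word can match inside a longer token (for instance, "ant" would match "plant"), and a word with no match yields NA. A possible stricter variant, sketched here under the assumption that the strings are always ";"-separated, compares whole tokens instead. `truncate_at_word` is a hypothetical helper written for this example, not an existing function.

```r
# Hypothetical helper: return the first string, cut off after the token that
# equals `word` exactly, or NA if no string contains such a token
truncate_at_word <- function(word, strings, sep = ";") {
  for (s in strings) {
    tokens <- strsplit(s, sep, fixed = TRUE)[[1]]
    pos <- match(word, tokens)  # position of the first exact token match
    if (!is.na(pos)) {
      return(paste(tokens[seq_len(pos)], collapse = sep))
    }
  }
  NA_character_
}

df <- data.frame(words = c("apple", "plant", "banana", "animal", "fly", "ecoli"),
                 stringsAsFactors = FALSE)
stringlist <- c("eukaryote;plant;apple", "eukaryote;plant;banana",
                "eukaryote;animal;dog", "eukaryote;plant;orange",
                "eukaryote;animal;cat", "eukaryote;insect;fly",
                "prokaryote;bacterium;ecoli")

# vapply guarantees one character value per word
df$new_words <- vapply(df$words, truncate_at_word, character(1),
                       strings = stringlist, USE.NAMES = FALSE)
print(df)
```

For the example data this gives the same 'new_words' column as the sub/grep version; the difference only shows up with partial matches or words containing regex metacharacters.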
