I have a data frame column containing strings, each of which may include several spaces. I want to use separate from tidyr (or something similar) to split at the space that follows the first occurrence of a keyword (i.e., one of the values in fruit_key in the sample data), so that the one column is separated into two columns.
Sample Data
df <- structure(list(fruit = c("Apple Orange Pineapple", "Plum Good Watermelon",
"Plum Good Kiwi", "Plum Good Plum Good", "Cantaloupe Melon", "Blueberry Blackberry Cobbler",
"Peach Pie Apple Pie")), class = "data.frame", row.names = c(NA,
-7L))
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")
Expected Output
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
I can get the part after the keyword into the correct column (i.e., Tasty) with separate, but I cannot get the keyword itself returned in the other column (i.e., Delicious). I tried altering the regular expression in several ways, but never got the correct output.
library(tidyr)
separate(df, fruit,
         c("Delicious", "Tasty"),
         sep = paste(fruit_key, collapse = "|"),
         extra = "merge",
         remove = FALSE
)
#                         fruit Delicious               Tasty
#1       Apple Orange Pineapple              Orange Pineapple
#2         Plum Good Watermelon                    Watermelon
#3               Plum Good Kiwi                          Kiwi
#4          Plum Good Plum Good                     Plum Good
#5             Cantaloupe Melon                         Melon
#6 Blueberry Blackberry Cobbler            Blackberry Cobbler
#7          Peach Pie Apple Pie                     Apple Pie
I know that I could use str_extract and str_remove (as below), but I want to use something like separate to do it in one function/step.
library(tidyverse)
df %>%
  mutate(Delicious = str_extract(fruit, paste(fruit_key, collapse = "|")),
         Tasty = str_remove(fruit, paste(fruit_key, collapse = "|")))
CodePudding user response:
Here's a tidy solution with tidyr's function extract:
library(tidyr)
df %>%
  extract(fruit,
          into = c("Delicious", "Tasty"),
          regex = paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)"),
          remove = FALSE)
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
In extract's regex argument, we collapse fruit_key into an alternation pattern and wrap it in parentheses so that it is treated as a capturing group. The second capturing group is simply whatever follows the whitespace.
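To make the pattern construction concrete, here is a minimal sketch that builds the same regex string and checks its two capture groups on one sample value. It uses base R's regexec and regmatches (with perl = TRUE) rather than extract itself, and the names pattern, x, m, delicious and tasty are only illustrative. It also assumes the keys contain no regex metacharacters; keys that do would need escaping first.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# Build the pattern: (key1|key2|...) followed by one whitespace, then (.*)
pattern <- paste0("(", paste0(fruit_key, collapse = "|"), ")\\s(.*)")
print(pattern)

# Inspect the capture groups on a single sample string
x <- "Peach Pie Apple Pie"
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

delicious <- m[2]  # first capture group: the keyword
tasty     <- m[3]  # second capture group: everything after the space
print(c(Delicious = delicious, Tasty = tasty))
```

The match is found at the start of the string, so "Peach Pie" is captured as the keyword even though "Apple" also appears later in the value.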
CodePudding user response:
If we need to use separate with sep, then create a regex lookbehind, "(?<=<fruit_key>) ", i.e., split at the space that follows the fruit_key word. Because sep is not vectorized, collapse the per-key patterns into a single string with | (using str_c).
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  separate(fruit, into = c("Delicious", "Tasty"),
           sep = str_c(sprintf("(?<=%s) ", fruit_key), collapse = "|"),
           extra = "merge", remove = FALSE)
-output
                         fruit  Delicious              Tasty
1       Apple Orange Pineapple      Apple   Orange Pineapple
2         Plum Good Watermelon  Plum Good         Watermelon
3               Plum Good Kiwi  Plum Good               Kiwi
4          Plum Good Plum Good  Plum Good          Plum Good
5             Cantaloupe Melon Cantaloupe              Melon
6 Blueberry Blackberry Cobbler  Blueberry Blackberry Cobbler
7          Peach Pie Apple Pie  Peach Pie          Apple Pie
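As a rough illustration of what the lookbehind separator matches, the sketch below builds the same pattern string and splits one sample value at the first match only, which approximates what extra = "merge" does with two target columns. It relies on base R (paste in place of str_c, regexpr with perl = TRUE) instead of separate, and sep_pattern, x, pos, delicious and tasty are hypothetical names used for the example.

```r
fruit_key <- c("Apple", "Plum Good", "Cantaloupe", "Blueberry", "Peach Pie")

# One lookbehind per key, joined with | into a single separator pattern
sep_pattern <- paste(sprintf("(?<=%s) ", fruit_key), collapse = "|")
print(sep_pattern)

# Locate the first space that is immediately preceded by a key
x   <- "Plum Good Plum Good"
pos <- regexpr(sep_pattern, x, perl = TRUE)
start <- as.integer(pos)
len   <- attr(pos, "match.length")

# Everything before the matched space, and everything after it
delicious <- substr(x, 1, start - 1)
tasty     <- substr(x, start + len, nchar(x))
print(c(Delicious = delicious, Tasty = tasty))
```

Because the lookbehind is zero-width, only the space is consumed by the split, so the keyword stays in the first piece. The space inside "Plum Good" is skipped because it is preceded by "Plum" alone, which is not one of the keys.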
