I'm still new to coding and often can't work out better functions or approaches for some tasks on my own.
I have a question about tracking the original string after using str_extract_all with a specific pattern.
Here is an example data frame called "fruit".
| index | Fruit |
|---|---|
| 1 | apple |
| 2 | banana |
| 3 | strawberry |
| 4 | pineapple |
| 5 | bell pepper |
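For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: the object name `fruit` and the column names `index` and `Fruit` are taken from the table, and the column types are an assumption.

```r
# Assumed reconstruction of the example data shown in the table above
fruit <- data.frame(
  index = 1:5,
  Fruit = c("apple", "banana", "strawberry", "pineapple", "bell pepper"),
  stringsAsFactors = FALSE  # keep Fruit as character rather than factor
)
```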
I used str_extract_all(fruit, "(.)\\1") to extract doubled consonants and got "pp", "rr", "pp", "ll", "pp".
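As a side note on the pattern itself: `(.)` captures any single character and `\\1` requires that same character again, so it matches any doubled character, not only consonants. The sketch below compares it with a stricter character class; the test strings and the consonant class are illustrative assumptions, not part of the original data.

```r
library(stringr)

# Hypothetical test strings chosen to include doubled vowels
x <- c("bookkeeper", "apple")

# Any character followed by itself
any_doubled <- str_extract_all(x, "(.)\\1")
# any_doubled[[1]] is c("oo", "kk", "ee")

# Assumed stricter variant: only lowercase ASCII consonants can be captured
consonant_doubled <- str_extract_all(x, "([b-df-hj-np-tv-z])\\1")
# consonant_doubled[[1]] is "kk"
```

For the fruit data both patterns give the same result, because none of the names contains a doubled vowel.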
I also tracked the original strings (of those extracted results) with str_subset(fruit, "(.)\\1"). Here's what I got.
| index | fruit |
|---|---|
| 1 | apple |
| 2 | strawberry |
| 3 | pineapple |
| 4 | bell pepper |
However, I want to know which string each extracted result comes from. str_subset cannot show this when several results come from the same string. The following data frame is what I expect to get.
| index | fruit | pattern |
|---|---|---|
| 1 | apple | pp |
| 2 | strawberry | rr |
| 3 | pineapple | pp |
| 4 | bell pepper | ll |
| 4 | bell pepper | pp |
I'm not sure whether I have explained my question clearly. Any feedback and ideas would be appreciated.
CodePudding user response:
Your code already does what you want. You just need to create an extra column to store the output of str_extract_all and then unnest it, like the following:
library(tidyverse)
fruit %>%
  mutate(pattern = str_extract_all(Fruit, "(.)\\1")) %>%
  unnest(pattern)
# A tibble: 5 × 3
#   index Fruit       pattern
#   <int> <chr>       <chr>
# 1     1 apple       pp
# 2     3 strawberry  rr
# 3     4 pineapple   pp
# 4     5 bell pepper ll
# 5     5 bell pepper pp
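Note that "banana" disappears from the output above because unnest drops rows whose list element is empty. If those rows should be kept, one option is the `keep_empty` argument of `tidyr::unnest`. The following is a sketch under the assumption that `fruit` is a tibble with the columns shown in the question; the name `all_rows` is made up for illustration.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Assumed reconstruction of the question's data
fruit <- tibble(
  index = 1:5,
  Fruit = c("apple", "banana", "strawberry", "pineapple", "bell pepper")
)

# keep_empty = TRUE keeps strings without a match and fills pattern with NA
all_rows <- fruit %>%
  mutate(pattern = str_extract_all(Fruit, "(.)\\1")) %>%
  unnest(pattern, keep_empty = TRUE)
```

With this variant "banana" stays in the result with a missing pattern, which can be handy when you later need to count strings that had no match.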
CodePudding user response:
You may simply remove all rows where the pattern column contains NA after extracting:
library(stringr)
df$pattern <- stringr::str_extract(df$xfruit, "(.)\\1")
df <- df[!is.na(df$pattern),]
str_extract(df$xfruit, "(.)\\1") extracts the first repeated character pair in each string into the pattern column (NA where there is none). Then, df[!is.na(df$pattern),] removes the rows where the pattern is NA.
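Because str_extract returns only the first match per string, a string such as "bell pepper" would give "ll" but not "pp". If every match is needed without tidyr, one possible sketch is to repeat each row once per match. The data frame `df` and its column `Fruit` here are assumptions based on the question's table, and `matches`, `n_matches`, and `result` are illustrative names.

```r
library(stringr)

# Assumed data, mirroring the table in the question
df <- data.frame(
  index = 1:5,
  Fruit = c("apple", "banana", "strawberry", "pineapple", "bell pepper"),
  stringsAsFactors = FALSE
)

# One character vector of matches per string (empty when there is no match)
matches <- str_extract_all(df$Fruit, "(.)\\1")
n_matches <- lengths(matches)

# Repeat each row once per match; strings without a match are repeated zero times
result <- data.frame(
  index = rep(df$index, times = n_matches),
  Fruit = rep(df$Fruit, times = n_matches),
  pattern = unlist(matches),
  stringsAsFactors = FALSE
)
```

Here `result` has five rows, with "bell pepper" appearing twice (once for "ll" and once for "pp").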
