Find the original strings from the results of str_extract

I'm still new to coding and often can't figure out better functions or solutions for some tasks by myself. I have a question about tracking the original string after using str_extract_all with a specific pattern.

Here is an example data frame called "fruit".

index	Fruit
1	apple
2	banana
3	strawberry
4	pineapple
5	bell pepper

I used str_extract_all(fruit, "(.)\\1") to extract repeated consonants (the pattern matches any doubled character), and got "pp", "rr", "pp", "ll", "pp".

I also tracked the original strings (the ones that produced those extracted results) with str_subset(fruit, "(.)\\1"). Here's what I got:

index	fruit
1	apple
2	strawberry
3	pineapple
4	bell pepper

However, I want to know which string each extracted result came from. str_subset returns every matching string only once, so it cannot represent several results that come from the same string. The following data frame is what I expect to get.

index	fruit	pattern
1	apple	pp
2	strawberry	rr
3	pineapple	pp
4	bell pepper	ll
4	bell pepper	pp

I'm not sure if I explained my question clearly. Any feedback and ideas will be appreciated.

CodePudding user response：

Your code already does what you want. You just need to store the output of str_extract_all in an extra column and then unnest it, so that each match gets its own row:

library(tidyverse)

fruit %>% mutate(pattern = str_extract_all(Fruit, "(.)\\1")) %>% unnest(pattern)

# A tibble: 5 × 3
  index Fruit       pattern
  <int> <chr>       <chr>  
1     1 apple       pp     
2     3 strawberry  rr     
3     4 pineapple   pp     
4     5 bell pepper ll     
5     5 bell pepper pp
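
For anyone who wants to run this end to end, here is a minimal sketch. The `fruit` tibble is rebuilt by hand from the table in the question, so its exact column types are an assumption, and `result` is just an illustrative name. It relies on `unnest()` dropping rows whose list element is empty by default, which is why banana disappears.

```r
library(tibble)
library(dplyr)
library(stringr)
library(tidyr)

# Assumed reconstruction of the asker's data (not taken from their session)
fruit <- tibble(
  index = 1:5,
  Fruit = c("apple", "banana", "strawberry", "pineapple", "bell pepper")
)

result <- fruit %>%
  # str_extract_all returns a list: one character vector of matches per row
  mutate(pattern = str_extract_all(Fruit, "(.)\\1")) %>%
  # one output row per match; rows with zero matches are dropped by default
  unnest(pattern)

print(result)
```

If the rows without a match should be kept (with NA in pattern), `unnest(pattern, keep_empty = TRUE)` should do that instead.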

CodePudding user response：

You may simply remove all rows where the pattern column contains NA after extracting:

library(stringr)
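df$pattern <- stringr::str_extract(df$xfruit, "(.)\\1")
df <- df[!is.na(df$pattern),]

str_extract(df$xfruit, "(.)\\1") extracts the first repeated character pair from each string into the pattern column, giving NA where there is no match. Then, df[!is.na(df$pattern),] removes the rows where the pattern is NA. Note that str_extract returns only one match per string, so a string with two doubled letters still produces a single row.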
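
A small sketch of how the two stringr functions behave on a string with more than one match may help when choosing between the answers. The variable names (`x`, `first_only`, `all_matches`) are only illustrative.

```r
library(stringr)

x <- "bell pepper"

# str_extract: a character vector with the first match only
first_only <- str_extract(x, "(.)\\1")

# str_extract_all: a list with one element per input string;
# [[1]] pulls out every match found in x
all_matches <- str_extract_all(x, "(.)\\1")[[1]]

print(first_only)
print(all_matches)
```

So the NA-filtering approach keeps one row per original string, while the str_extract_all approach is the one that can give a row for each match.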