Creating new column by splitting a `chr` column, finding unique values, sorting them, removing certa-CodePudding

I'm working in R, using tidyverse and dplyr functions to generate new columns, but I'm running into a wall when trying to find unique values within a string column. Here's a detailed description of the problem.

Setup

Suppose I have a tibble called df, with a chr column called col1 that contains strings. These strings are, in reality, a list of values that are separated by a comma (", "). Here's what df looks like:

library(tidyverse)
library(dplyr)

df = data.frame(id=c(1,2,3,4,5),
                col1=c("a, b, x, a","b, b","c, b, b, b", "b, x, b, c", "c")) %>%
  as_tibble()

print(df)

# # A tibble: 5 x 2    
#      id col1         
#   <dbl> <chr>        
# 1     1 a, b, x, a      
# 2     2 b, b         
# 3     3 c, b, b, b
# 4     4 b, x, b, c      
# 5     5 c

The problem

I want to separate the values in col1 wherever we find ", ", remove any duplicate values, sort the unique values, remove the "x" values, then concatenate them back together into a string using ", " as a delimiter between the multiple unique items.

In more practical terms, I would like to create a column as seen below in col2:

# # A tibble: 5 x 3            
#      id col1          col2   
#   <dbl> <chr>         <chr>  
# 1     1 a, b, x, a    a, b   
# 2     2 b, b          b      
# 3     3 c, b, b, b    b, c   
# 4     4 b, x, b, c    b, c
# 5     5 c             c

My attempt so far

If I just have a string variable, I know that I can do all of the processing in a couple of steps:

x = "b, x, b, c"
x_temp = unique(strsplit(x, ", ")[[1]])
x_simp = paste(sort(x_temp[x_temp != "x"]), collapse=", ")
print(x_simp)
# [1] "b, c"

However, I'm having a hard time translating this process back into the mutate function:

newdf = df %>% 
  mutate(col2 = paste(sort(unique(strsplit(col1, ", ")[[1]])[unique(strsplit(col1, ", ")[[1]]) != "x"]), collapse=", "))

# A tibble: 5 x 3
#    id col1              col2 
# <dbl> <chr>             <chr>
#   1   1 a, b, x, a      a, b 
#   2   2 b, b            a, b 
#   3   3 c, b, b, b      a, b 
#   4   4 b, x, b, c      a, b 
#   5   5 c               a, b

Summary

How can I use tidyverse/dplyr functions to generate a new column that is the result of the following processing steps on one of a tibble/df's columns:

split a string/character column using a custom delimiter,
find the unique values,
sort them,
remove unwanted values,
combine them back into a single string/character value using a custom delimiter
place the result in a new column

CodePudding user response：

We may do this in tidyverse, by splitting with separate_rows, and then do a group by paste after removing the duplicates

library(dplyr)
library(tidyr)
library(stringr)
df %>%
   mutate(col2 = col1) %>% 
   separate_rows(col2) %>%
   distinct(across(everything())) %>% 
   group_by(id, col1) %>% 
   summarise(col2 = str_c(sort(col2[col2 != "x"]), collapse = ", "),
       .groups = 'drop')

-output

# A tibble: 5 × 3
     id col1       col2 
  <dbl> <chr>      <chr>
1     1 a, b, x, a a, b 
2     2 b, b       b    
3     3 c, b, b, b b, c 
4     4 b, x, b, c b, c 
5     5 c          c

CodePudding user response：

I'm not sure how efficient this is, but I just figured out I can use the mapply function to apply a custom-built function on all rows of the input tibble as follows:

myfunc = function(in_str){
  temp = unique(strsplit(in_str, ", ")[[1]])
  simp = paste(sort(temp[temp != "x"]), collapse=", ")
  return(simp)
}

newdf2 = df %>% 
  mutate(col2 = mapply(myfunc, col1))

print(newdf2)
# # A tibble: 5 x 3
#      id    col1          col2 
#   <dbl>    <chr>         <chr>
# 1     1    a, b, x, a    a, b 
# 2     2    b, b          b    
# 3     3    c, b, b, b    b, c 
# 4     4    b, x, b, c    b, c 
# 5     5    c             c