Home > Back-end >  Is there a way to automatically drop delimiters when using a tidyr separate with a regex positive lo
Is there a way to automatically drop delimiters when using a tidyr separate with a regex positive lo

Time:01-09

I'm trying to separate a messy free text field in which full names have been split by multiple delimiters. I've been able to split these into first names and surnames using the separate() function. However, the positive lookahead regex I've used leaves the delimitor in the surname. Is there an efficient way this could be automatically removed?

Example:

library(tibble)
library(tidyr)

mydf <- tribble(
  ~Name,
  "Ben,Smith",
  "Amy:White",
  "Mark|Williams"
)

mydf_split <- mydf %>% separate(Name, c("first_name","surname"), sep = '(?=,|\\:|\\|)')

This returns the following dataframe.

    first_name  surname
1   Ben         ,Smith
2   Amy         :White
3   Mark        |Williams

I would like to remove the , :, and | from the beginning of the surnames.

CodePudding user response:

You need to use a character class:

mydf %>% separate(Name, c("first_name","surname"), sep = '[,:|]')
# # A tibble: 3 x 2
#   first_name surname 
#   <chr>      <chr>   
# 1 Ben        Smith   
# 2 Amy        White   
# 3 Mark       Williams

The (?=,|\:|\|) regex is a positive lookahead that is a non-consuming pattern matching a location immediately followed with ,, : or | chars. Non-consuming means the chars matched with this pattern remain in the resulting chunks after splitting. You need to remove them after splitting, so you need to consume them, and here, a character class without a lookahead is the most natural regex solution.

CodePudding user response:

The default separator of [^[:alnum:]] works with the data in the question

mydf %>% separate(Name, c("first_name", "surname"))

giving:

# A tibble: 3 x 2
  first_name surname 
  <chr>      <chr>   
1 Ben        Smith   
2 Amy        White   
3 Mark       Williams
  •  Tags:  
  • Related