R splitting string values containing columns and simultaneously duplicating other columns while allo-CodePudding

This question relates to an earlier and simpler question here. I have a complex dataframe, the result of a join between two separate dataframes, comprising three columns containing alphanumerical values, ID numbers and area codes.

df <- data.frame(column1 = c("ab 34, 35, 36", "cb 23", "df 45, 46", "gh 21"),
                 column2 = c("ID_27", "ID_28", "ID_29", "ID_30"),
                 column3 = c("area51", "area52", "area53", "area54"))

df_join <- data.frame(column1 = c("ab 77, cb 35", "df 23, gh 73", "ij 45, kl 46, mn 21"),
                 column2 = c("ID_27", "ID_28", "ID_29"),
                 column3 = c("area51", "area52", "area53"))

df_joined <- rbind(df, df_join)

df_joined

             column1 column2 column3
1       ab 34, 35, 36   ID_27  area51
2               cb 23   ID_28  area52
3           df 45, 46   ID_29  area53
4               gh 21   ID_30  area54
5        ab 77, cb 35   ID_27  area51
6        df 23, gh 73   ID_28  area52
7 ij 45, kl 46, mn 21   ID_29  area53

What I would like to do is clean up the first column by removing the commas and the spaces, so that I am left with neatly uniform 4 character values while simultaneously duplicating the values in the other columns. You will also notice that the two letter prefix changes for rows 5-7, the R code has to allow for this. This is the example of what I'm after:

new_df
# A tibble: 14 x 3
   column1 column2 column3
   <chr>   <chr>   <chr>  
 1 ab34    ID_27   area51 
 2 ab35    ID_27   area51 
 3 ab36    ID_27   area51 
 4 ab77    ID_27   area51 
 5 cb23    ID_28   area52 
 6 cb35    ID_27   area51 
 7 df23    ID_28   area52 
 8 df45    ID_29   area53 
 9 df46    ID_29   area53 
10 gh21    ID_30   area54 
11 gh73    ID_28   area52 
12 ij45    ID_29   area53 
13 kl46    ID_29   area53 
14 mn21    ID_29   area53

The end result doesn't have to be a tibble, and I am working on transforming large dataframes so the code has to be applicable in a general sense. Any ideas? Tidyverse or older methods welcome?

CodePudding user response：

Fun little problem. I've assumed the initial column1 only ever has 3 pieces. Though, this can be extended by continuing the (c1, c2, c3) if not.

# Produce the initial cleaned tibble
df_joined <- df_joined %>% 
  mutate(column1 = str_remove_all(df_joined$column1, " ")) %>% 
  separate(column1, c("c1", "c2", "c3"), sep = ",") %>% # Increase if more than 3 parts in column1
  pivot_longer(-c(column2, column3), values_to = "column1") %>% 
  select(-name) %>% 
  tidyr::drop_na() %>% 
  dplyr::relocate(column1)

# Reproduce the characters for column1, where missing
df_joined <- df_joined %>% 
  mutate(prefix = str_sub(column1, 1, 2)) %>% 
  mutate(prefix = ifelse(str_detect(prefix, "[[:lower:]]") == TRUE, prefix, lag(prefix, 1))) %>%
  mutate(prefix = ifelse(str_detect(prefix, "[[:lower:]]") == TRUE, prefix, lag(prefix, 1))) %>% # iteration for multiple missing - Extend if some of the prefixes are missing
  mutate(column1 = str_remove_all(column1, "[[:lower:]]")) %>% 
  unite(col = "column1", c(prefix, column1), sep = "")

CodePudding user response：

Here are two approaches. The key function is mystrsplit. It first captures characters and digits separately in two columns prefix and digit, then fills down all the NA prefixes using last-observation-carried-forward, and last paste prefix and digit together. After mystrsplit, we use standard data.table operations to bind all rows together.

library(data.table)

mystrsplit <- function(x) {
  x <- stringr::str_match_all(x, "(?<prefix>[a-z] )? (?<digit>\\d )")
  lapply(x, function(s) paste0(tidyr:::fillDown(s[, "prefix"]), s[, "digit"]))
}

setDT(df_joined)[, c(
  list(column1 = unlist(x <- mystrsplit(column1))), 
  lapply(.SD, rep, times = lengths(x))
), .SDcols = -"column1"]

Output

    column1 column2 column3
 1:    ab34   ID_27  area51
 2:    ab35   ID_27  area51
 3:    ab36   ID_27  area51
 4:    cb23   ID_28  area52
 5:    df45   ID_29  area53
 6:    df46   ID_29  area53
 7:    gh21   ID_30  area54
 8:    ab77   ID_27  area51
 9:    cb35   ID_27  area51
10:    df23   ID_28  area52
11:    gh73   ID_28  area52
12:    ij45   ID_29  area53
13:    kl46   ID_29  area53
14:    mn21   ID_29  area53

Or in a tidyverse way

library(dplyr)
library(tidyr)

mystrsplit <- function(x) {
  x <- stringr::str_match_all(x, "(?<prefix>[a-z] )? (?<digit>\\d )")
  lapply(x, function(s) paste0(tidyr:::fillDown(s[, "prefix"]), s[, "digit"]))
}

df_joined %>% mutate(column1 = mystrsplit(column1)) %>% unnest(column1)

Output

# A tibble: 14 x 3
   column1 column2 column3
   <chr>   <chr>   <chr>  
 1 ab34    ID_27   area51 
 2 ab35    ID_27   area51 
 3 ab36    ID_27   area51 
 4 cb23    ID_28   area52 
 5 df45    ID_29   area53 
 6 df46    ID_29   area53 
 7 gh21    ID_30   area54 
 8 ab77    ID_27   area51 
 9 cb35    ID_27   area51 
10 df23    ID_28   area52 
11 gh73    ID_28   area52 
12 ij45    ID_29   area53 
13 kl46    ID_29   area53 
14 mn21    ID_29   area53