This question relates to an earlier and simpler question here. I have a complex dataframe, the result of a join between two separate dataframes, comprising three columns containing alphanumerical values, ID numbers and area codes.
df <- data.frame(column1 = c("ab 34, 35, 36", "cb 23", "df 45, 46", "gh 21"),
column2 = c("ID_27", "ID_28", "ID_29", "ID_30"),
column3 = c("area51", "area52", "area53", "area54"))
df_join <- data.frame(column1 = c("ab 77, cb 35", "df 23, gh 73", "ij 45, kl 46, mn 21"),
column2 = c("ID_27", "ID_28", "ID_29"),
column3 = c("area51", "area52", "area53"))
df_joined <- rbind(df, df_join)
df_joined
column1 column2 column3
1 ab 34, 35, 36 ID_27 area51
2 cb 23 ID_28 area52
3 df 45, 46 ID_29 area53
4 gh 21 ID_30 area54
5 ab 77, cb 35 ID_27 area51
6 df 23, gh 73 ID_28 area52
7 ij 45, kl 46, mn 21 ID_29 area53
What I would like to do is clean up the first column by removing the commas and the spaces, so that I am left with neatly uniform 4 character values while simultaneously duplicating the values in the other columns. You will also notice that the two letter prefix changes for rows 5-7, the R code has to allow for this. This is the example of what I'm after:
new_df
# A tibble: 14 x 3
column1 column2 column3
<chr> <chr> <chr>
1 ab34 ID_27 area51
2 ab35 ID_27 area51
3 ab36 ID_27 area51
4 ab77 ID_27 area51
5 cb23 ID_28 area52
6 cb35 ID_27 area51
7 df23 ID_28 area52
8 df45 ID_29 area53
9 df46 ID_29 area53
10 gh21 ID_30 area54
11 gh73 ID_28 area52
12 ij45 ID_29 area53
13 kl46 ID_29 area53
14 mn21 ID_29 area53
The end result doesn't have to be a tibble, and I am working on transforming large dataframes so the code has to be applicable in a general sense. Any ideas? Tidyverse or older methods welcome?
CodePudding user response:
Fun little problem. I've assumed the initial column1 only ever has 3 pieces. Though, this can be extended by continuing the (c1, c2, c3) if not.
# Produce the initial cleaned tibble
df_joined <- df_joined %>%
mutate(column1 = str_remove_all(df_joined$column1, " ")) %>%
separate(column1, c("c1", "c2", "c3"), sep = ",") %>% # Increase if more than 3 parts in column1
pivot_longer(-c(column2, column3), values_to = "column1") %>%
select(-name) %>%
tidyr::drop_na() %>%
dplyr::relocate(column1)
# Reproduce the characters for column1, where missing
df_joined <- df_joined %>%
mutate(prefix = str_sub(column1, 1, 2)) %>%
mutate(prefix = ifelse(str_detect(prefix, "[[:lower:]]") == TRUE, prefix, lag(prefix, 1))) %>%
mutate(prefix = ifelse(str_detect(prefix, "[[:lower:]]") == TRUE, prefix, lag(prefix, 1))) %>% # iteration for multiple missing - Extend if some of the prefixes are missing
mutate(column1 = str_remove_all(column1, "[[:lower:]]")) %>%
unite(col = "column1", c(prefix, column1), sep = "")
CodePudding user response:
Here are two approaches. The key function is mystrsplit. It first captures characters and digits separately in two columns prefix and digit, then fills down all the NA prefixes using last-observation-carried-forward, and last paste prefix and digit together. After mystrsplit, we use standard data.table operations to bind all rows together.
library(data.table)
mystrsplit <- function(x) {
x <- stringr::str_match_all(x, "(?<prefix>[a-z] )? (?<digit>\\d )")
lapply(x, function(s) paste0(tidyr:::fillDown(s[, "prefix"]), s[, "digit"]))
}
setDT(df_joined)[, c(
list(column1 = unlist(x <- mystrsplit(column1))),
lapply(.SD, rep, times = lengths(x))
), .SDcols = -"column1"]
Output
column1 column2 column3
1: ab34 ID_27 area51
2: ab35 ID_27 area51
3: ab36 ID_27 area51
4: cb23 ID_28 area52
5: df45 ID_29 area53
6: df46 ID_29 area53
7: gh21 ID_30 area54
8: ab77 ID_27 area51
9: cb35 ID_27 area51
10: df23 ID_28 area52
11: gh73 ID_28 area52
12: ij45 ID_29 area53
13: kl46 ID_29 area53
14: mn21 ID_29 area53
Or in a tidyverse way
library(dplyr)
library(tidyr)
mystrsplit <- function(x) {
x <- stringr::str_match_all(x, "(?<prefix>[a-z] )? (?<digit>\\d )")
lapply(x, function(s) paste0(tidyr:::fillDown(s[, "prefix"]), s[, "digit"]))
}
df_joined %>% mutate(column1 = mystrsplit(column1)) %>% unnest(column1)
Output
# A tibble: 14 x 3
column1 column2 column3
<chr> <chr> <chr>
1 ab34 ID_27 area51
2 ab35 ID_27 area51
3 ab36 ID_27 area51
4 cb23 ID_28 area52
5 df45 ID_29 area53
6 df46 ID_29 area53
7 gh21 ID_30 area54
8 ab77 ID_27 area51
9 cb35 ID_27 area51
10 df23 ID_28 area52
11 gh73 ID_28 area52
12 ij45 ID_29 area53
13 kl46 ID_29 area53
14 mn21 ID_29 area53
