Transforming dummy variables to single column in R-CodePudding

I have the following table in R which lists a person race, gender, age, and cholesterol test. age and cholesterol test are displayed as dummy variables. age can be categorized as low, medium, or high, while cholesterol tests can be categorized as low or high. I want to transform the age and cholesterol columns to be single columns where low is categorized as 1, medium is categorized as 2, and high is categorized as 3. Cholesterol test can be neigh low or high if a person never took one and should be N/A in the expected output. I want the solution to be dynamic so that if I have multiple columns in this format, the code would still work (i.e. there may be some new tests, which can be categorized as high, low, or medium as dummy variables).

How can I do this in R?

input:

  race  gender age.low_tm1 age.medium_tm1 age.high_tm1 chol_test.low_tm1 chol_test.high_tm1
  <chr>  <int>       <int>          <int>        <int>             <int>              <int>
1 white      0           1              0            0                 0                  0
2 white      0           1              0            0                 0                  0
3 white      1           1              0            0                 0                  0
4 black      1           0              1            0                 0                  0
5 white      0           0              0            1                 0                  1
6 black      0           0              1            0                 1                  0

expected output:

  race  gender   age  chol_test
1 white      0     1        n/a  
2 white      0     1        n/a
3 white      1     1        n/a
4 black      1     2        n/a
5 white      0     3          3
6 black      0     2          1

CodePudding user response：

Perhaps this helps

library(dplyr)
library(tidyr)
library(stringr)
df1 %>% 
   mutate(across(contains("_"),  ~  
   . * setNames(1:3, c("low", "medium", "high"))[
     str_extract(cur_column(), "low|medium|high")]))   %>%    
  rename_with(~ str_remove(., "_tm1")) %>% 
  pivot_longer(cols = -c(race, gender), 
    names_to = c(".value", "categ"), names_sep = "\\.") %>% 
  filter(age > 0|chol_test > 0) %>% 
  select(-categ) %>% 
  mutate(chol_test = na_if(chol_test, 0))

-output

# A tibble: 7 × 4
  race  gender   age chol_test
  <chr>  <int> <int>     <int>
1 white      0     1        NA
2 white      0     1        NA
3 white      1     1        NA
4 black      1     2        NA
5 white      0     3         3
6 black      0     0         1
7 black      0     2        NA

data

df1 <- structure(list(race = c("white", "white", "white", "black", "white", 
"black"), gender = c(0L, 0L, 1L, 1L, 0L, 0L), age.low_tm1 = c(1L, 
1L, 1L, 0L, 0L, 0L), age.medium_tm1 = c(0L, 0L, 0L, 1L, 0L, 1L
), age.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 0L), chol_test.low_tm1 = c(0L, 
0L, 0L, 0L, 0L, 1L), chol_test.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 
0L)), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6"))

CodePudding user response：

We could first define a custom function that allows us to recode dummy variables based on their variable names, below called var_nm2value.

Then we can define a list of recode values, below recode_ls.

Finally, we use purrr::map_dfc in a dplyr::summarise where we use the variable strings we want to create "age" and "chol_test", then ii) select only columns which contain this string, and in each iteration we iii) apply dplyr::across to recode the values, iv) pipe the result in a do.call to get the max and finally v) recode 0s to NA:

# custom function to recode  0/1 dummy variables based on their variable name an 
var_nm2value <- function(x, values_ls) {
  for (val in seq_along(values_ls)) {
    if(grepl(names(values_ls)[val], dplyr::cur_column())) {
      return(ifelse(x == 1L, values_ls[[val]], x))
    } 
  }
}

# define list of recode values
recode_ls <- list(low = 1, medium = 2, high = 3)

library(tidyverse)

# apply functions to data.frame
df1 %>% 
  summarise(race = race,
            gender = gender,
            map_dfc(set_names(c("age", "chol_test")), # i)
                    function(x) { 
                      select(., contains(x)) %>% # ii)
                        summarise("{x}" := across(everything(), var_nm2value, recode_ls) %>% # iii)
                                    do.call("pmax", .) %>% # iv) 
                                    ifelse(. == 0, NA, .))} # v)
            )) 

#>    race gender age chol_test
#> 1 white      0   1        NA
#> 2 white      0   1        NA
#> 3 white      1   1        NA
#> 4 black      1   2        NA
#> 5 white      0   3         3
#> 6 black      0   2         1

^{Created on 2022-01-03 by the reprex package (v0.3.0)}