Creating a single column from multiple columns combined-CodePudding

A dataset that I am using has recorded respondents' ethnicity. Responses are recorded across multiple variables, and respondents were allowed to pick more than one. Example:

ethnicity1     ethnicity2     ethnicity3    ethnicity4     ethnicity5     ethnicity6
         1              0              0             0              0              0        
         0              2              0             0              0              0    
         0              0              3             4              0              0

Each ethnicity has its own column. I have recoded each column using the recode command so that different numbers represent different ethnicities (i.e., black would be 1, white would be 2, etc.) to try and make a single ethnicity variable

A) create a single column from the multiple columns combined

B) have it so that anyone who reported multiple columns gets designated "multiple".

My expected output would look like this:

Ethnicity
      1
      2
     999

(I'm not sure whether it's best to have a numerical value signify multiple races for coding purposes, or have it be a character value as in "multiple')

Initially, I thought to do this but it's not going as I'd hope.

Ethnicity <- df %>% dplyr::na_if(0)
## create column for  ethnicity
Ethnicity %>% unite("RaceEthnicity", ethnicity1:ethnicity5, na.rm = TRUE, remove = FALSE)

I'm a bit of an R novice, so any help would be greatly appreciated!

CodePudding user response：

Here's a tidyverse solution. I assume your data has a column for the respondent. I've added that and named it ID.

To understand what's happening you can run the code by successively adding each line, up to but not including the pipe (%>%) and look at the output.

The columns used for pivot_longer will depend on what your real data looks like: here the ethnicities are in 1-6, ID in 7.

library(dplyr)
library(tidyr)

mydata %>% 
  # add IDs for respondent
  mutate(ID = LETTERS[1:3]) %>%
  # convert to 'long' format 
  pivot_longer(1:6) %>% 
  # remove zero value rows
  filter(value != 0) %>% 
  # group by person
  group_by(ID) %>% 
  # use value where there is one row per person, otherwise use 999
  # we need doubles for both values (existing are int)
  summarise(ethnicity = case_when(n() == 1 ~ as.double(value), 
                                  TRUE ~ 999)) %>% 
  ungroup() %>% 
  # discard duplicate rows
  distinct()

Result:

ID    ethnicity
  <chr>     <dbl>
1 A             1
2 B             2
3 C           999

Your example data with corrected column names:

mydata <- structure(list(ethnicity1 = c(1L, 0L, 0L), 
                         ethnicity2 = c(0L, 2L, 0L), 
                         ethnicity3 = c(0L, 0L, 3L), 
                         ethnicity4 = c(0L, 0L, 4L), 
                         ethnicity5 = c(0L, 0L, 0L), 
                         ethnicity6 = c(0L, 0L, 0L)), 
                    class = "data.frame", 
                    row.names = c(NA, -3L))

CodePudding user response：

Here's an approach with dplyr and purrr:

library(dplyr);library(purrr)
df %>%
  mutate(RaceEthnicity = select(cur_data(), enthnicity1:ethnicity6) %>%
                                  {case_when(pmap_lgl(., ~ all(is.na(.x))) ~ NA_real_,
                                             rowSums(.,na.rm = TRUE) == 0 ~ 0,
                                             rowSums(.,na.rm = TRUE) != pmap_int(.,pmax,na.rm = TRUE) ~ 999,
                                             TRUE ~ rowSums(.,na.rm = TRUE))})
  enthnicity1 enthnicity2 ethnicity3 enthnicity4 enthnicity5 ethnicity6 RaceEthnicity
1           1           0          0           0           0          0             1
2           0           2          0           0           0          0             2
3           0           0          3           4           0          0           999

This is probably not the most novice friendly approach, but you can define the columns within the select call. After select, we pass the data into a set of {} so that the data is represented by the . symbol. From there, we use dplyr::case_when to test multiple conditions.

If all columns are NA, return NA
If the rowSums = 0, return 0
If the rowSums don't equal the row max, return 999
Otherwise, return the rowSum (because it will only be length one and be the ethnicity of interest).

Note you misspelled column names.

Data:

df <- structure(list(enthnicity1 = c(1L, 0L, 0L), enthnicity2 = c(0L, 
2L, 0L), ethnicity3 = c(0L, 0L, 3L), enthnicity4 = c(0L, 0L, 
4L), enthnicity5 = c(0L, 0L, 0L), ethnicity6 = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -3L))

CodePudding user response：

Here is another tidyverse solution. Here, I create a new column (with mutate) and then use pmap to select all of the columns that start with ethnicity. Then, I put everything from that row into a list. Then, I remove all of the 0s from that list and replace any row that has more than one value with 999 and keep only 1 unique value.

library(tidyverse)

df %>%
  mutate(Ethnicity = pmap(
    select(., starts_with("ethnicity")),
    ~ c(...) %>%
      keep(~ all(. != 0)) %>%
      replace(length(.) > 1, 999) %>%
      unique
  ))

If you need to simply select the columns (since your real data may not actually have the word "ethnicity" for each column), then you can just put in the column index (e.g., c(1:6)) or use the column names (shown below).

df %>%
  mutate(Ethnicity = pmap(
    select(., c("ethnicity1", "ethnicity2", "ethnicity3", "ethnicity4", "ethnicity5", "ethnicity6")),
    ~ c(...) %>%
      keep(~ all(. != 0)) %>%
      replace(length(.) > 1, 999) %>%
      unique
  ))

Another option is to use mutate with ifelse and change any row with multiple values to 999.

library(tidyverse)

df %>%
  mutate(Ethnicity = pmap(select(., starts_with("ethnicity")),  ~ c(...) %>%
                            keep( ~ all(. != 0)))) %>%
  rowwise %>%
  mutate(Ethnicity = ifelse(length(Ethnicity) > 1, 999, Ethnicity)) %>%
  select(Ethnicity)

Output

# A tibble: 3 × 1
# Rowwise: 
  Ethnicity
      <dbl>
1         1
2         2
3       999

Data

df <-
  structure(
    list(
      ethnicity1 = c(1L, 0L, 0L),
      ethnicity2 = c(0L, 2L, 0L),
      ethnicity3 = c(0L, 0L, 3L),
      ethnicity4 = c(0L, 0L, 4L),
      ethnicity5 = c(0L, 0L, 0L),
      ethnicity6 = c(0L, 0L, 0L)
    ),
    class = "data.frame",
    row.names = c(NA,-3L)
  )

CodePudding user response：

in Base R you could do:

aggregate(.~row, data.frame(which(df>0, TRUE)), \(x) if(length(x)>1)999 else x)

  row col
1   1   1
2   2   2
3   3 999

CodePudding user response：

I would propose another strategy to consider. It seems that if the new number of ethnicityn columns is limited (less than 32 in the simple case) the better approach could be using a bitmask. This way is used in many languages for the alike purposes for instance in MySQL list columns, in Pascal/Delphi sets, etc. In this case, the resulting column will hold the following values: c(1L, 2L, 12L)

CodePudding user response：

Maybe that simple? Or do I overlook something?

library(dplyr)
df %>% 
  mutate(Ethnicity = rowSums(select(., contains("ethnicity"))),
         Ethnicity = ifelse(Ethnicity > 2, 999, Ethnicity))

  ethnicity1 ethnicity2 ethnicity3 ethnicity4 ethnicity5 ethnicity6 Ethnicity
1          1          0          0          0          0          0         1
2          0          2          0          0          0          0         2
3          0          0          3          4          0          0       999