Home > Software engineering >  Pivoting a table with multiple binary measures of the same variable in R
Pivoting a table with multiple binary measures of the same variable in R

Time:01-27

I have a data.frame with hundreds of columns where each is a binary ("YES"/"NO") for a possible result on a measurement, and can be collapsed to a short list of variables with multiple options (none of the collapsed variable has more than one YES). I'm trying to figure out the collapsing, but the only solution I could come up with is very inelegant and time consuming on my large dataset.

Here's a toy example to explain what I mean:

> groceries<-tribble(~item, ~prod_potato, ~prod_apple, ~prod_banana, ~day_monday, ~day_tuesday, ~day_wednesday,
                   1, "N","N","Y","N","N","Y",
                   2, "Y","N","N","N","N","Y",
                   3, "Y","N","N","N","Y","N",
                   4,"N","Y","N","Y","N","N")

# A tibble: 4 x 7
   item prod_potato prod_apple prod_banana day_monday day_tuesday day_wednesday
  <dbl> <chr>       <chr>      <chr>       <chr>      <chr>       <chr>        
1     1 N           N          Y           N          N           Y            
2     2 Y           N          N           N          N           Y            
3     3 Y           N          N           N          Y           N            
4     4 N           Y          N           Y          N           N     

Each item can only be ONE of potato, banana or apple, and can only be bought on a specific day, so these multiple columns are really not helpful.

My desired result would be something like:

    item    prod    day
       1    banana  wednesday
       2    potato  wednesday
       3    potato  tuesday
       4    apple   monday

Here's the solution I came up with, which does the job, but isn't very good:

pivot_longer(groceries,2:4,names_to=c("prod"),names_prefix = "prod_") %>% 
     filter(value=="Y") %>% select(-value) %>% 
  pivot_longer(2:4,names_to="day",names_prefix="day_") %>% 
     filter(value=="Y") %>% select(-value)
# A tibble: 4 x 3
   item prod   day      
  <dbl> <chr>  <chr>    
1     1 banana wednesday
2     2 potato wednesday
3     3 potato tuesday  
4     4 apple  monday   

But I'm 100% sure there's a less cumbersome solution that doesn't require me to repeat this process over some 20-odd collapsed variables.

My absolute ideal solution would be able to group the columns based on the string before the _ and use that as the column name, and put the string after the _ as the value when the value of the original variable is "YES". But I'm willing to work with a slightly more manual solution where I identify the columns to group and the variable name every time.

Can anyone suggest a solution (ideally in tidyverse -- I'm sure data.table would have a super efficient solution, but I could never wrap my head around it)?

Thanks!

CodePudding user response:

You can use pivot_longer to separate all of them into group and options, then just reduce to the Y values (I use summarize just to drop the column, but can easily use filter here) and pivot_wider.

library(dplyr)
library(tidyr)
# library(tidyverse)

groceries %>%
  pivot_longer(-item,
               names_to = c("group", "options"),
               names_sep = "_") %>%
  group_by(item, group) %>%
  summarize(options = options[value == "Y"],
            .groups = "drop") %>%
  pivot_wider(names_from = "group",
              values_from = "options")
#> # A tibble: 4 × 3
#>    item day       prod  
#>   <dbl> <chr>     <chr> 
#> 1     1 wednesday banana
#> 2     2 wednesday potato
#> 3     3 tuesday   potato
#> 4     4 monday    apple

CodePudding user response:

library(data.table)
setDT(groceries)

dt <- melt(groceries, id.vars = c("item"))[value == "Y"]
dt[, c("A", "B") := tstrsplit(variable, "_")]
dcast(dt, item ~ A, value.var = c("B"))

#    item       day   prod
# 1:    1 wednesday banana
# 2:    2 wednesday potato
# 3:    3   tuesday potato
# 4:    4    monday  apple

Or as one-liner: dcast(melt(groceries, id.vars = c("item"))[value == "Y"][, c("A", "B") := tstrsplit(variable, "_")], item ~ A, value.var = c("B"))

  •  Tags:  
  • Related