I have a data.frame with hundreds of columns where each is a binary ("YES"/"NO") for a possible result on a measurement, and can be collapsed to a short list of variables with multiple options (none of the collapsed variable has more than one YES). I'm trying to figure out the collapsing, but the only solution I could come up with is very inelegant and time consuming on my large dataset.
Here's a toy example to explain what I mean:
> groceries<-tribble(~item, ~prod_potato, ~prod_apple, ~prod_banana, ~day_monday, ~day_tuesday, ~day_wednesday,
1, "N","N","Y","N","N","Y",
2, "Y","N","N","N","N","Y",
3, "Y","N","N","N","Y","N",
4,"N","Y","N","Y","N","N")
# A tibble: 4 x 7
item prod_potato prod_apple prod_banana day_monday day_tuesday day_wednesday
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 N N Y N N Y
2 2 Y N N N N Y
3 3 Y N N N Y N
4 4 N Y N Y N N
Each item can only be ONE of potato, banana or apple, and can only be bought on a specific day, so these multiple columns are really not helpful.
My desired result would be something like:
item prod day
1 banana wednesday
2 potato wednesday
3 potato tuesday
4 apple monday
Here's the solution I came up with, which does the job, but isn't very good:
pivot_longer(groceries,2:4,names_to=c("prod"),names_prefix = "prod_") %>%
filter(value=="Y") %>% select(-value) %>%
pivot_longer(2:4,names_to="day",names_prefix="day_") %>%
filter(value=="Y") %>% select(-value)
# A tibble: 4 x 3
item prod day
<dbl> <chr> <chr>
1 1 banana wednesday
2 2 potato wednesday
3 3 potato tuesday
4 4 apple monday
But I'm 100% sure there's a less cumbersome solution that doesn't require me to repeat this process over some 20-odd collapsed variables.
My absolute ideal solution would be able to group the columns based on the string before the _ and use that as the column name, and put the string after the _ as the value when the value of the original variable is "YES". But I'm willing to work with a slightly more manual solution where I identify the columns to group and the variable name every time.
Can anyone suggest a solution (ideally in tidyverse -- I'm sure data.table would have a super efficient solution, but I could never wrap my head around it)?
Thanks!
CodePudding user response:
You can use pivot_longer to separate all of them into group and options, then just reduce to the Y values (I use summarize just to drop the column, but can easily use filter here) and pivot_wider.
library(dplyr)
library(tidyr)
# library(tidyverse)
groceries %>%
pivot_longer(-item,
names_to = c("group", "options"),
names_sep = "_") %>%
group_by(item, group) %>%
summarize(options = options[value == "Y"],
.groups = "drop") %>%
pivot_wider(names_from = "group",
values_from = "options")
#> # A tibble: 4 × 3
#> item day prod
#> <dbl> <chr> <chr>
#> 1 1 wednesday banana
#> 2 2 wednesday potato
#> 3 3 tuesday potato
#> 4 4 monday apple
CodePudding user response:
library(data.table)
setDT(groceries)
dt <- melt(groceries, id.vars = c("item"))[value == "Y"]
dt[, c("A", "B") := tstrsplit(variable, "_")]
dcast(dt, item ~ A, value.var = c("B"))
# item day prod
# 1: 1 wednesday banana
# 2: 2 wednesday potato
# 3: 3 tuesday potato
# 4: 4 monday apple
Or as one-liner:
dcast(melt(groceries, id.vars = c("item"))[value == "Y"][, c("A", "B") := tstrsplit(variable, "_")], item ~ A, value.var = c("B"))
