Transform the cell values of a column into new binary / dummy variables-CodePudding

I've been looking through similar posts here for guidance but they all seem to focus on splitting strings within cells into distinct columns, whereas in my data all my strings are already split into individual cells.

In other words, my data look like this:

ID	word
1	blue
1	red
1	green
1	yellow
2	blue
2	purple
2	orange
2	green

But I want them to look like this:

ID	blue	red	green	yellow	purple	orange
1	1	1	1	1	0	0
2	1	0	1	0	1	1

I've tried using base R's table() to do this, but I get the error message: Error in table(df) : attempt to make a table with >= 2^31 elements which is strange because there are nowhere near that many elements in my dataset.

I've also tried doing this with pivot_wider(),

df %>% 
  pivot_wider(ID, names_from = word, values_from = word,
       values_fn = length, values_fill = 0)

but running the code above gives me an error message saying that Column 2894 must be named. Use .name_repair to specify repair. But I don't think pivot_wider accepts .name_repair as an argument. I'm also not sure I understand why it would be necessary here since I'm just trying to create a new binary column for every unique cell value.

Any help at all is appreciated! Thanks!

CodePudding user response：

Using data.table:

ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L)

word = c('blue', 'red', 'green', 'yellow', 'blue', 'purple', 'orange', 'green')

library(data.table)

DT = data.table(ID, word)

DT

dcast(DT, ID ~ word, fill = 0L, fun.agg = \(x) ifelse(x == 0, 0, 1))

Output

   ID blue green orange purple red yellow
1:  1    1     1      0      0   1      1
2:  2    1     1      1      1   0      0

CodePudding user response：

Here is a fix using mutate() from dplyr.

library(dplyr)
library(tidyr)

ID <- c(1, 1, 1, 1, 2, 2, 2, 2)
word <- c("blue", "red", "green", "yellow", "blue", "purple", "orange", "green")
dd1 <- data.frame(ID, word)

# A tibble: 2 x 7
     ID  blue   red green yellow purple orange
  <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1     1     1     1     1      1      0      0
2     2     1     0     1      0      1      1

dd1 %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = word, values_from = value, values_fill = 0)

This approach creates the values as a dummy variable and then values_fill = 0 fills in the gaps.