How to one-hot encode stacked columns in R

Time:01-21

I have data that looks like this:

 --- ------- 
|   |  col1 |
 --- ------- 
| 1 |     A |
| 2 |   A,B |
| 3 |   B,C |
| 4 |     B |
| 5 | A,B,C |
 --- ------- 

Expected Output

 --- --- --- --- 
|   | A | B | C |
 --- --- --- --- 
| 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 0 |
| 3 | 0 | 1 | 1 |
| 4 | 0 | 1 | 0 |
| 5 | 1 | 1 | 1 |
 --- --- --- --- 

How can I encode it like this?

CodePudding user response:

Maybe this could help

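library(dplyr)
library(tidyr)

df %>%
  mutate(r = row_number()) %>%  # keep the original row id
  unnest(col1) %>%              # one row per (label, row id) pair
  table() %>%                   # cross-tabulate col1 against r
  t()                           # transpose: rows are r, columns are labels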

which gives

   col1
r   A B C
  1 1 0 0
  2 1 1 0
  3 0 1 1
  4 0 1 0
  5 1 1 1

Data

df <- tibble(
  col1 = list(
    "A",
    c("A", "B"),
    c("B", "C"),
    "B",
    c("A", "B", "C")
  )
)

If your data is given in the following format

df <- data.frame(
  col1 = c("A", "A,B", "B,C", "B", "A,B,C")
)

then you can try

with(
  df,
  table(rev(stack(setNames(strsplit(col1, ","), seq_along(col1)))))
)

which gives

   values
ind A B C
  1 1 0 0
  2 1 1 0
  3 0 1 1
  4 0 1 0
  5 1 1 1
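
The nested call above can be hard to read, so here is a minimal sketch that unpacks it into named steps. The intermediate names (parts, long, counts, one_hot) are hypothetical, and the unique() step is an added assumption that a label repeated within one cell should count only once.

```r
df <- data.frame(
  col1 = c("A", "A,B", "B,C", "B", "A,B,C"),
  stringsAsFactors = FALSE  # strsplit() needs character input, not factors
)

# Split each cell on commas: a list with one character vector per row
parts <- strsplit(df$col1, ",")

# Assumption: a label repeated inside one cell should still count once
parts <- lapply(parts, unique)

# Name each list element by its row index so stack() can carry it along
names(parts) <- seq_along(parts)

# Long format: column `values` holds the label, `ind` holds the row index
long <- stack(parts)

# Cross-tabulate row index against label
counts <- table(long$ind, long$values)

# Convert the table to a plain data frame: 5 rows, columns A, B, C
one_hot <- as.data.frame.matrix(counts)
```

If the indicator columns should sit next to the original data, cbind(df, one_hot) is one way to attach them, since the rows come out in the original order here.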

CodePudding user response:

You could use table() with map_df() from purrr to count the occurrences in each element of a list and return a data frame. Wrapping this in a function with some post-processing, and using dplyr's data frame unpacking in mutate(), lets you stay within a data frame context:

library(tidyverse)

one_hot <- function(x) {
  map_df(x, table) %>% 
    mutate_all(as.integer) %>% 
    mutate_all(replace_na, 0L)
}

df <- data.frame(col1 = c("A", "A,B", "B,C", "B", "A,B,C"))

df %>% 
  mutate(
    one_hot(strsplit(col1, ","))
  )
#>    col1 A B C
#> 1     A 1 0 0
#> 2   A,B 1 1 0
#> 3   B,C 0 1 1
#> 4     B 0 1 0
#> 5 A,B,C 1 1 1
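
The "data frame unpacking" mentioned in this answer is easier to see in isolation. The sketch below assumes dplyr 1.0.0 or later. The helper extra() and its column names are hypothetical and serve only to illustrate: when mutate() receives an unnamed argument that evaluates to a data frame, each column of that data frame becomes a column of the result.

```r
library(dplyr)

# Hypothetical helper returning a data frame with one row per input element
extra <- function(v) {
  data.frame(double = v * 2, square = v^2)
}

df_small <- data.frame(x = 1:3)

# The call is left unnamed, so its columns are spliced into the result
out <- df_small %>%
  mutate(extra(x))

names(out)  # columns: x, double, square
```

Giving the argument a name, for example mutate(new = extra(x)), would instead store the whole data frame as a single nested column called new, which is why one_hot() is called without a name in the answer above.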