I have data that look like this:
+---+-------+
|   | col1  |
+---+-------+
| 1 | A     |
| 2 | A,B   |
| 3 | B,C   |
| 4 | B     |
| 5 | A,B,C |
+---+-------+
Expected output:
+---+---+---+---+
|   | A | B | C |
+---+---+---+---+
| 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 0 |
| 3 | 0 | 1 | 1 |
| 4 | 0 | 1 | 0 |
| 5 | 1 | 1 | 1 |
+---+---+---+---+
How can I encode it like this?
CodePudding user response:
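Maybe this could help (it needs dplyr and tidyr, and assumes col1 is a list column as in the Data below):
library(dplyr)
library(tidyr)
df %>%
  mutate(r = 1:n()) %>%   # row id, so each value stays tied to its row
  unnest(col1) %>%        # one row per (value, row id) pair
  table() %>%             # cross-tabulate col1 against r
  t()                     # transpose so that rows are r and columns are col1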
which gives
   col1
r   A B C
  1 1 0 0
  2 1 1 0
  3 0 1 1
  4 0 1 0
  5 1 1 1
Data
df <- tibble(
  col1 = list(
    "A",
    c("A", "B"),
    c("B", "C"),
    "B",
    c("A", "B", "C")
  )
)
If your data are given in the following format
df <- data.frame(
  col1 = c("A", "A,B", "B,C", "B", "A,B,C")
)
then you can try
with(
  df,
  table(rev(stack(setNames(strsplit(col1, ","), seq_along(col1)))))
)
which gives
   values
ind A B C
  1 1 0 0
  2 1 1 0
  3 0 1 1
  4 0 1 0
  5 1 1 1
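The base R one-liner above packs the split, the row labelling, and the cross-tabulation into a single expression and returns a table object. As a rough sketch of the same idea spelled out step by step, the version below builds a plain data frame instead. The names parts, categories, encoded, and result are made up for this illustration, and stringsAsFactors = FALSE is set only so that strsplit() receives character input on older R versions.

```r
# Sample data in the comma-separated format from the question
df <- data.frame(
  col1 = c("A", "A,B", "B,C", "B", "A,B,C"),
  stringsAsFactors = FALSE
)

# Split each string on the literal comma: one character vector per row
parts <- strsplit(df$col1, ",", fixed = TRUE)

# All distinct categories, sorted so the column order is predictable
categories <- sort(unique(unlist(parts)))

# For every row, flag which categories are present (1L) or absent (0L).
# vapply() returns one column per row, so transpose to get one row per row.
# Note: this sketch assumes at least two distinct categories.
encoded <- t(vapply(
  parts,
  function(p) as.integer(categories %in% p),
  integer(length(categories))
))
colnames(encoded) <- categories

# Attach the indicator columns to the original data
result <- cbind(df, encoded)
print(result)
```

Printing result should show col1 followed by the integer columns A, B, and C, with the same 0/1 pattern as the expected output in the question.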
CodePudding user response:
You could use table() with map_df() from purrr to count the occurrences
in each element of a list and return a data frame. Wrapping that in a
function with some post-processing, and relying on dplyr's unpacking of
data frame results in mutate(), you could do something like this to stay
within a data frame context:
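library(tidyverse)
one_hot <- function(x) {
  map_df(x, table) %>%              # one row of counts per list element
    mutate_all(as.integer) %>%      # convert the counts to plain integers
    mutate_all(replace_na, 0L)      # categories absent from a row become 0
}
df <- data.frame(col1 = c("A", "A,B", "B,C", "B", "A,B,C"))
df %>%
  mutate(
    one_hot(strsplit(col1, ","))    # unnamed data frame result is unpacked into columns
  )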
#>    col1 A B C
#> 1     A 1 0 0
#> 2   A,B 1 1 0
#> 3   B,C 0 1 1
#> 4     B 0 1 0
#> 5 A,B,C 1 1 1
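Another way to stay inside a data frame is to reshape long, then wide. This is a sketch of a different technique from the table()-based function above, not a rewrite of it: it uses separate_rows() and pivot_wider() from tidyr, and the column names id and value as well as the object name wide are chosen here purely for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  col1 = c("A", "A,B", "B,C", "B", "A,B,C"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  mutate(id = row_number()) %>%        # remember which row each value came from
  separate_rows(col1, sep = ",") %>%   # one row per (id, category) pair
  mutate(value = 1L) %>%               # mark the category as present
  pivot_wider(
    names_from = col1,                 # one column per category
    values_from = value,
    values_fill = 0L                   # categories missing from a row become 0
  )

print(wide)
```

The result keeps id as the first column, followed by one indicator column per category in order of first appearance. The original col1 strings are consumed by the reshape, so join back on id if you need them.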
