I'm new to R and can't get my head around why a very basic script fails to perform one-hot encoding in a Windows environment while it works perfectly well in a Linux environment. Since I have to work in the failing Windows environment, I'd like to get the script to perform one-hot encoding there.
This is what happens on Windows (one-hot encoding fails):
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.14.2 mltools_0.3.5
loaded via a namespace (and not attached):
[1] compiler_4.1.1 Matrix_1.3-4 tools_4.1.1 grid_4.1.1 lattice_0.20-44
>
> customers <- data.frame(
id=c(10, 20, 30, 40, 50),
gender=c('male', 'female', 'female', 'male', 'female'),
mood=c('happy', 'sad', 'happy', 'sad','happy'),
outcome=c(1, 1, 0, 0, 0))
>
> customers
id gender mood outcome
1 10 male happy 1
2 20 female sad 1
3 30 female happy 0
4 40 male sad 0
5 50 female happy 0
>
> library(data.table)
> library(mltools)
>
> customers_1h <- one_hot(as.data.table(customers))
>
> customers_1h
id gender mood outcome
1: 10 male happy 1
2: 20 female sad 1
3: 30 female happy 0
4: 40 male sad 0
5: 50 female happy 0
And this is what I'd expect to happen (one-hot encoding works), as seen on Linux:
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Leap 15.3
Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8
[4] LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0
>
> customers <- data.frame(
id=c(10, 20, 30, 40, 50),
gender=c('male', 'female', 'female', 'male', 'female'),
mood=c('happy', 'sad', 'happy', 'sad','happy'),
outcome=c(1, 1, 0, 0, 0))
>
> customers
id gender mood outcome
1 10 male happy 1
2 20 female sad 1
3 30 female happy 0
4 40 male sad 0
5 50 female happy 0
>
> library(data.table)
data.table 1.14.2 using 8 threads (see ?getDTthreads). Latest news: r-datatable.com
> library(mltools)
>
> customers_1h <- one_hot(as.data.table(customers))
>
> customers_1h
id gender_female gender_male mood_happy mood_sad outcome
1: 10 0 1 1 0 1
2: 20 1 0 0 1 1
3: 30 1 0 1 0 0
4: 40 0 1 0 1 0
5: 50 1 0 1 0 0
The same packages, at least, seem to be installed in both environments. So why does one-hot encoding silently not take place, without any error or warning? Can anyone give me a hint on how to get the Windows setup to behave?
Many thanks in advance
Chris
Answer:
I think this has to do with your R versions (4.1.1 on Windows vs. 3.5.0 on Linux), not the platform. One of the key defaults for creating data frames, stringsAsFactors, changed to FALSE in R 4.0.0 after years of tripping up unsuspecting new users. However, some packages, apparently including mltools, expect the kind of data frame that the old default, stringsAsFactors = TRUE, would create. With the new default, gender and mood are character columns rather than factors, and one_hot() leaves them untouched. For more: https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html
I was able to replicate the problem and fix it by setting stringsAsFactors = TRUE. (BTW, it looks like mltools::one_hot expects a data.table as input, so I'm not sure there's a way to avoid using the data.table package.)
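To check which case you are in, it can help to look at the column classes directly. The following is a minimal base-R sketch; the helper name make_customers is made up for illustration, and stringsAsFactors is passed explicitly so the result does not depend on the R version's default.

```r
# Hypothetical helper: builds the example data with an explicit
# stringsAsFactors setting instead of relying on the version default.
make_customers <- function(strings_as_factors) {
  data.frame(
    id = c(10, 20, 30, 40, 50),
    gender = c("male", "female", "female", "male", "female"),
    mood = c("happy", "sad", "happy", "sad", "happy"),
    outcome = c(1, 1, 0, 0, 0),
    stringsAsFactors = strings_as_factors
  )
}

new_default <- make_customers(FALSE)  # what R >= 4.0.0 does by default
old_default <- make_customers(TRUE)   # what R < 4.0.0 did by default

# Collect the class of every column as a named character vector
classes_new <- vapply(new_default, class, character(1))
classes_old <- vapply(old_default, class, character(1))

print(classes_new)
print(classes_old)
```

If gender and mood show up as character rather than factor, the data frame was built the "new" way.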
Doesn't work:
customers <- data.frame(
id=c(10, 20, 30, 40, 50),
gender=c('male', 'female', 'female', 'male', 'female'),
mood=c('happy', 'sad', 'happy', 'sad','happy'),
outcome=c(1, 1, 0, 0, 0))
mltools::one_hot(data.table::as.data.table(customers))
id gender mood outcome
1: 10 male happy 1
2: 20 female sad 1
3: 30 female happy 0
4: 40 male sad 0
5: 50 female happy 0
Works:
customers <- data.frame(
id=c(10, 20, 30, 40, 50),
gender=c('male', 'female', 'female', 'male', 'female'),
mood=c('happy', 'sad', 'happy', 'sad','happy'),
outcome=c(1, 1, 0, 0, 0), stringsAsFactors = TRUE)
mltools::one_hot(data.table::as.data.table(customers))
id gender_female gender_male mood_happy mood_sad outcome
1: 10 0 1 1 0 1
2: 20 1 0 0 1 1
3: 30 1 0 1 0 0
4: 40 0 1 0 1 0
5: 50 1 0 1 0 0
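If the data frame already exists (for example, because it was read from a file), recreating it with stringsAsFactors = TRUE may not be convenient. As an alternative, here is a rough sketch that converts the character columns to factors after the fact, using base R only. The variable name char_cols is just for illustration, and the one_hot() call is guarded so it only runs when both packages are installed.

```r
# Example data built the "new" way: strings stay character columns
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c("male", "female", "female", "male", "female"),
  mood = c("happy", "sad", "happy", "sad", "happy"),
  outcome = c(1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are character
char_cols <- vapply(customers, is.character, logical(1))

# Convert only those columns to factors, leaving the others unchanged
customers[char_cols] <- lapply(customers[char_cols], factor)

# Run the encoding only if data.table and mltools are available
if (requireNamespace("data.table", quietly = TRUE) &&
    requireNamespace("mltools", quietly = TRUE)) {
  customers_1h <- mltools::one_hot(data.table::as.data.table(customers))
  print(customers_1h)
}
```

This keeps the numeric columns as they are, so only gender and mood are candidates for encoding.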
