I came across this unexpected behaviour in data.table. Rows with NA in a column are removed when I exclude rows that have a certain value in that column, as in this example:
library(data.table)
dt_mtcars <- setDT(copy(mtcars))
set.seed(42)
na_rows <- runif(3, min = 1, max = nrow(mtcars))
dt_mtcars[ na_rows, cyl := NA]
dt_mtcars[ is.na(cyl), .N]
#> [1] 3
dt_mtcars <- dt_mtcars[ cyl != 4]
dt_mtcars[ is.na(cyl), .N]
#> [1] 0
Created on 2022-01-27 by the reprex package (v2.0.1)
Excluding the rows like this instead
library(data.table)
dt_mtcars <- setDT(copy(mtcars))
set.seed(42)
na_rows <- runif(3, min = 1, max = nrow(mtcars))
dt_mtcars[ na_rows, cyl := NA]
dt_mtcars[ is.na(cyl), .N]
#> [1] 3
dt_mtcars <- dt_mtcars[ !cyl %in% 4]
dt_mtcars[ is.na(cyl), .N]
#> [1] 3
does give the expected result. Am I wrong to expect the same result from the first example? Or is this a bug in data.table?
CodePudding user response:
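This isn't a bug in data.table; it comes from how R handles comparisons with NA.
In the first case the comparison returns NA for those rows, and data.table treats an NA in a logical i as FALSE, so they are not selected: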
NA != 4
[1] NA
In the second case %in% returns FALSE rather than NA, so negating it gives TRUE and the NA rows are kept:
!NA %in% 4
[1] TRUE
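If you want to keep using != but still retain the NA rows, one option is to add an explicit is.na() condition. Below is a minimal sketch on a small made-up table (the `dt` object and its `id` column are hypothetical, only there to make the result easy to read), not the mtcars data from the question:

```r
library(data.table)

# Hypothetical five-row table with a single NA in cyl
dt <- data.table(id = 1:5, cyl = c(4, 6, NA, 8, 4))

# Comparing with NA yields NA, neither TRUE nor FALSE
dt$cyl != 4
#> [1] FALSE  TRUE    NA  TRUE FALSE

# data.table treats NA in a logical i as FALSE, so row 3 is dropped
dt[cyl != 4, id]
#> [1] 2 4

# Keep the NA rows explicitly: NA | TRUE evaluates to TRUE
dt[cyl != 4 | is.na(cyl), id]
#> [1] 2 3 4

# %in% never returns NA, so negating it also keeps the NA row
dt[!cyl %in% 4, id]
#> [1] 2 3 4
```

Which form to use is mostly a matter of taste: the is.na() version states the intent to keep missing values outright, while !cyl %in% 4 is shorter but relies on the reader knowing that %in% never returns NA.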
