I'm trying to count the number of NA in multiple columns of my data. Here is a reproducible sample.
structure(list(V2QE38A = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0), V2QE38B = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), V2QE38C = c(1, 1, 0, 3, 2, 0, 0,
3, 1, 1), V2QE38D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
10L), class = "data.frame")
I tried two methods: First one:
dt %>% select(starts_with("V2QE38")) %>% colSums(is.na(.))
And this gives me some results (in short, I have NAs in some columns) Then I tried another one:
colSums(is.na(dt[,c("V2QE38A", "V2QE38B", "V2QE38C", "V2QE38D")]))
And I found no NA in any of these columns.
I think the second result is correct. But I'm just wondering what did I do wrong to get the first result? Thank you!
CodePudding user response:
In the first case, there are multiple functions passed. We may either need to block it with {}
library(dplyr)
dt %>%
select(starts_with("V2QE38")) %>%
{colSums(is.na(.))}
V2QE38A V2QE38B V2QE38C V2QE38D
0 0 0 0
or have another %>%
dt %>%
select(starts_with("V2QE38")) %>%
is.na %>%
colSums
-output
V2QE38A V2QE38B V2QE38C V2QE38D
0 0 0 0
The issue is that colSums is executed first without evaluating the is.na
> dt %>%
select(starts_with("V2QE38")) %>%
colSums(.)
V2QE38A V2QE38B V2QE38C V2QE38D
6 1 12 0
which is the same as the OP's output with colSums(is.na(.))
CodePudding user response:
Base solution using sapply and an annonymous function function(x){sum(is.na(x))}:
data = structure(list(V2QE38A = c(1, 0, 1, 0, 1, 1, 1, 0, 1, 0), V2QE38B = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), V2QE38C = c(1, 1, 0, 3, 2, 0, 0,
3, 1, 1), V2QE38D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
10L), class = "data.frame")
sapply(data, function(x){sum(is.na(x))})
# V2QE38A V2QE38B V2QE38C V2QE38D
# 0 0 0 0
Explanation:
sapply applies a function on a list. data.frame is a list, with each vector being an item of this list. The s in sapply is for simplify, so sapply will try to convert the output list (from lapply) to a vector. If the required output is a list (it has some advantages), use lapply instead.
is.na returns a boolean TRUE/FALSE vector. This can be converted to a numeric vector with 1/0 values.
sum converts the TRUE/FALSE vector into a 1/0 vector and sums the values.
Alternative solutions:
Alternatively, instead of treating the data.frame as a list, treat it as a matrix. Then the highly optimized rowSums and colSums can come into play.
colSums(is.na(data))
# V2QE38A V2QE38B V2QE38C V2QE38D
# 0 0 0 0
rowSums(is.na(data))
# 1 2 3 4 5 6 7 8 9 10
# 0 0 0 0 0 0 0 0 0 0
This is great if you have a matrix and want to find where the NAs are.
