I need to find a very efficient way to calculate the number of times each of 3 specific values occurs in a data frame. Here is what my data frame looks like:
The only values that can be found there are "0/1", "1/1", and "0/0". I want the output in the form of 3 separate variables, each containing the number of occurrences of one value.
Edit:
As mentioned in the comments, I tried to use table(unlist(DF)), but I reckon my data frame is too big.
CodePudding user response:
As suggested by RobZ in the comments, table(unlist(df, use.names = F)) is quite speedy, but working on a plain vector or matrix is faster. Judging from this SO post and from inspecting table, it calls tabulate under the hood. We can call tabulate directly if we are sure no edge cases exist.
bench::mark(
  one = table(unlist(df, use.names = F)),
  two = .Internal(tabulate(
    unlist(df, use.names = F) |> # make a vector
      (\(.) ifelse(. == "0/0", 1L, ifelse(. == "0/1", 2L, 3L)))(), # converting to integer
    3)),
  check = F
)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 one 62.8us 67us 13689. 0B 6.26 6555 3 479ms
2 two 11.1us 12.3us 76757. 0B 7.68 9999 1 130ms
Since we defined the "factor levels" ourselves, we can infer the meaning of the numbers. Note that the output is not the same: the second expression returns an unnamed integer vector of counts in the order "0/0", "0/1", "1/1" rather than a named table. That is the cost of trading convenience for speed. Another thing to consider is premature optimization. The code in this post is just for educational purposes; base R functions are usually quite optimized for the general use case, and minor tweaks probably take longer to implement than they save, on top of the risk of errors if edge cases are actually present.
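To make the edge-case risk concrete: with the nested ifelse above, any unexpected string would be counted as "1/1". A slightly more defensive sketch of the same idea uses match() to build the integer codes and then labels the result. The toy data frame df and the vector lv below are made up for illustration and are not part of the benchmark.

```r
# Toy data frame; contents are invented for illustration only.
df <- data.frame(
  a = c("0/0", "0/1", "1/1", "0/1"),
  b = c("1/1", "0/1", "0/0", "0/1"),
  stringsAsFactors = FALSE
)
lv <- c("0/0", "0/1", "1/1")

# match() gives the position of each value in lv, or NA for anything else.
# tabulate() silently ignores NA, so unexpected values are not miscounted.
codes <- match(unlist(df, use.names = FALSE), lv)

# Attach the labels ourselves, since tabulate() returns an unnamed vector.
counts <- setNames(tabulate(codes, nbins = length(lv)), lv)
counts
```

This trades a little of the speed gain for safety, so it is worth re-benchmarking on real data before relying on it.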
CodePudding user response:
If data has modest dimensions, then you can do:
f1 <- function(data, levels) {
c(table(factor(unlist(data, FALSE, FALSE), levels)))
}
f1(data, c("0/1", "1/1", "0/0"))
If not, then you may need a different function, because f1 requires you to allocate memory for five prod(dim(data))-length vectors: the unlist result, the factor result, and three intermediate objects inside of table ([1], [2], [3]).
f2 below is more verbose but much more efficient:
- It computes column-wise counts then takes their sum to obtain the result. In this way, it avoids creating vectors of length greater than nrow(data).
- It uses tabulate instead of table to do the counting. You can think of tabulate as a low level analogue of table. With some care, you can use tabulate to obtain the table result without any of the associated overhead.
f2 <- function(data, levels) {
  n <- length(levels) # number of distinct values to count
  # count one column at a time, so no vector longer than nrow(data) is created
  tt <- function(x, levels) tabulate(factor(x, levels), n)
  cc <- vapply(data, tt, integer(n), levels, USE.NAMES = FALSE)
  res <- as.integer(.rowSums(cc, n, length(data))) # can delete 'as.integer' if worried about integer overflow ...
  names(res) <- levels
  res
}
f2(data, c("0/1", "1/1", "0/0"))
Here is a test using a data frame with 1 million rows and 100 variables:
s <- c("0/1", "1/1", "0/0")
set.seed(1L)
data <- as.data.frame(replicate(100L, sample(s, 1e+06L, TRUE), simplify = FALSE))
f1(data, s)
## 0/1 1/1 0/0
## 33329488 33332464 33338048
f2(data, s)
## 0/1 1/1 0/0
## 33329488 33332464 33338048
microbenchmark::microbenchmark(f1(data, s), f2(data, s))
## Unit: seconds
## expr min lq mean median uq max neval
## f1(data, s) 2.883588 2.956380 3.172275 3.114462 3.342997 3.724857 100
## f2(data, s) 1.170202 1.185615 1.203229 1.194077 1.207591 1.328175 100
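The question asks for three separate variables rather than one named vector. A minimal sketch of that last step follows; res is a hypothetical stand-in shaped like the output of f1 or f2, and the variable names n_01, n_11, n_00 are invented here.

```r
# Hypothetical named count vector, shaped like the result of f1() or f2().
res <- c("0/1" = 4L, "1/1" = 2L, "0/0" = 2L)

# [[ ]] with a name extracts a single count and drops the name.
n_01 <- res[["0/1"]]
n_11 <- res[["1/1"]]
n_00 <- res[["0/0"]]
```

Indexing by name rather than by position keeps the assignment correct even if the order of levels changes.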


