How to sum the number of occurrences of a specific value in a matrix-CodePudding

I need a very efficient way to count every occurrence of 3 specific values in a data frame.

The only values that can be found in my data frame are:

0/1
1/1
0/0

I want my output in the form of 3 different variables, each containing the number of occurrences of one value.

As mentioned in the comments, I tried table(unlist(DF)), but I reckon my data frame is too big for that.

CodePudding user response：

As suggested by RobZ in the comments, table(unlist(df, use.names = F)) is quite speedy, but using a matrix or vector-like structure is faster. According to this SO post, and as inspecting table confirms, it calls tabulate under the hood. We can call that directly if we are sure no edge cases exist.

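bench::mark(
  # baseline: table() on the flattened data frame
  one = table(unlist(df, use.names = FALSE)),
  # map each string to an integer code, then call the internal tabulate directly
  two = .Internal(tabulate(
    unlist(df, use.names = FALSE) |> # make a vector
      (\(.) ifelse(. == "0/0", 1L, ifelse(. == "0/1", 2L, 3L)))(), # convert to integer codes
    3)),
  check = FALSE # the two results differ in class and names, so skip the equality check
)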

# A tibble: 2 x 13
  expression      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 one          62.8us    67us    13689.        0B     6.26  6555     3      479ms
2 two          11.1us  12.3us    76757.        0B     7.68  9999     1      130ms

Since we defined the "factor levels" ourselves, we can infer the meaning of the numbers: position 1 is "0/0", position 2 is "0/1" and position 3 is "1/1". Note that the output is not the same as that of table: it is a bare integer vector without names, which is the price of speed over convenience. Another thing to consider is premature optimization. The code in this post is just for educational purposes. Base R functions are usually quite optimized for the general use case, and minor tweaks probably take longer to implement than they save, on top of the risk of errors if edge cases are actually present.
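
A minimal sketch of the same idea that stays with the exported tabulate() rather than .Internal(), and that fails loudly if an unexpected value shows up. The toy data frame and the names lv, codes and counts are made up for illustration; this has not been benchmarked against the versions above.

```r
# Toy data (hypothetical); the data frame in the question is much larger.
df <- data.frame(a = c("0/0", "0/1", "1/1"),
                 b = c("0/1", "0/1", "0/0"),
                 stringsAsFactors = FALSE)

# The order of lv defines which position in the result belongs to which value.
lv <- c("0/0", "0/1", "1/1")

# match() gives the position of each value in lv, or NA for anything else.
codes <- match(unlist(df, use.names = FALSE), lv)
stopifnot(!anyNA(codes))  # stop if an edge case (unexpected value or NA) exists

# Count the integer codes and put the labels back on.
counts <- tabulate(codes, nbins = length(lv))
names(counts) <- lv
counts  # 0/0: 2, 0/1: 3, 1/1: 1
```

Attaching the names afterwards costs very little and removes the need to remember which position belongs to which value.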

CodePudding user response：

If data has modest dimensions, then you can do:

f1 <- function(data, levels) {
  c(table(factor(unlist(data, FALSE, FALSE), levels)))
}
f1(data, c("0/1", "1/1", "0/0"))

If not, then you need a function that avoids unlist(data), because you may not be able to allocate memory for a prod(dim(data))-length character vector. A less concise but much more efficient approach is to compute column-wise counts then compute their sum:

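f2 <- function(data, levels) {
  n <- length(levels)
  # count the occurrences of each level within a single column
  tt <- function(x, levels) tabulate(factor(x, levels), n)
  # one set of counts per variable: an n-by-length(data) integer matrix
  cc <- vapply(data, tt, integer(n), levels, USE.NAMES = FALSE)
  # sum the per-column counts across variables
  res <- as.integer(.rowSums(cc, n, length(data)))
  names(res) <- levels
  res
}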
f2(data, c("0/1", "1/1", "0/0"))
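
The column-wise idea can also be sketched as a plain loop that keeps a running total, which makes it explicit that only one column is processed at a time. The function name count_levels, the toy data frame and the variable names n_01, n_11 and n_00 are hypothetical, and match() is swapped in for factor() here. The last lines show one way to get the three separate variables the question asks for.

```r
# Accumulate per-column counts without ever flattening the whole data frame.
count_levels <- function(data, levels) {
  total <- integer(length(levels))
  for (col in data) {  # a data frame is a list of columns
    # match() maps each value to its position in levels; tabulate() ignores NA
    total <- total + tabulate(match(col, levels), length(levels))
  }
  names(total) <- levels
  total
}

# Toy data (hypothetical)
toy <- data.frame(a = c("0/0", "0/1", "1/1", "0/1"),
                  b = c("1/1", "1/1", "0/0", "0/1"),
                  stringsAsFactors = FALSE)

res <- count_levels(toy, c("0/1", "1/1", "0/0"))

# Three separate variables, one per value
n_01 <- res[["0/1"]]
n_11 <- res[["1/1"]]
n_00 <- res[["0/0"]]
```

The totals are kept as integers, so a count above .Machine$integer.max would overflow to NA with a warning; starting from numeric(length(levels)) instead avoids that at the cost of returning doubles.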

Here is a test using a data frame with 1 million rows and 100 variables:

s <- c("0/1", "1/1", "0/0")

set.seed(1L)
data <- as.data.frame(replicate(100L, sample(s, 1e+06L, TRUE), simplify = FALSE))

f1(data, s)
##      0/1      1/1      0/0 
## 33329488 33332464 33338048

f2(data, s)
##      0/1      1/1      0/0 
## 33329488 33332464 33338048

microbenchmark::microbenchmark(f1(data, s), f2(data, s))
## Unit: milliseconds
##         expr       min        lq      mean    median        uq       max neval
##  f1(data, s) 2339.8481 2383.4571 2625.8864 2503.6262 2814.9704 3205.8431   100
##  f2(data, s)  616.8229  631.8747  657.5103  644.0888  662.9648  792.7816   100