What is the fastest way to use `dplyr` to find the row-wise mean and variance of a large tibble?-CodePudding

I am looking to generate a mean and variance value for every row in a numeric tibble. With my existing code, what I thought to be a very dplyr appropriate solution, it takes a number of hours to complete for 50,000 rows of about 35 columns.

Is there a way to speed up this operation using only dplyr? I know apply and purrr are options, but I am mostly curious if there is something about dplyr I'm overlooking when performing a large series of calculations like this.

Reproducible example:

library(tidyverse)
library(vroom)
gen_tbl(50000, cols = 40, 
        col_types = paste0(rep("d", 40), collapse = "")) %>%
  rowwise() %>%
  mutate(mean = mean(c_across()),
         var = var(c_across()))

My suspicion lies with rowwise() but I am interested if there is a more nuanced way to solving this with dplyr or if it is just not a problem dplyr is good at.

CodePudding user response：

Pivoting longer might be another approach to consider. For 10k rows (I didn't want to wait for 50k), this code was 100x faster, going from 21 seconds to 0.2 seconds:

df %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_longer(-row) %>%
  group_by(row) %>%
  summarize(mean = mean(value),
            var = var(value)) %>%
  bind_cols(df, .)

CodePudding user response：

If you can get a row-wise variance function like matrixStats::rowVars() you can do this. I'm not sure why c_across() doesn't do it but across() seems to work. I got it from this issue

library(tidyverse)
library(vroom)

gen_tbl(
  rows = 50000, 
  cols = 40, 
  col_types = paste0(rep("d", 40), collapse = "")
) %>%
  mutate(
    mean = rowMeans(across(everything())),
    # var = rowVars(across(everything())),
    .before = everything()
  )

# A tibble: 50,000 x 41
#    mean     X1     X2   
#    <dbl>  <dbl>  <dbl>  
# 1  0.199   -0.715 -1.64   
# 2 -0.212   -0.983 -1.38   
# 3  0.126   -0.135  0.263