Home > Back-end >  What is the fastest way to use `dplyr` to find the row-wise mean and variance of a large tibble?
What is the fastest way to use `dplyr` to find the row-wise mean and variance of a large tibble?

Time:01-25

I am looking to generate a mean and variance value for every row in a numeric tibble. With my existing code, what I thought to be a very dplyr appropriate solution, it takes a number of hours to complete for 50,000 rows of about 35 columns.

Is there a way to speed up this operation using only dplyr? I know apply and purrr are options, but I am mostly curious if there is something about dplyr I'm overlooking when performing a large series of calculations like this.

Reproducible example:

library(tidyverse)
library(vroom)
gen_tbl(50000, cols = 40, 
        col_types = paste0(rep("d", 40), collapse = "")) %>%
  rowwise() %>%
  mutate(mean = mean(c_across()),
         var = var(c_across()))

My suspicion lies with rowwise() but I am interested if there is a more nuanced way to solving this with dplyr or if it is just not a problem dplyr is good at.

CodePudding user response:

Pivoting longer might be another approach to consider. For 10k rows (I didn't want to wait for 50k), this code was 100x faster, going from 21 seconds to 0.2 seconds:

df %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_longer(-row) %>%
  group_by(row) %>%
  summarize(mean = mean(value),
            var = var(value)) %>%
  bind_cols(df, .)

CodePudding user response:

If you can get a row-wise variance function like matrixStats::rowVars() you can do this. I'm not sure why c_across() doesn't do it but across() seems to work. I got it from this issue

library(tidyverse)
library(vroom)

gen_tbl(
  rows = 50000, 
  cols = 40, 
  col_types = paste0(rep("d", 40), collapse = "")
) %>%
  mutate(
    mean = rowMeans(across(everything())),
    # var = rowVars(across(everything())),
    .before = everything()
  )

# A tibble: 50,000 x 41
#    mean     X1     X2   
#    <dbl>  <dbl>  <dbl>  
# 1  0.199   -0.715 -1.64   
# 2 -0.212   -0.983 -1.38   
# 3  0.126   -0.135  0.263 
  •  Tags:  
  • Related