I am looking to generate a mean and variance value for every row in a numeric tibble. With my existing code, what I thought to be a very dplyr appropriate solution, it takes a number of hours to complete for 50,000 rows of about 35 columns.
Is there a way to speed up this operation using only dplyr? I know apply and purrr are options, but I am mostly curious if there is something about dplyr I'm overlooking when performing a large series of calculations like this.
Reproducible example:
library(tidyverse)
library(vroom)
gen_tbl(50000, cols = 40,
col_types = paste0(rep("d", 40), collapse = "")) %>%
rowwise() %>%
mutate(mean = mean(c_across()),
var = var(c_across()))
My suspicion lies with rowwise() but I am interested if there is a more nuanced way to solving this with dplyr or if it is just not a problem dplyr is good at.
CodePudding user response:
Pivoting longer might be another approach to consider. For 10k rows (I didn't want to wait for 50k), this code was 100x faster, going from 21 seconds to 0.2 seconds:
df %>%
mutate(row = row_number()) %>%
tidyr::pivot_longer(-row) %>%
group_by(row) %>%
summarize(mean = mean(value),
var = var(value)) %>%
bind_cols(df, .)
CodePudding user response:
If you can get a row-wise variance function like matrixStats::rowVars() you can do this. I'm not sure why c_across() doesn't do it but across() seems to work. I got it from this issue
library(tidyverse)
library(vroom)
gen_tbl(
rows = 50000,
cols = 40,
col_types = paste0(rep("d", 40), collapse = "")
) %>%
mutate(
mean = rowMeans(across(everything())),
# var = rowVars(across(everything())),
.before = everything()
)
# A tibble: 50,000 x 41
# mean X1 X2
# <dbl> <dbl> <dbl>
# 1 0.199 -0.715 -1.64
# 2 -0.212 -0.983 -1.38
# 3 0.126 -0.135 0.263
