I am trying to calculate quantiles for every "slice" of a dataset, in order to get some kind of "confidence intervals" at a 99% level. I manage this with base R, but it is excruciatingly slow. Any idea to speed up, or for a better approach, is welcome.
a <- (1:20000)/100
b <- 20001:40000
speedseq <- data.frame(a, b)
work_quantile <- rep(NA, nrow(speedseq))
myfunc <- function() {
for(i in 1: nrow(speedseq)) {
work_quantile[i] <- quantile(speedseq$b[speedseq$a>=(speedseq$a[i] - 1) &
speedseq$a<=speedseq$a[i]], na.rm = T, probs = 0.99)
if(i%000==0) print(round(i/nrow(speedseq),3))
}
mean(is.na(work_quantile))
}
microbenchmark::microbenchmark(myfunc(), times = 1)
Unit: seconds
expr min lq mean median uq max neval
myfunc() 5.185645 5.185645 5.185645 5.185645 5.185645 5.185645 1
CodePudding user response:
You could parallelize it.
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, 'speedseq')
r0 <- parSapply(cl, 1:nrow(speedseq), \(i) unname(quantile(speedseq$b[speedseq$a >= (speedseq$a[i] - 1) &
speedseq$a <= speedseq$a[i]], na.rm=T, probs=0.99)))
stopifnot(all.equal(work_quantile, r0))
stopCluster(cl)
Other approaches:
If you want slices of 1, 1:2, 1:3, ... 1:nrow, you could do
r1 <- vapply(1:nrow(speedseq), \(x) quantile(speedseq$b[seq.int(1, x)], .99), numeric(1))
If you want 1:100, 2:101, 2:102, ..., (nrow - 99):nrow, you could do
r2 <- vapply(1:(nrow(speedseq) - 99), \(x) quantile(speedseq$b[seq.int(x, x 99)], .99), numeric(1))
If you want just slices of 100 each you could do
r3 <- vapply(seq(1, nrow(speedseq), 100), \(x) quantile(speedseq$b[seq.int(x, x 99)], .99), numeric(1))
CodePudding user response:
An approach using data.table which uses the data.table non-equi join: (not sure how memory efficient it will be but it took 4 seconds to run locally)
For each row, pre-define the upper bound, lower bound and row id:
library(data.table)
setDT(speedseq)
speedseq[, upper_bound := a]
speedseq[, lower_bound := a - 1]
speedseq[, row_id := .I]
Do a non-equi on itself. Read as:
In the first line, let's call the
speedseqon the righti, and thespeedseqon the leftXThis is your data.table merge syntax which means "for each row in
i, find the rows that match fromXThe match condition is that
X.a >= i.lower_boundandX.a <= i.upper_bound. Again, taking thelower_boundandupper_boundfrom one row ofi, this is saying "give me all rows in X whereais between these boundsThe
.(row_id = i.row_id, a = i.a, b)specifies what I want to keep, meaning I want to keep therow_idandafromi, and the values ofbthat came fromXLastly, compute the required quantile by
row_idspeedseq[speedseq, .(row_id = i.row_id, a = i.a, b), # Non equi join on = .(a >= lower_bound, a <= upper_bound) ][, .(work_quantile = quantile(b, na.rm = T, probs = 0.99)), by = .(row_id)]
