I have a data.table with ~300 columns and 50000000 rows that looks something like this:
DT = data.table(spr_1.x=c(1,0,0,1),
spr_15.x=c(0,1,0,1),spr_99.x=c(0,0,0,1),spr_1.y=c(0,0,0,1),
spr_15.y=c(1,1,0,1),spr_99.y=c(1,1,0,1),spr_1.z=c(1,0,0,1),
spr_15.z=c(0,1,0,1),spr_99.z=c(0,0,0,1))
What I need is binary OR between spr_1, spr_15, and spr_99 columns as:
spr_1.x spr_15.x spr_99.x spr_1.y spr_15.y spr_99.y spr_1.z spr_15.z spr_99.z spr_1 spr_15 spr_99
1: 1 0 0 0 1 1 1 0 0 1 1 1
2: 0 1 0 0 1 1 0 1 0 0 1 1
3: 0 0 0 0 0 0 0 0 0 0 0 0
4: 1 1 1 1 1 1 1 1 1 1 1 1
I can do it using the following code:
for (i in c('spr_1','spr_15','spr_99')){
x=paste0(i,'.x')
y=paste0(i,'.y')
z=paste0(i,'.z')
DT[,(i):=Reduce(`|`, .SD),.SDcols=c(x,y,z)]}
for (j in c('spr_1','spr_15','spr_99')) DT[, (j):= as.numeric(get(j))]
but, is there a better, faster, and less RAM-intensive approach?
CodePudding user response:
Maybe you are willing to try (data.table version 1.14.3 for row-wise operation)
for (i in c('spr_1','spr_15','spr_99')) {
DT[,(i) := max(.SD), .SDcols=patterns(paste0(i,"\\.")),by = .I]
}
For stable data.table version,
for (i in c('spr_1','spr_15','spr_99')) {
DT[,(i) := Reduce(pmax,.SD),.SDcols=patterns(paste0(i,"\\."))]
}
P.S: It's better to use do.call to replace Reduce here. see @r2evans 's answer.
CodePudding user response:
Formalizing my comments above into a benchmark. This is not a new answer (other than recommending do.call over Reduce, but that's not very novel), just an extended comment. (I'll delete it if it is too far out of context.)
DTbig <- rbindlist(replicate(1e7, DT, simplify=FALSE))
nrow(DTbig)
# [1] 40000000
Proof of the do.call(pmax, .SD) concept:
DTbig2 <- copy(DTbig)
for (i in c('spr_1','spr_15','spr_99')) {
DTbig2[,(i) := Reduce(pmax, .SD), .SDcols = patterns(paste0(i, "\\."))]
}
for (i in c('spr_1','spr_15','spr_99')) {
DTbig[,(i) := do.call(pmax, .SD), .SDcols = patterns(paste0(i, "\\."))]
}
identical(DTbig, DTbig2)
# [1] TRUE
Benchmark:
bench::mark(
orig = {
for (i in c('spr_1','spr_15','spr_99')) {
DTbig[,(i) := Reduce(`|`, .SD), .SDcols = patterns(paste0(i, "\\."))]
}
},
Reduce = {
for (i in c('spr_1','spr_15','spr_99')) {
DTbig[,(i) := Reduce(pmax,.SD), .SDcols=patterns(paste0(i,"\\."))]
}
},
docall = {
for (i in c('spr_1','spr_15','spr_99')) {
DTbig[,(i) := do.call(pmax, .SD), .SDcols = patterns(paste0(i, "\\."))]
}
}, iterations = 20)
# # A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 orig 956.22ms 1.08s 0.939 2.68GB 2.35 6 15 6.39s <NULL> <Rprofmem [109 x 3]> <bench_tm~ <tibble ~
# 2 Reduce 1.04s 1.07s 0.943 2.68GB 2.83 5 15 5.3s <NULL> <Rprofmem [306 x 3]> <bench_tm~ <tibble ~
# 3 docall 847.69ms 863.73ms 1.15 1.79GB 1.15 10 10 8.72s <NULL> <Rprofmem [97 x 3]> <bench_tm~ <tibble ~
(I'm really not sure how iterations= is interpreted ...)
FYI, the in Reduce(`|`, .SD) is analogous to your follow-up as.numeric without the extra step.
