Home > database >  R data.table binary OR across specific columns
R data.table binary OR across specific columns

Time:01-12

I have a data.table with ~300 columns and 50000000 rows that looks something like this:

DT = data.table(spr_1.x=c(1,0,0,1),
spr_15.x=c(0,1,0,1),spr_99.x=c(0,0,0,1),spr_1.y=c(0,0,0,1),
spr_15.y=c(1,1,0,1),spr_99.y=c(1,1,0,1),spr_1.z=c(1,0,0,1),
spr_15.z=c(0,1,0,1),spr_99.z=c(0,0,0,1))

What I need is binary OR between spr_1, spr_15, and spr_99 columns as:

   spr_1.x spr_15.x spr_99.x spr_1.y spr_15.y spr_99.y spr_1.z spr_15.z spr_99.z spr_1 spr_15 spr_99
1:       1        0        0       0        1        1       1        0        0     1      1      1
2:       0        1        0       0        1        1       0        1        0     0      1      1
3:       0        0        0       0        0        0       0        0        0     0      0      0
4:       1        1        1       1        1        1       1        1        1     1      1      1

I can do it using the following code:

for (i in c('spr_1','spr_15','spr_99')){
x=paste0(i,'.x')
y=paste0(i,'.y')
z=paste0(i,'.z')
DT[,(i):=Reduce(`|`, .SD),.SDcols=c(x,y,z)]}
for (j in c('spr_1','spr_15','spr_99')) DT[, (j):= as.numeric(get(j))]

but, is there a better, faster, and less RAM-intensive approach?

CodePudding user response:

Maybe you are willing to try (data.table version 1.14.3 for row-wise operation)

for (i in c('spr_1','spr_15','spr_99')) {
     DT[,(i) := max(.SD), .SDcols=patterns(paste0(i,"\\.")),by = .I]    
}

For stable data.table version,

for (i in c('spr_1','spr_15','spr_99')) {
     DT[,(i) := Reduce(pmax,.SD),.SDcols=patterns(paste0(i,"\\."))]    
}

P.S: It's better to use do.call to replace Reduce here. see @r2evans 's answer.

CodePudding user response:

Formalizing my comments above into a benchmark. This is not a new answer (other than recommending do.call over Reduce, but that's not very novel), just an extended comment. (I'll delete it if it is too far out of context.)

DTbig <- rbindlist(replicate(1e7, DT, simplify=FALSE))
nrow(DTbig)
# [1] 40000000

Proof of the do.call(pmax, .SD) concept:

DTbig2 <- copy(DTbig)
for (i in c('spr_1','spr_15','spr_99')) {
  DTbig2[,(i) := Reduce(pmax, .SD), .SDcols = patterns(paste0(i, "\\."))]
}
for (i in c('spr_1','spr_15','spr_99')) {
  DTbig[,(i) := do.call(pmax, .SD), .SDcols = patterns(paste0(i, "\\."))]
}
identical(DTbig, DTbig2)
# [1] TRUE

Benchmark:

bench::mark(
orig = {
  for (i in c('spr_1','spr_15','spr_99')) {
    DTbig[,(i) :=  Reduce(`|`, .SD), .SDcols = patterns(paste0(i, "\\."))]
  }
},
Reduce = {
  for (i in c('spr_1','spr_15','spr_99')) {
    DTbig[,(i) := Reduce(pmax,.SD), .SDcols=patterns(paste0(i,"\\."))]
  }
},
docall = {
  for (i in c('spr_1','spr_15','spr_99')) {
    DTbig[,(i) := do.call(pmax, .SD), .SDcols = patterns(paste0(i, "\\."))]
  }
}, iterations = 20)
# # A tibble: 3 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory               time       gc       
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>               <list>     <list>   
# 1 orig       956.22ms    1.08s     0.939    2.68GB     2.35     6    15      6.39s <NULL> <Rprofmem [109 x 3]> <bench_tm~ <tibble ~
# 2 Reduce        1.04s    1.07s     0.943    2.68GB     2.83     5    15       5.3s <NULL> <Rprofmem [306 x 3]> <bench_tm~ <tibble ~
# 3 docall     847.69ms 863.73ms     1.15     1.79GB     1.15    10    10      8.72s <NULL> <Rprofmem [97 x 3]>  <bench_tm~ <tibble ~

(I'm really not sure how iterations= is interpreted ...)

FYI, the in Reduce(`|`, .SD) is analogous to your follow-up as.numeric without the extra step.

  •  Tags:  
  • Related