I have a large dataset with more than 1 billion observations, and I need to perform some string operations on it, which is slow.
My code is as simple as this:
DT[, var := some_function(var2)]
If I'm not mistaken, data.table uses multithreading when it is called with by, and I'm trying to take advantage of that to parallelize this operation. To do so, I can create an interim grouping variable, such as
DT[, grouper := .I %/% 100]
and do
DT[, var := some_function(var2), by = grouper]
I tried some benchmarking with a small sample of data, but surprisingly I did not see a performance improvement. So my questions are:
- Does data.table use multithreading when it's used with by?
- If so, is there a condition that multithreading is enabled / disabled?
- Is there a way that user can "enforce" data.table to use multithreading here?
FYI, I see that multithreading is enabled with half of my cores when I load data.table, so I guess there's no OpenMP issue here.
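For reference, here is a minimal sketch of how the thread count reported at load time can be inspected and changed, using data.table's getDTthreads() and setDTthreads(). As far as I understand, this setting only affects the routines that data.table parallelizes internally; it does not by itself make an arbitrary function in j run in parallel.

```r
library(data.table)

# Report how many threads data.table will currently use.
n_threads <- getDTthreads()
print(n_threads)

# Ask for all logical CPUs (0 means "all"). The previous setting is
# returned invisibly, so it can be restored afterwards.
old <- setDTthreads(0)
print(getDTthreads())

# Restore the earlier setting.
setDTthreads(old)
```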
Answer:
I got answers from the data.table developers on the data.table GitHub.
Here's a summary:
- Finding the groups of the by variable is itself always parallelized.
- More importantly, if the function in j is generic (a user-defined function), there is no parallelization.
- Operations in j are parallelized if the function is GForce-optimized (expressions in j that contain only the functions min, max, mean, median, var, sd, sum, prod, first, last, head, tail).
So, if the function in j is generic, it is advised to parallelize the operation manually, although that does not always guarantee a speed gain. Reference
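One way to check which case a given call falls into is to run it with verbose = TRUE and look for the GForce message. The sketch below uses made-up data and a hypothetical wrapper, my_mean, that returns the same values as mean() but is not one of the recognized functions. The exact wording of the verbose output may differ between data.table versions, so treat the string match as an assumption.

```r
library(data.table)

set.seed(1)
DT <- data.table(g = sample(1:100, 1e5, replace = TRUE), x = rnorm(1e5))

# Hypothetical user-defined wrapper: same values as mean(), but opaque
# to data.table's query optimizer.
my_mean <- function(v) mean(v)

# verbose = TRUE prints the query plan; capture it so it can be inspected.
out_gforce <- capture.output(
  res_gforce <- DT[, .(m = mean(x)), by = g, verbose = TRUE]
)
out_udf <- capture.output(
  res_udf <- DT[, .(m = my_mean(x)), by = g, verbose = TRUE]
)

# TRUE if the captured log reports that j was rewritten by GForce.
gforce_used <- function(out) any(grepl("GForce optimized", out, fixed = TRUE))

print(gforce_used(out_gforce))
print(gforce_used(out_udf))
```

Both calls return the same per-group means; only the first one is expected to be rewritten into the internally optimized form.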
==Solution==
In my case, I ran into a "vector memory exhausted" error when I simply ran DT[, var := some_function(var2)], even though my server had 1 TB of RAM and the data took up only 200 GB of memory.
I used split(DT, by='grouper') to split my data.table into chunks, and used doFuture with foreach %dopar% to do the job. It was pretty fast.
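A minimal sketch of that approach on toy data follows. It is not my exact code: some_function here is a hypothetical stand-in (toupper), and the chunk size, worker count, and the multisession backend are all assumptions to adjust for the real workload. Each worker returns only the computed vector, which keeps the amount of data sent back small.

```r
library(data.table)
library(foreach)
library(future)
library(doFuture)

registerDoFuture()                # route %dopar% through the future framework
plan(multisession, workers = 2)   # assumed worker count; tune to the machine

# Hypothetical stand-in for the slow string operation.
some_function <- function(x) toupper(x)

DT <- data.table(var2 = sample(letters, 1e4, replace = TRUE))
DT[, grouper := .I %/% 1000]      # contiguous chunks of up to 1000 rows

# List of data.tables, one per grouper value, in order of appearance.
chunks <- split(DT, by = "grouper")

# Each worker receives one chunk and returns only the computed vector;
# results are combined in the original chunk order.
res <- foreach(chunk = chunks, .combine = c) %dopar% {
  some_function(chunk$var2)
}

# Chunks are contiguous, so the combined vector lines up with the rows.
DT[, var := res]
DT[, grouper := NULL]

plan(sequential)                  # shut down the background workers
```

Whether this is faster depends on how expensive some_function is relative to the cost of shipping each chunk to a worker, so it is worth timing on a sample first.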
