I am trying to parallelize a function on a pandas DataFrame, and I am wondering why the parallel version is so much slower than the single-core solution. I am aware that parallelization has its costs, but I am curious whether there is a way to improve the code so that the parallel version is faster.
In my case I have a list of 300,000 User-Ids (all strings) and need to check whether each User-Id is also present in another list containing only 10,000 entries.
As I cannot share the original code, here is an example with integers that shows the same performance problem:
import pandas as pd
import numpy as np
from joblib import Parallel, delayed
import time
df = pd.DataFrame({'All': np.random.randint(50000, size=300000)})
selection = pd.Series({'selection': np.random.randint(10000, size=10000)}).to_list()
t1=time.perf_counter()
df['Is_in_selection_single']=np.where(np.isin(df['All'], selection),1,0).astype('int8')
t2=time.perf_counter()
print(t2-t1)
def add_column(x):
    return np.where(np.isin(x, selection), 1, 0)
df['Is_in_selection_parallel'] = Parallel(n_jobs=4)(delayed(add_column)(x) for x in df['All'].to_list())
t3=time.perf_counter()
print(t3-t2)
The printed timings (in seconds) are:
0.0307
53.07
which means the parallel version is roughly 1,700 times slower than the single-core one.
In my real-world example, with the User-Ids, the single-core version takes 1 minute, but the parallel version had not finished after 15 minutes.
I need the parallelization because I have to run this operation several times, so the final script takes several minutes to run. Thank you for any suggestions!
CodePudding user response:
You are splitting the job into too many sub-jobs (one for each row), which creates a very large overhead. You should cut it into a small number of chunks instead:
parallel_result = Parallel(n_jobs=4)(delayed(add_column)(x) for x in np.split(df['All'].values, 4))
df['Is_in_selection_parallel'] = np.concatenate(parallel_result)
With 4 chunks, this was about 50% faster than the non-parallel version on my platform.
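Note that `np.split` raises an error when the array length is not evenly divisible by the number of chunks. Below is a minimal sketch of the same chunking idea using `np.array_split`, which allows uneven chunks. The helper name `parallel_isin`, the extra `selection` argument, and the row count of 300001 are illustrative assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed


def add_column(chunk, selection):
    # Vectorized membership test on a whole chunk, not on a single row.
    return np.isin(chunk, selection).astype('int8')


def parallel_isin(values, selection, n_jobs=4):
    # Hypothetical helper: one chunk per worker. array_split tolerates
    # lengths that are not divisible by n_jobs.
    chunks = np.array_split(values, n_jobs)
    parts = Parallel(n_jobs=n_jobs)(
        delayed(add_column)(chunk, selection) for chunk in chunks
    )
    return np.concatenate(parts)


rng = np.random.default_rng(0)
# 300001 rows on purpose: not divisible by 4.
df = pd.DataFrame({'All': rng.integers(50000, size=300001)})
selection = rng.integers(10000, size=10000)

df['Is_in_selection_parallel'] = parallel_isin(df['All'].to_numpy(), selection)

# Sanity check against the single-core result.
expected = np.isin(df['All'].to_numpy(), selection).astype('int8')
print(np.array_equal(df['Is_in_selection_parallel'].to_numpy(), expected))
```

Whether this beats the single-core call depends on the machine and the data size, so it is worth timing both on the real data.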
CodePudding user response:
Using a set for membership testing provided a 2.5x improvement on my system. This could be used in addition to parallel computations.
df = pd.DataFrame({'All': np.random.randint(50000, size=300000)})
selection = np.random.randint(10000, size=10000)
s1 = pd.Series(selection)
s2 = set(selection)
def orig(df, s):
    df['Is_in_selection_single'] = np.where(
        np.isin(df['All'], s), 1, 0).astype('int8')
    return sum(df['Is_in_selection_single'])
def modified(df, s):
    # Test membership against the argument s (here a set)
    df['Is_in_selection_single'] = df['All'].isin(s)
    return sum(df['Is_in_selection_single'])
Timing results:
%timeit orig(df, s1)
47.1 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit modified(df, s2)
19 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
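Since the real data consists of string User-Ids, here is a rough sketch of the same `Series.isin` approach on strings, cross-checked against a plain Python set lookup. The `user_` prefix, the variable names, and the generated IDs are made up for illustration. Timings on real data may differ from the integer example above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical string IDs standing in for the real User-Ids.
all_ids = 'user_' + pd.Series(rng.integers(50000, size=300000)).astype(str)
selected_ids = 'user_' + pd.Series(rng.integers(10000, size=10000)).astype(str)

df = pd.DataFrame({'All': all_ids})

# Membership test over the whole column in one vectorized call.
df['Is_in_selection'] = df['All'].isin(selected_ids).astype('int8')

# Cross-check against a plain Python set lookup, row by row.
lookup = set(selected_ids)
reference = np.array([uid in lookup for uid in df['All']], dtype='int8')
print(np.array_equal(df['Is_in_selection'].to_numpy(), reference))
```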
