Improve parallelization in pandas

I am trying to parallelize a function on a pandas DataFrame, and I am wondering why the parallel version is so much slower than the single-core solution. I am aware that parallelization has its costs, but I am curious whether there is a way to improve the code so that the parallel version would be faster.

In my case I have a list of 300,000 User-Ids (all strings) and need to check whether each User-Id is also present in another list containing only 10,000 entries.

As I cannot reproduce the original code here, I am giving an example with integers that shows the same performance problem:

import pandas as pd
import numpy as np
from joblib import Parallel, delayed
import time

df = pd.DataFrame({'All': np.random.randint(50000, size=300000)})
selection = pd.Series({'selection': np.random.randint(10000, size=10000)}).to_list()

t1=time.perf_counter()

df['Is_in_selection_single']=np.where(np.isin(df['All'], selection),1,0).astype('int8')
t2=time.perf_counter()
print(t2-t1)

def add_column(x):
    return(np.where(np.isin(x, selection),1,0))

df['Is_in_selection_parallel'] = Parallel(n_jobs=4)(delayed(add_column)(x) for x in df['All'].to_list())
t3=time.perf_counter()
print(t3-t2)

The printed timings (in seconds) are:

0.0307

53.07

which means the parallel version is roughly 1,700 times slower than the single-core one.

In my real-world example with the User-Ids, the single-core version takes 1 minute, but the parallel version had not finished after 15 minutes.

I need the parallelization because I have to run this operation several times, so the final script takes several minutes to run. Thank you for any suggestions!

CodePudding user response：

You are splitting the job into too many sub-jobs (one for each row), which creates a very large overhead. You should cut it into a small number of chunks instead:

parallel_result = Parallel(n_jobs=4)(delayed(add_column)(x) for x in np.split(df['All'].values, 4))
df['Is_in_selection_parallel'] = np.concatenate(parallel_result)

With 4 chunks, this was 50% faster than the non-parallel version on my platform.
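
Note that np.split raises an error when the array cannot be divided into equal parts, while np.array_split accepts uneven lengths. Below is a minimal sketch of the chunked approach for arbitrary lengths. The helper name isin_chunked and its parameters are hypothetical (they are not part of the code above), and it assumes that handing np.isin directly to delayed is acceptable for your data:

```python
import numpy as np
from joblib import Parallel, delayed


def isin_chunked(values, selection, n_chunks=4, n_jobs=4):
    """Membership test done in a few large chunks (hypothetical helper)."""
    # array_split tolerates lengths that are not a multiple of n_chunks.
    chunks = np.array_split(np.asarray(values), n_chunks)
    # One task per chunk, not one task per row.
    parts = Parallel(n_jobs=n_jobs)(
        delayed(np.isin)(chunk, selection) for chunk in chunks
    )
    # Stitch the boolean pieces back together in the original order.
    return np.concatenate(parts).astype("int8")


if __name__ == "__main__":
    values = np.arange(10)  # length 10 is not divisible by 4
    selection = [2, 3, 11]
    print(isin_chunked(values, selection, n_chunks=4, n_jobs=2))
```

Whether this beats the plain single-core call depends on the data size and the platform, so it is worth timing both before adopting it.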

CodePudding user response：

Using a set for membership testing gave a 2.5x improvement on my system. This can be combined with parallel computation.

import numpy as np
import pandas as pd

df = pd.DataFrame({'All': np.random.randint(50000, size=300000)})
selection = np.random.randint(10000, size=10000)

s1 = pd.Series(selection)
s2 = set(selection)

def orig(df, s):
    df['Is_in_selection_single'] = np.where(
        np.isin(df['All'], s), 1, 0).astype('int8')
    return sum(df['Is_in_selection_single'])

def modified(df, s):
    # Series.isin accepts a set directly, so pass the argument s through.
    df['Is_in_selection_single'] = df['All'].isin(s)
    return sum(df['Is_in_selection_single'])

Timing results:

%timeit orig(df, s1)
47.1 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit modified(df, s2)
19 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
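
Since the real data consists of string User-Ids, the same idea can be sketched with strings. This is only an illustration: the column name User_Id, the sample IDs, and the variable selected_ids are made up for the example and do not come from the original data:

```python
import pandas as pd

# Hypothetical string IDs standing in for the real User-Ids.
df = pd.DataFrame({"User_Id": ["u1", "u2", "u3", "u2", "u7"]})

# Build the set once and reuse it for every membership check.
selected_ids = {"u2", "u7", "u9"}

# Series.isin returns booleans; cast to int8 to get a 0/1 flag column.
df["Is_in_selection"] = df["User_Id"].isin(selected_ids).astype("int8")
print(df)
```

If the check has to be repeated several times against the same selection, keeping the set around avoids rebuilding it on each run.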