How to properly implement python multiprocessing for expensive image/video tasks?

I'm running on a pretty basic quad-core machine, where multiprocessing.cpu_count() returns 8, with something like:

from itertools import repeat
from multiprocessing import Pool


def expensive_function(list_of_values, some_param, another_param):
    do_some_python_pillow_tasks()
    do_some_ffmpeg_tasks()


if __name__ == '__main__':
    values = [
        ['a', 'b', 'c'],
        ['x', 'y', 'z'],
        # ...
        # there can be MANY items in this list, let's say 1000
    ]
    
    pool = Pool(processes=len(values))
    pool.starmap(
        expensive_function,
        zip(values, repeat('yada yada yada'), repeat('hello world')),
    )
    pool.close()

None of the 1,000 tasks will interfere with each other; in theory they can all be run at the same time.

Using multiprocessing.Pool definitely helps speed up the total duration, but am I using multiprocessing to the best of its ability? Are you supposed to pass in the total number of tasks (1000) to Pool(processes=?) or the number of CPUs (8)?

Ultimately I want all (potentially 1000) tasks to complete as fast as possible. This may be a stupid question, but can you utilize the GPU to help speed up processing?

CodePudding user response：

Using multiprocessing.Pool definitely helps speed up the total duration, but am I using multiprocessing to the best of its ability? Are you supposed to pass in the total number of tasks (1000) to Pool(processes=?) or the number of CPUs (8)?

Pool creates a set of CPython worker processes, and processes is the number of workers to create. Creating about 1000 processes is really not a good idea, since creating a process is expensive and the machine can only run 8 of them at a time anyway. I advise you to keep the default value (the number of CPUs reported by the system, 8 here), or to check whether using 4 processes is better in your case.
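As a rough illustration of that advice, here is a minimal sketch that leaves the pool size at its default and still feeds it every task. The body of expensive_function is a made-up stand-in for the real Pillow/ffmpeg work, and build_tasks and run_all are hypothetical helper names, not part of the question's code.

```python
import os
from itertools import repeat
from multiprocessing import Pool


def expensive_function(list_of_values, some_param, another_param):
    # Hypothetical stand-in for the real Pillow/ffmpeg work.
    return "-".join(list_of_values) + f"|{some_param}|{another_param}"


def build_tasks(values, some_param, another_param):
    # One argument tuple per task, in the shape starmap expects.
    return list(zip(values, repeat(some_param), repeat(another_param)))


def run_all(values, some_param, another_param, workers=None):
    # workers=None lets Pool choose its default size (the CPU count
    # reported by the system), no matter how many tasks there are.
    tasks = build_tasks(values, some_param, another_param)
    with Pool(processes=workers) as pool:
        # starmap blocks until every task has finished.
        return pool.starmap(expensive_function, tasks)


if __name__ == "__main__":
    values = [["a", "b", "c"], ["x", "y", "z"]]
    results = run_all(values, "yada yada yada", "hello world")
    print(os.cpu_count(), results)
```

To compare against 4 workers, call run_all(values, ..., workers=4) and time both variants on the real workload.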

This may be a stupid question, but can you utilize the GPU to help speed up processing?

No. You cannot use it transparently. You need to rewrite your code to use it, and this is generally pretty hard. However, ffmpeg may already use it. If so, running this task in parallel is unlikely to be much faster (it can actually even be slower), since the GPU is a shared resource and the multiple processes will compete for its use (GPU tasks are always massively parallel in practice).
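If you want to see which hardware acceleration methods your ffmpeg build knows about, a sketch like the following may help. It assumes an ffmpeg executable on PATH and that the output of ffmpeg -hwaccels is a header line followed by one method name per line; parse_hwaccels and list_hwaccels are hypothetical helper names.

```python
import shutil
import subprocess


def parse_hwaccels(output):
    # Assumed shape: a header line ending in ":" followed by one
    # method name per line.
    lines = [line.strip() for line in output.splitlines()]
    return [line for line in lines if line and not line.endswith(":")]


def list_hwaccels():
    # Returns None when no ffmpeg executable is found on PATH.
    if shutil.which("ffmpeg") is None:
        return None
    completed = subprocess.run(
        ["ffmpeg", "-hide_banner", "-hwaccels"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_hwaccels(completed.stdout)


if __name__ == "__main__":
    print(list_hwaccels())
```

A method showing up in that list only means the build supports it. Whether a given ffmpeg command actually uses the GPU depends on the options it is run with (for example -hwaccel).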

CodePudding user response：

Q : "... am I using multiprocessing to the best of its ability?"

A :
Well, that actually does not matter here at all.

You happen to have a use-case where so-called embarrassingly parallel process orchestration avoids most of the problems that would otherwise be present.

Python multithreading is irrelevant here, as it keeps all threads waiting one after another to acquire the central Python GIL-lock, so using it is rather an antipattern if you wish to gain processing speed here.

Python multiprocessing is inappropriate here, even for as few as 4 or 8 worker processes (let alone 1k), as it

first
spends (in the further context negligible) [TIME]- and [SPACE]-domain costs on spawning new, independent Python-interpreter processes, each copied full-scale, i.e. with all of its internal state and all of its data structures. Expect RAM/swap thrashing whenever the host's physical memory gets over-saturated with that many copies of the same things: the virtual-memory management service of the O/S then starts, concurrently with your "useful" work, to orchestrate swap-ins and swap-outs, because the process it has just scheduled needs data that cannot fit or stay in RAM. That data is then no longer N x 100 [ns] away from the CPU but Q x 10,000,000 [ns] away on the HDD. Yes, you read that correctly: re-reading your "own" data that was swapped away suddenly becomes many orders of magnitude slower, and the CPU becomes less available for your processing, as it also has to perform all the added swap I/O. Nasty, isn't it? Yet that is not all that hurts you.

next (and repeated for each of the 1,000 cases ...)
you will have to pay (CPU-wise, memory-I/O-wise and O/S-IPC-wise)
another awful penalty, this time for moving data (parameters) from the "main" Python-interpreter process to the "spawned" Python-interpreter process: data serialisation (at CPU and memory-I/O add-on costs), data moving (O/S IPC-service add-on costs; yes, data size matters, again) and data deserialisation (again at CPU and memory-I/O add-on costs). All of that is done just to make the data (parameters) somehow appear "inside" the other Python interpreter, whose GIL-lock will not compete with your central and other Python interpreters. That part is fine, yet at this awfully gigantic sum of add-on costs? Not so nice-looking once we understand the details, is it?
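How large this per-task serialise/deserialise cost is depends on the arguments, so it may be worth measuring for your own data. The sketch below times a round trip through the pickle module, which multiprocessing builds on for passing arguments. It does not include the IPC transfer itself, and measure_round_trip is a hypothetical helper name.

```python
import pickle
import time


def measure_round_trip(task_args, repeats=1000):
    # Serialise and deserialise the argument tuple `repeats` times
    # (repeats must be at least 1) and report the payload size in
    # bytes, the mean time per round trip in seconds, and the
    # restored object.
    start = time.perf_counter()
    for _ in range(repeats):
        payload = pickle.dumps(task_args)
        restored = pickle.loads(payload)
    elapsed = time.perf_counter() - start
    return len(payload), elapsed / repeats, restored


if __name__ == "__main__":
    args = (["a", "b", "c"], "yada yada yada", "hello world")
    size, seconds, _ = measure_round_trip(args)
    print(f"{size} bytes, {seconds * 1e6:.2f} microseconds per round trip")
```

Comparing that figure with the run time of a single Pillow/ffmpeg task shows how much of the total the parameter passing accounts for.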

What can be done instead?

a) split the list of independent values, as posted above, into, say, 4 parts (a quad-core CPU with 2 threads per core), and

b) let the embarrassingly parallel (independent) problem get solved in a pure-[SERIAL] fashion by 4 Python processes, each one launched on its respective quarter of the list

There will be zero add-on cost for doing so,
there will be zero add-on SER/DES penalty for distributing the data of 1000 tasks and recollecting their results, and
there will be a reasonably distributed CPU-core workload (thermal throttling will appear on all of them as the CPU-core temperatures grow, so nothing but sufficient CPU cooling can save us here anyway)
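Steps a) and b) could be sketched roughly as below: one script, launched four times, where each launch processes only its own share of the list in a plain serial loop. The script name process_part.py, the command-line convention, and the helpers select_part and run_part are all assumptions made for illustration, and expensive_function is again a stand-in for the real work.

```python
import sys


def expensive_function(list_of_values, some_param, another_param):
    # Hypothetical stand-in for the real Pillow/ffmpeg work.
    return "-".join(list_of_values)


def select_part(values, part_index, part_count):
    # Every part_count-th item starting at part_index: the parts are
    # disjoint and together cover the whole list.
    return values[part_index::part_count]


def run_part(values, part_index, part_count):
    # Plain serial loop: no pool and no argument passing between processes.
    return [
        expensive_function(item, "yada yada yada", "hello world")
        for item in select_part(values, part_index, part_count)
    ]


if __name__ == "__main__":
    # Assumed invocation: python process_part.py <part_index> <part_count>
    part_index = int(sys.argv[1]) if len(sys.argv) > 2 else 0
    part_count = int(sys.argv[2]) if len(sys.argv) > 2 else 1
    values = [["a", "b", "c"], ["x", "y", "z"]]
    print(run_part(values, part_index, part_count))
```

Under that assumed convention, the four processes would be started as python process_part.py 0 4, python process_part.py 1 4, and so on, each one building the same list and working through a different quarter of it.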

Except for using a magic wand, there is no other magic possible here.