Lower latency from webcam cv2.VideoCapture-CodePudding

I'm building an application using the webcam to control video games (kinda like a kinect). It uses the webcam (cv2.VideoCapture(0)), AI pose estimation (

What latency-causing steps do we work with?

Forgive, for a moment, a raw-sketch of where we accumulate each of the particular latency-related costs :


   CAM \____/                                     python code GIL-awaiting ~ 100 [ms] chopping
        |::|                                      python code calling a cv2.<function>()
        |::|   __________________________________________-----!!!!!!!-----------
        |::|    ^     2x                                 NNNNN!!!!!!!MOVES DATA!
        |::|    | per-call                               NNNNN!!!!!!!    1.THERE
        |::|    |     COST                               NNNNN!!!!!!!    2.BACK
        |::|    |           TTTT-openCV::MAT into python numpy.array
        |::|    |          ////       forMAT TRANSFORMER TRANSFORMATIONS
        USBx    |         ////                           TRANSFORMATIONS
        |::|    |        ////                            TRANSFORMATIONS
        |::|    |       ////                             TRANSFORMATIONS
        |::|    |      ////                              TRANSFORMATIONS
        |::|    |     ////                               TRANSFORMATIONS
    H/W oooo   _v____TTTT in-RAM openCV::MAT storage     TRANSFORMATIONS
       /    \        oooo ------ openCV::MAT object-mapper
       \    /        xxxx
 O/S--- °°°°         xxxx
 driver """" _____   xxxx
         \\\\    ^   xxxx ...... openCV {signed|unsigned}-{size}-{N-channels}
 _________\\\\___|___     __________________________________________
 openCV I/O      ^   PPPP                                 PROCESSING
            as F |   ....                                 PROCESSING
               A |   ...                                  PROCESSING
               S |   ..                                   PROCESSING
               T |   .                                    PROCESSING
            as   |   PPPP                                 PROCESSING
      possible___v___PPPP _____ openCV::MAT NATIVE-object PROCESSING

What latencies do we / can we fight _{( here )} against?

Hardware latencies could help, yet changing already acquired hardware could turn expensive

Software latencies of already latency-optimised toolboxes is possible, yet harder & harder

Design inefficiencies are the final & most common place, where latencies could get shaved-off

OpenCV ?
There is not much to do here. The problem is with the OpenCV-Python binding details:

_{... So when you call a function, say res = equalizeHist(img1,img2) in Python, you pass two numpy arrays and you expect another numpy array as the output. So these numpy arrays are converted to cv::Mat and then calls the equalizeHist() function in C . Final result, res will be converted back into a Numpy array. So in short, almost all operations are done in C which gives us almost same speed as that of C .}

This works fine "outside" a control-loop, not in our case, where both of the two transport-costs, transformation-costs and any of new or interim-data storage RAM-allocation-costs result in worsening our control-loop TAT.

So avoid any and all calls of OpenCV-native functions from Python-(behind the bindings' latency extra-miles)-side, no matter how tempting or sweet these may look on the first sight.

HUNDREDS-of-[ms] are a rather bitter cost of ignoring this advice.

Python ?
Yes, Python. Using Python interpreter introduces both latency per se, plus adds problems with concurrency-avoided processing, no matter how many cores does our hardware operate on ( while recent Py3 tries a lot to lower these costs under the interpreter-level software).

We can test & squeeze max out of the (still unavoidable, in 2022) GIL-lock interleaving - check the sys.getswitchinterval() and test increasing this amount for having less interleaved python-side processing ( tweaking is dependent on other your python-application ambitions ( GUI, distributed-computing tasks, python network-I/O workloads, python-HW-I/O-s, if applicable, etc )

RAM-memory-I/O costs ?
Our next major enemy. Using a least-sufficient-enough image-DATA-format, that MediaPipe can work with is the way forward in this segment.

Avoidable losses
All other (our) sins belong to this segment. Avoid any image-DATA-format transformations ( see above, cost may easily grow into HUNDREDS THOUSANDS of [us] just for converting an already acquired-&-formatted-&-stored numpy.array into just another colourmap)

MediaPipe
lists enumerated formats it can work with:

 // ImageFormat

  SRGB: sRGB, interleaved:   one byte for R,
                        then one byte for G,
                        then one byte for B for each pixel.

  SRGBA: sRGBA, interleaved: one byte for R,
                             one byte for G,
                             one byte for B,
                             one byte for alpha or unused.

  SBGRA: sBGRA, interleaved: one byte for B,
                             one byte for G,
                             one byte for R,
                             one byte for alpha or unused.

  GRAY8:        Grayscale,   one byte per pixel.

  GRAY16:       Grayscale,   one uint16 per pixel.

  SRGB48:  sRGB,interleaved, each component is a uint16.

  SRGBA64: sRGBA,interleaved,each component is a uint16.

  VEC32F1:                   One float per pixel.

  VEC32F2:                   Two floats per pixel.

So, choose the MVF -- the minimum viable format -- for gesture-recognition to work and downscale the amount of pixels as possible ( 400x600-GRAY8 would be my hot candidate )

Pre-configure ( not missing the cv.CAP_PROP_FOURCC details ) the native-side OpenCV::VideoCapture processing to do no more than just plain storing this MVF in a RAW-format on the native-side of the Acquisition-&-Pre-processing chain, so that no other post-process formatting takes place.

If indeed forced to ever touch the python-side numpy.array object, prefer to use vectorised & striding-tricks powered operations over .view()-s or .data-buffers, so as to avoid any unwanted add-on latency costs increasing the control-loop TAT.

Options?

eliminate any python-side calls ( as these cost you --2x--the-costs of data-I/O transformation costs ) by precisely configuring the native-side OpenCV processing to match the needed MediaPipe data-format
minimise, better avoid any blocking, if still too skewed control-loop, try using distributed-processing with moving raw-data into other process ( not necessarily a Python-interpreter ) on localhost or within a sub-ms LAN domain ( further tips available here )
try to fit the hot-DATA RAM-footprints to match you CPU-Cache Hierarchy cache-lines' sizing & associativity details ( see this )

What latency-causing steps do we work with?

What latencies do we / can we fight ( here ) against?

Options?

What latencies do we / can we fight _{( here )} against?