Learning faster compositing the hard way

🎯 One of the goals for linuxgreenscreen was to use less than one core on my laptop without losing too much visual (and temporal) quality. I had never used OpenCV or any workflows like MediaPipe, and I hadn’t even used Python 🐍 , so this was a big learning experience.

My initial effort used about three cores of my Ryzen 9 4900HS at 30 FPS for 1280x720; by the time I had finished, I was using less than 80% of one core 🥳 .

Here’s a quick summary of the important lessons along the way (before I go into more detail):

  1. The LIVE_STREAM (asynchronous) running mode for MediaPipe’s segmentation used less than half the CPU compared to VIDEO; i.e. the CPU gets a nice rest 😴 (see the mp.tasks.vision.RunningMode documentation).
  2. MediaPipe’s automatic resizing for segmentation was about 20% more CPU hungry than resizing manually with OpenCV 📉 .
  3. OpenCV’s in-place operations are brilliant, but you must supply arrays that are contiguous in memory.
  4. numpy can wonderfully elide some operations and recycle memory, but it seems haphazard to me.

With regards to the green screen effect, allocating and releasing storage for images was surprisingly expensive 💵 at higher frame rates. It was well worth the effort to understand those last two items in particular, so let’s discuss those further.
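To illustrate the pattern that avoids this cost: allocate buffers once, outside the per-frame loop, and write into them with in-place operations. The shapes and loop below are illustrative:

```python
import numpy as np

h, w = 720, 1280
# allocate the working buffer once, before the frame loop
out_buf = np.empty((h, w, 3), dtype=np.uint8)
address = out_buf.__array_interface__['data'][0]

for _ in range(3):  # stand-in for the camera capture loop
    frame = np.random.randint(0, 256, size=(h, w, 3), dtype=np.uint8)
    # write into the pre-allocated buffer instead of allocating per frame
    np.copyto(out_buf, frame)
    out_buf //= 2  # e.g. darken the frame in place

# the same block of memory was reused for every frame
print(address == out_buf.__array_interface__['data'][0])  # True
```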

In-place OpenCV operations

Many OpenCV image processing operations can be invoked with in-place output via the dst argument. If a function does not have a dst argument, the operation cannot be performed in place.

📝 Here’s a quick example using cv2.blur

import contextlib
import os

import cv2
import numpy as np

# generate random input
im_shape = (2160, 3840, 3)
x = np.random.randint(low=0, high=255, size=im_shape, dtype=np.uint8)
# allocate space for output
y = np.zeros(im_shape, dtype=np.uint8)
# store the address of the data to check later
address_y = y.__array_interface__['data'][0]
# not necessary to redirect stdout, but keeps output clean for this example
with open(os.devnull, 'w') as devnull:
    with contextlib.redirect_stdout(devnull):
        cv2.blur(src=x, ksize=(1,1), dst=y)

We can check that the address of y hasn’t changed after the call to cv2.blur:

print(address_y - y.__array_interface__['data'][0])
# 0 ... address is the same before and after

The kernel of the blur operator is the identity in this example, so we can also check that y has the correct output (i.e. is a copy of x):

print(np.sum(x - y))
# 0 .... x and y are now the same as (1x1) blur kernel is the identity

We can also set the output to be the same as the input:

# now perform an actual blur on `y`
cv2.blur(src=y, ksize=(2,2), dst=y)
print(address_y - y.__array_interface__['data'][0])
# 0 ... address still the same
print(np.sum(x - y))
# non-zero ... the blur worked!

Painful lesson one: valid arguments

The API places the responsibility on us to ensure valid arguments are passed. In particular:

  1. The array passed as dst must have the same shape and type as the usual return value (without dst), otherwise OpenCV will silently construct a new array for the output.
  2. A numpy array view:
    • Cannot be used as a destination.
    • May be silently copied if used as a source.

The second item comes down to whether the data is contiguous in memory or not. I cannot find an official source, but this answers.opencv.org post suggests:

OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY

If we operate on one channel of an image:

z = cv2.blur(src=x[:,:,0], ksize=(1,1))

then OpenCV will copy and re-arrange the data in the view (created via src=x[:,:,0]) into a (new) contiguous block. On the other hand, you can operate on a row just fine:

z = cv2.blur(src=x[0,:,:], ksize=(1,1))

Contiguity can be checked via the flags attribute of a numpy array:

x[0,:,:].flags['C_CONTIGUOUS']
# True
x[:,:,0].flags['C_CONTIGUOUS']
# False
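If you do need to hand a channel view to OpenCV, np.ascontiguousarray makes the copy explicit and up front; it is also a no-op when the input is already contiguous:

```python
import numpy as np

x = np.random.randint(low=0, high=255, size=(100, 200, 3), dtype=np.uint8)

# explicit copy of the channel view into a fresh contiguous block
channel = np.ascontiguousarray(x[:, :, 0])
print(channel.flags['C_CONTIGUOUS'])  # True

# an already-contiguous array is returned unchanged, no copy made
row = x[0, :, :]
print(np.ascontiguousarray(row) is row)  # True
```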

In-place numpy assignment

numpy can do an in-place assignment via y[:] = x or np.copyto(y, x). This usually performs significantly faster than y = x.copy().

📝 Here’s an example where np.copyto performs faster.

import cProfile

# generate random input
im_shape = (900, 1600, 3)
foo = np.random.randint(low=0, high=256, size=im_shape, dtype=np.uint8)
# pre-allocated buffer
y_buf = np.ndarray(im_shape, dtype=np.uint8)

def copy_to_local(x, y_buf):
    y = x.copy()

def copy_to_buffer(x, y_buf):
    np.copyto(y_buf, x)

with cProfile.Profile() as pr:
    for j in np.arange(5000):
        # if we remove the .copy(), the performance _within_ the functions
        # above changes drastically
        copy_to_local(foo.copy(), y_buf)
        copy_to_buffer(foo.copy(), y_buf)
    pr.print_stats(sort=2)

Depending on whether a copy was passed or not, the per-frame time elapsed in each copy_* function was:

argument     to_local   to_buffer
foo.copy()   1.48ms     0.75ms
foo          0.46ms     0.45ms

Painful lesson two: inconsistent results

Sometimes, using a smaller array, e.g. 600x800x3, I observed equal performance in all scenarios if I switched np.copyto(y_buf, x) to y_buf[:] = x!

I prefer np.copyto from the perspective of clarity; however, using y[:] = x seems marginally better in some cases, though I don’t understand this result clearly.
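Whichever spelling you prefer, both write into the existing buffer rather than rebinding the name, which is easy to confirm with the address check from earlier (sizes here are illustrative):

```python
import numpy as np

src = np.random.randint(low=0, high=256, size=(600, 800, 3), dtype=np.uint8)
dst = np.empty_like(src)
address = dst.__array_interface__['data'][0]

dst[:] = src         # slice assignment writes in place
print(address == dst.__array_interface__['data'][0])  # True

np.copyto(dst, src)  # equivalent effect, also in place
print(address == dst.__array_interface__['data'][0])  # True
print(np.array_equal(dst, src))                       # True
```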

Painful lesson three: vectorisation

Another quirk is vectorisation and what I call “work in contiguous memory”.

Sometimes it’s faster to copy non-contiguous memory into a contiguous block, perform vectorised in-place operations, then copy back to the non-contiguous destination. numpy’s elision can be used here, too. Here’s a trivial example where we want to add then multiply one slice of an array:

foo_slice = foo[:,:,2]
y_slice = y_buf[:,:,2]
work_array = np.ndarray(foo_slice.shape, dtype=np.uint8)

def work_non_continuous(x, y_slice):
    np.copyto(y_slice, x)
    y_slice += 1
    y_slice *= 2

def work_continuous(x, y_slice, work):
    np.copyto(work, x)
    work += 1
    work *= 2
    np.copyto(y_slice, work)

def elided_work(x, y_slice):
    z = 2 * (x + 1)
    np.copyto(y_slice, z)

with cProfile.Profile() as pr:
    for j in np.arange(5000):
        work_non_continuous(foo_slice, y_slice)
        work_continuous(foo_slice, y_slice, work_array)
        elided_work(foo_slice, y_slice)
    pr.print_stats(sort=2)

Can you guess which is quicker? These are the times per call for each function on my machine:

work_non_continuous   work_continuous   elided_work
3.85ms                1.90ms            2.35ms

Numpy array to virtual video (bonus)

linuxgreenscreen outputs to a virtual v4l2loopback device using write, which creates another opportunity for inadvertent copies. In the video examples that follow we’ll use the interface to write provided by the v4l2py Python module which is also used by linuxgreenscreen.

Firstly, create a virtual device with id 10, e.g. modprobe v4l2loopback devices=1 video_nr=10. To open the device for output using v4l2py:

from v4l2py.device import BufferType, Device, PixelFormat, VideoOutput

(width, height) = (1920, 1080)

sink = Device.from_id(10)
sink.open()
sink.set_format(BufferType.VIDEO_OUTPUT, width, height, PixelFormat.RGB24)
sink_video = VideoOutput(sink)

To benchmark, we’ll use a function that repeatedly calls another writing function (write_fn) a total of n times with a short break between:

import numpy as np
import time

def run_n_frame(n, write_fn):
    for x in np.arange(0, n):
        buffer = np.random.randint(
            0, high=256, size=(height, width, 3), dtype=np.uint8
        )
        write_fn(buffer)
        time.sleep(1 / 60)

The first write_fn that we’ll pass will convert the numpy array using tobytes and pass the result to sink_video.write():

def write_via_tobytes(buffer):
    sink_video.write(buffer.tobytes())

Before we look at another way to do this, the tobytes documentation tells us that this will create a copy of the data:

Constructs Python bytes showing a copy of the raw contents of data memory.

We’d like to avoid a copy and pass the underlying array data directly to write. One option is to use the data attribute, another is to use ctypes as suggested by an answer to ‘How to convert numpy arrays to bytes/BytesIO without copying?’:

import ctypes

def write_via_data_attr(buffer):
    sink_video.write(buffer.data)

def write_via_ctypes(buffer):
    buffer_ctype = ctypes.c_char * (width * height * 3)
    memory_block = (buffer_ctype).from_address(buffer.ctypes.data)
    sink_video.write(memory_block)

Using cProfile.run('run_n_frame(200, <each write fn>)'), the elapsed time for each method (per frame) was:

tobytes   data     ctypes
5.08ms    1.02ms   1.08ms

This confirms that tobytes creates more work via a copy. I can’t think of any reason to prefer the ctypes approach to using the data attribute, so I would recommend the latter as it appears simpler.
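As a sanity check that the data attribute really is zero-copy while tobytes is not: the memoryview tracks later writes to the array, whereas the bytes object does not:

```python
import numpy as np

buffer = np.zeros((2, 2, 3), dtype=np.uint8)
frozen = buffer.tobytes()  # copies the data
shared = buffer.data       # zero-copy memoryview

buffer[0, 0, 0] = 99
print(frozen[0])           # 0  ... the copy is unchanged
print(shared[0, 0, 0])     # 99 ... the view sees the write
```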

Conclusions

This was hard, and documentation 📚 was patchy at best. In summary:

  1. The success of the LIVE_STREAM mode in MediaPipe was a fluke 🍀. It is not documented that this should use less CPU.

  2. The OpenCV documentation also doesn’t demonstrate successful vs unsuccessful ‘in place’ usage, and the requirements for contiguous memory are not in the OpenCV Python binding guide. Elsewhere, in its optimisation recommendations, it gives the potentially bad advice:

    Never make copies of an array unless it is necessary. Try to use views instead. Array copying is a costly operation.

    NOTE: We have shown that using a view as an input argument will create a copy if the view is not contiguous.

  3. numpy and Python’s assignment syntax is a source of amusement: copyto can degrade the performance of out-of-place operations in other parts of the code, so y[:] = seems more reliable.

  4. numpy appears to take advantage of vectorisation when operations are on contiguous blocks of memory; eliding, or copying to a contiguous block (then copying back, provided everything is re-used), can be faster. None of this is documented clearly.

That’s all I’ve got for now.

Update: I was going to return to linuxgreenscreen to cover using memory-mapped files supplied by v4l2 to reduce the number of copy operations further; however, I contacted Tiago Coutinho and we worked out how to incorporate this into linuxpy. I’ll revisit this when I start work on an OBS plugin.

👋