Learning faster compositing the hard way

One of the goals 🎯 for linuxgreenscreen was to use less than one core on my laptop without losing too much visual (and temporal) quality. I had never used OpenCV or any workflows like MediaPipe, and I hadn’t even used python 🐍, so this was a big learning experience and I’d like to share some of the lessons here.

My initial effort used about three cores of my Ryzen 9 4900HS at 30 FPS for 1280x720; by the time I had finished, it was using about 80-90% of one core 🥳.

I learnt some important things about MediaPipe’s segmentation but more generally I learnt some painful lessons about ‘in place’ operations in both OpenCV and numpy.

  1. MediaPipe’s segmentation task has two running modes for video: running_mode.VIDEO and running_mode.LIVE_STREAM (see the mp.tasks.vision.RunningMode documentation). VIDEO appears to busy-wait, which eats CPUs alive, whereas LIVE_STREAM uses less than half the CPU by sleeping 😴.
  2. The (input) tensor used by the segmentation model is 144x256x3 in landscape mode (or 256x256x3 for square). MediaPipe automatically resizes any input and returns output at the original size; however, shrinking the frame yourself with cv2.resize first can lighten the load: in early tests the segmentation time went from ~17ms to ~12ms with a 20% relative drop in CPU usage 📉 (the first two points are sketched just after this list).
  3. OpenCV’s in-place operations mostly work well, with some caveats about the API; numpy often recycles memory, but there are quirky situations where it does not.
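
📝 To make the first two points concrete, here’s a minimal sketch (not linuxgreenscreen’s actual code) of creating a segmenter in LIVE_STREAM mode and shrinking each frame before handing it over. The model path, the callback and the synthetic frame are placeholders of mine:

import cv2
import mediapipe as mp
import numpy as np
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

def on_result(result, image, timestamp_ms):
    # result.category_mask arrives here asynchronously; composite with it as needed
    pass

options = vision.ImageSegmenterOptions(
    # 'selfie_segmenter.tflite' is a placeholder path to the downloaded model
    base_options=python.BaseOptions(model_asset_path='selfie_segmenter.tflite'),
    running_mode=vision.RunningMode.LIVE_STREAM,  # sleeps rather than busy-waiting
    output_category_mask=True,
    result_callback=on_result,
)

with vision.ImageSegmenter.create_from_options(options) as segmenter:
    # stand-in for a 1280x720 webcam frame
    frame = np.random.randint(0, 256, size=(720, 1280, 3), dtype=np.uint8)
    # shrink to the model's landscape input size ourselves before segmenting
    small = cv2.resize(frame, (256, 144), interpolation=cv2.INTER_AREA)
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=small)
    segmenter.segment_async(mp_image, 0)  # timestamp in ms, must increase per frame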

Regarding that final point: allocating and releasing storage for images can be surprisingly expensive 💵 at higher frame rates, so it was ultimately worth the effort to experiment and benchmark these.

In-place OpenCV operations

Many OpenCV image processing operations can write their output in place via their dst argument. If a function does not have a dst argument then it cannot be performed in place.

📝 Here’s a quick example using cv2.blur

import contextlib
import os

import cv2
import numpy as np

# generate random input
im_shape = (2160, 3840, 3)
x = np.random.randint(low=0, high=255, size=im_shape, dtype=np.uint8)
# allocate space for output
y = np.zeros(im_shape, dtype=np.uint8)
# store the address of the data to check later
address_y = y.__array_interface__['data'][0]
# not necessary to redirect stdout, but keeps output clean for this example
with open(os.devnull, 'w') as devnull:
    with contextlib.redirect_stdout(devnull):
        cv2.blur(src=x, ksize=(1,1), dst=y)

We can check that the address of y hasn’t changed after the call to cv2.blur:

print(address_y - y.__array_interface__['data'][0])
# 0 ... address is the same before and after

The kernel is the identity in this example, so we can also check that y has the correct output (i.e. is a copy of x):

print(np.sum(x - y))
# 0 ... x and y are now the same, as the (1x1) blur kernel is the identity

You can also set the output to be the same as the input:

# now perform an actual blur on `y`
cv2.blur(src=y, ksize=(2,2), dst=y)
print(address_y - y.__array_interface__['data'][0])
# 0 ... address still the same
print(np.sum(x - y))
# non-zero ... the blur worked!

Painful lessons: part one

The API places the responsibility on us to pass valid arguments, and these rules were seriously hard for me to learn:

  1. The array passed as dst must have the same shape and type as the usual return value (without dst); otherwise OpenCV will silently not place the result in the array (demonstrated just below).
  2. A numpy array view:
    • Cannot be used as a destination.
    • May be silently copied if used as a source.
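
📝 A quick sketch of the first lesson, reusing x from the earlier example (the wrongly-shaped destination is my own contrivance):

# a destination with the wrong shape is quietly ignored
bad_dst = np.zeros((1080, 1920, 3), dtype=np.uint8)
result = cv2.blur(src=x, ksize=(2, 2), dst=bad_dst)
print(np.count_nonzero(bad_dst))
# 0 ... nothing was written into bad_dst
print(result.shape)
# (2160, 3840, 3) ... the result went to a freshly allocated array instead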

The second item is related to whether the data is contiguous in memory or not. I cannot find an official source, but this answers.opencv.org post suggests:

OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY

If you operate on one channel of an image:

z = cv2.blur(src=x[:,:,0], ksize=(1,1))

then OpenCV has to copy and re-arrange the data in the view created by src=x[:,:,0] into a (new) contiguous block. On the other hand, you can operate on a row just fine:

z = cv2.blur(src=x[0,:,:], ksize=(1,1))

Contiguity can be checked via the flags attribute of a numpy array:

x[0,:,:].flags['C_CONTIGUOUS']
# True
x[:,:,0].flags['C_CONTIGUOUS']
# False
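
If a single channel really is needed, one workaround (my suggestion, not something from the OpenCV docs) is to make the copy explicit, so at least the cost is visible and under your control:

channel = np.ascontiguousarray(x[:, :, 0])  # explicit, one-off copy
channel.flags['C_CONTIGUOUS']
# True
z = cv2.blur(src=channel, ksize=(1, 1))  # no hidden copy inside the wrapper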

User beware 🚩

In-place numpy assignment

Another discovery was that numpy can do an in-place assignment via y[:] = x, which in quirky circumstances can perform significantly faster than y = x.copy(); in others it may be marginally slower. It seems to be somehow related to caching.

📝 Here’s an example where y[:] = performs faster. Start with a clean python terminal:

import cProfile

import numpy as np

# generate random input
im_shape = (900, 1600, 3)
foo = np.random.randint(low=0, high=256, size=im_shape, dtype=np.uint8)
# pre-allocated buffer
y_buf = np.ndarray(im_shape, dtype=np.uint8)

def copy_to_local(x, y_buf):
    y = x.copy()

def copy_to_buffer(x, y_buf):
    y_buf[:] = x

with cProfile.Profile() as pr:
    for j in np.arange(1000):
        # if we remove the .copy(), the performance _within_ the functions
        # above changes drastically
        copy_to_local(foo.copy(), y_buf)
        copy_to_buffer(foo.copy(), y_buf)
    pr.print_stats(sort=2)

Depending on whether a copy was passed or not, the per-frame time elapsed in each copy_* function was:

argument     copy_to_local   copy_to_buffer
foo.copy()   1.78ms          0.36ms
foo          0.21ms          0.20ms

After running the same code as above with a smaller array (e.g. 600x800x3) and then returning to the larger array, the performance difference almost disappears 😮, which feels a lot like a ‘cache’ issue to me.

Painful lessons: part two

While it did take me quite a while to spot this, it’s still almost impossible to determine where it actually makes a difference. There doesn’t seem to be much harm in using y[:] = or np.copyto, so I’ve used them extensively in linuxgreenscreen; however, in a later post I will discuss a situation where they appear worse.
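
For reference, np.copyto is the explicit spelling of the same slice assignment (reusing foo and y_buf from the benchmark above):

np.copyto(y_buf, foo)  # same effect as y_buf[:] = foo; writes into the existing buffer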

Numpy array to virtual video (bonus)

linuxgreenscreen outputs to a virtual v4l2loopback device using write, which creates another opportunity for inadvertent copies. In the examples that follow we’ll use the write interface provided by the v4l2py python module, which linuxgreenscreen also uses.

Firstly, create a virtual device with id 10, e.g. modprobe v4l2loopback devices=1 video_nr=10. To open the device for output using v4l2py:

from v4l2py.device import BufferType, Device, PixelFormat, VideoOutput

(width, height) = (1920, 1080)

sink = Device.from_id(10)
sink.open()
sink.set_format(BufferType.VIDEO_OUTPUT, width, height, PixelFormat.RGB24)
sink_video = VideoOutput(sink)

To benchmark, we’ll use a function that repeatedly calls another writing function (write_fn) a total of n times with a short break between:

import numpy as np
import time

def run_n_frame(n, write_fn):
    for x in np.arange(0, n):
        buffer = np.random.randint(
            0, high=256, size=(height, width, 3), dtype=np.uint8
        )
        write_fn(buffer)
        time.sleep(1 / 60)

The first write_fn that we’ll pass will convert the numpy array using tobytes and pass the result to sink_video.write():

def write_via_tobytes(buffer):
    sink_video.write(buffer.tobytes())

Before we look at another way to do this, the tobytes documentation tells us that this will create a copy of the data:

Constructs Python bytes showing a copy of the raw contents of data memory.

We’d like to avoid a copy and pass the underlying array data directly to write. One option is to use the data attribute, another is to use ctypes as suggested by an answer to ‘How to convert numpy arrays to bytes/BytesIO without copying?’:

import ctypes

def write_via_data_attr(buffer):
    sink_video.write(buffer.data)

def write_via_ctypes(buffer):
    buffer_ctype = ctypes.c_char * (width * height * 3)
    memory_block = (buffer_ctype).from_address(buffer.ctypes.data)
    sink_video.write(memory_block)

Using cProfile.run('run_n_frame(200, <each write fn>)'), the per-frame elapsed time for each method was:

tobytes   data     ctypes
5.08ms    1.02ms   1.08ms

This confirms that tobytes creates more work via a copy. I can’t think of any reason to prefer the ctypes approach to using the data attribute, so I would recommend the latter as it appears simpler.

Conclusions

This was hard, and I felt like the documentation let me down 📚:

  1. The success of the LIVE_STREAM mode in MediaPipe was a fluke 🍀. I can’t be 100% confident in my explanation of the difference between it and VIDEO mode. It’s not documented.

  2. The OpenCV documentation also doesn’t demonstrate successful vs unsuccessful ‘in place’ usage; silent failures and the requirement for contiguous memory are not mentioned in the OpenCV Python binding guide. Elsewhere, in the optimisation recommendations, it offers the potentially bad advice:

    Never make copies of an array unless it is necessary. Try to use views instead. Array copying is a costly operation.

    Using a view as an input argument will create a copy if the view is not contiguous!

  3. Numpy and python’s assignment syntax is always a source of amusement. It honestly feels like a hack to use y[:] = to avoid memory allocations, and it’s unclear when it is preferred. Based on duckduckgo results, this doesn’t seem like common usage.

That’s all I’ve got for now.

I’m going to return to linuxgreenscreen to cover another way to write using memory mapped files supplied by v4l2. It makes a marginal difference in performance and paves the way for future work towards an OBS plugin.

👋
