Learning faster compositing the hard way
One of the goals 🎯 for linuxgreenscreen was to use less than one core on my laptop without losing too much visual (and temporal) quality. I had never used OpenCV or any workflows like MediaPipe, and I hadn’t even used Python 🐍, so this was a big learning experience and I’d like to share some of the lessons here.
My initial effort used about three cores of my Ryzen 9 4900HS at 30 FPS for 1280x720; by the time I had finished, I was using about 80-90% of one core 🥳.
I learnt some important things about MediaPipe’s segmentation but more generally I learnt some painful lessons about ‘in place’ operations in both OpenCV and numpy.
- MediaPipe’s segmentation task has two running modes for video: `running_mode.VIDEO` and `running_mode.LIVE_STREAM` (see the `mp.tasks.vision.RunningMode` documentation). `VIDEO` seems, I think, to have busy waits that eat CPUs alive, whereas `LIVE_STREAM` uses less than half the CPU by sleeping 😴.
- The (input) tensor used by the segmentation model is 144x256x3 in landscape mode (or 256x256x3 for square). MediaPipe automatically resizes any input, and returns output at the original size; however, using OpenCV’s `cv2.resize` instead can lighten the load: in early tests the segmentation time went from ~17ms to ~12ms, with a 20% relative drop in CPU usage 📉.
- OpenCV’s in-place operations mostly work well, with some caveats about the API, while numpy often recycles memory, but there are quirky situations where it does not.
Regarding that final point: allocating and releasing storage for images can be surprisingly expensive 💵 at higher frame rates, so it was ultimately worth the effort to experiment and benchmark these.
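To get a feel for this cost on your own machine, here’s a minimal numpy-only benchmark sketch (the frame size and iteration count are arbitrary choices of mine, not from linuxgreenscreen):

```python
import timeit

import numpy as np

# a 720p RGB frame, similar in size to the frames discussed above
im_shape = (720, 1280, 3)
src = np.random.randint(0, 256, size=im_shape, dtype=np.uint8)

def alloc_each_frame():
    # allocates fresh storage on every call, as a naive per-frame copy would
    return src.copy()

# pre-allocated buffer that is overwritten in place on every call
buf = np.empty(im_shape, dtype=np.uint8)

def reuse_buffer():
    np.copyto(buf, src)

n = 200
t_alloc = timeit.timeit(alloc_each_frame, number=n) / n * 1e3
t_reuse = timeit.timeit(reuse_buffer, number=n) / n * 1e3
print(f"allocate per frame: {t_alloc:.2f}ms, reuse buffer: {t_reuse:.2f}ms")
```

Results vary a lot with array size and cache behaviour (as discussed later in this post), so it’s worth running this with your actual frame dimensions rather than trusting any single number.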
In-place OpenCV operations
Many OpenCV image processing operations can be invoked with in-place output via their `dst` argument. If a function does not have a `dst` argument then it cannot be performed in place.
📝 Here’s a quick example using `cv2.blur`:

```python
import contextlib
import os

import cv2
import numpy as np

# generate random input
im_shape = (2160, 3840, 3)
x = np.random.randint(low=0, high=255, size=im_shape, dtype=np.uint8)

# allocate space for output
y = np.zeros(im_shape, dtype=np.uint8)

# store the address of the data to check later
address_y = y.__array_interface__['data'][0]

# not necessary to redirect stdout, but keeps output clean for this example
with open(os.devnull, 'w') as devnull:
    with contextlib.redirect_stdout(devnull):
        cv2.blur(src=x, ksize=(1, 1), dst=y)
```
We can check that the address of `y` hasn’t changed after the call to `cv2.blur`:
```python
print(address_y - y.__array_interface__['data'][0])
# 0 ... address is the same before and after
```
The kernel is the identity in this example, so we can also check that `y` has the correct output (i.e. is a copy of `x`):
```python
print(np.sum(x - y))
# 0 ... x and y are now the same as the (1,1) blur kernel is the identity
```
You can also set the output to be the same as the input:
```python
# now perform an actual blur on `y`
cv2.blur(src=y, ksize=(2, 2), dst=y)
print(address_y - y.__array_interface__['data'][0])
# 0 ... address still the same
print(np.sum(x - y))
# non-zero ... the blur worked!
```
Painful lessons: part one
The API places the responsibility on us to ensure valid arguments are passed, and these lessons were seriously hard for me to learn:
- The array passed as `dst` must have the same shape and type as the usual return value (without `dst`), otherwise OpenCV will silently not place the result in the array.
- A numpy array view:
  - Cannot be used as a destination.
  - May be silently copied if used as a source.
The second item is related to whether the data is contiguous in memory or not. I cannot find an official source, but this answers.opencv.org post suggests:
OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY
If you operate on one channel of an image:

```python
z = cv2.blur(src=x[:,:,0], ksize=(1,1))
```

then OpenCV will have to copy and re-arrange the data in the view created via `src=x[:,:,0]` into a (new) contiguous block. On the other hand, you can operate on a row just fine:

```python
z = cv2.blur(src=x[0,:,:], ksize=(1,1))
```
Contiguity can be checked via the `flags` attribute of a numpy array:

```python
x[0,:,:].flags['C_CONTIGUOUS']
# True
x[:,:,0].flags['C_CONTIGUOUS']
# False
```
User beware 🚩
In-place numpy assignment
Another discovery was that numpy can do an in-place assignment via `y[:] = x`, which can perform significantly faster than `y = x.copy()` in quirky circumstances. In others it may perform marginally slower. It’s somehow related to numpy’s usage of cache.
📝 Here’s an example where `y[:] =` performs faster. Start with a clean Python terminal:
```python
import cProfile

import numpy as np

# generate random input
im_shape = (900, 1600, 3)
foo = np.random.randint(low=0, high=256, size=im_shape, dtype=np.uint8)

# pre-allocated buffer
y_buf = np.ndarray(im_shape, dtype=np.uint8)

def copy_to_local(x, y_buf):
    y = x.copy()

def copy_to_buffer(x, y_buf):
    y_buf[:] = x

with cProfile.Profile() as pr:
    for j in np.arange(1000):
        # if we remove the .copy(), the performance _within_ the functions
        # above changes drastically
        copy_to_local(foo.copy(), y_buf)
        copy_to_buffer(foo.copy(), y_buf)

pr.print_stats(sort=2)
```
Depending on whether a copy was passed or not, the per-frame time elapsed in each `copy_` function was:

| argument | `to_local` | `to_buffer` |
|---|---|---|
| `foo.copy()` | 1.78ms | 0.36ms |
| `foo` | 0.21ms | 0.20ms |
After running the same code as above with a smaller array (e.g. 600x800x3), and then returning to the larger array, the performance difference almost disappears 😮; feels a lot like a ‘cache’ issue to me.
Painful lessons: part two
While it did take me quite a while to spot this, it’s still almost impossible to determine where it actually makes a difference. There doesn’t seem to be much harm in using `y[:] =` or `np.copyto`, so I’ve used them extensively in linuxgreenscreen; however, in a later post I will discuss a situation where it appears worse.
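As a concrete illustration (my own toy example, not from linuxgreenscreen), here’s the difference between slice assignment, `np.copyto`, and plain rebinding:

```python
import numpy as np

x = np.arange(12, dtype=np.uint8).reshape(3, 4)
y = np.zeros_like(x)
address_before = y.__array_interface__['data'][0]

# slice assignment writes into y's existing buffer rather than rebinding y
y[:] = x
# np.copyto does the same thing with a function-call spelling
np.copyto(y, x)

same_buffer = address_before == y.__array_interface__['data'][0]
print(same_buffer)           # True ... no new allocation happened
print(np.array_equal(y, x))  # True ... contents were copied

# plain assignment, by contrast, only rebinds the name: y becomes x itself
y = x
print(y is x)  # True ... no data was copied at all
```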
Numpy array to virtual video (bonus)
linuxgreenscreen outputs to a virtual v4l2loopback device using `write`, which creates another opportunity for inadvertent copies. In the video examples that follow we’ll use the interface to `write` provided by the v4l2py Python module, which is also used by linuxgreenscreen.
Firstly, create a virtual device with id `10`, e.g. `modprobe v4l2loopback devices=1 video_nr=10`. To open the device for output using v4l2py:
```python
from v4l2py.device import BufferType, Device, PixelFormat, VideoOutput

(width, height) = (1920, 1080)

sink = Device.from_id(10)
sink.open()
sink.set_format(BufferType.VIDEO_OUTPUT, width, height, PixelFormat.RGB24)
sink_video = VideoOutput(sink)
```
To benchmark, we’ll use a function that repeatedly calls another writing function (`write_fn`) a total of `n` times with a short break between:
```python
import time

import numpy as np

def run_n_frame(n, write_fn):
    for x in np.arange(0, n):
        buffer = np.random.randint(
            0, high=256, size=(height, width, 3), dtype=np.uint8
        )
        write_fn(buffer)
        time.sleep(1 / 60)
```
The first `write_fn` that we’ll pass will convert the numpy array using `tobytes` and pass the result to `sink_video.write()`:
```python
def write_via_tobytes(buffer):
    sink_video.write(buffer.tobytes())
```
Before we look at another way to do this, the `tobytes` documentation tells us that this will create a copy of the data:
Constructs Python bytes showing a copy of the raw contents of data memory.
We’d like to avoid a copy and pass the underlying array data directly to `write`. One option is to use the `data` attribute, another is to use ctypes as suggested by an answer to ‘How to convert numpy arrays to bytes/BytesIO without copying?’:
```python
import ctypes

def write_via_data_attr(buffer):
    sink_video.write(buffer.data)

def write_via_ctypes(buffer):
    buffer_ctype = ctypes.c_char * (width * height * 3)
    memory_block = (buffer_ctype).from_address(buffer.ctypes.data)
    sink_video.write(memory_block)
```
Using `cProfile.run('run_n_frame(200, <each write fn>)')` the elapsed time for each method (per frame) was:

| `tobytes` | `data` | `ctypes` |
|---|---|---|
| 5.08ms | 1.02ms | 1.08ms |
This confirms that `tobytes` creates more work via a copy. I can’t think of any reason to prefer the `ctypes` approach to using the `data` attribute, so I would recommend the latter as it appears simpler.
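As a numpy-only aside (my own toy example), you can confirm that the `data` attribute shares memory with the array while `tobytes` snapshots it:

```python
import numpy as np

buffer = np.zeros((4, 4, 3), dtype=np.uint8)

# `data` exposes the array's existing memory as a memoryview -- no copy
view = buffer.data
print(view.obj is buffer)  # True ... the view wraps the original array

# writes to the array are visible through the view: memory is shared
buffer[0, 0, 0] = 99
print(view[0, 0, 0])  # 99

# `tobytes` snapshots the data instead: later writes are not reflected
snapshot = buffer.tobytes()
buffer[0, 0, 0] = 7
print(snapshot[0])  # 99 ... the copy still holds the old value
```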
Conclusions
This was hard and I felt like documentation let me down 📚:
- The success of the `LIVE_STREAM` mode in MediaPipe was a fluke 🍀. I can’t be 100% confident in my explanation of the difference between it and `VIDEO` mode. It’s not documented.
- The OpenCV documentation also doesn’t demonstrate successful vs unsuccessful ‘in place’ usage; silent fails and requirements for contiguous memory are not in the OpenCV Python binding guide. Elsewhere, in the optimisation recommendations, it provides the potentially bad advice:
  Never make copies of an array unless it is necessary. Try to use views instead. Array copying is a costly operation.
  Using a view as an input argument will create a copy if the view is not contiguous!
- Numpy and Python’s assignment syntax is always a source of amusement. It honestly feels like a hack to use `y[:] =` to avoid memory allocations, and it’s unclear when it is preferred. Based on DuckDuckGo results this doesn’t seem like common usage.
That’s all I’ve got for now.
I’m going to return to linuxgreenscreen to cover another way to write using memory-mapped files supplied by v4l2. It makes a marginal difference in performance and paves the way for future work towards an OBS plugin.
👋