# Learning faster compositing the hard way
🎯 One of the goals for linuxgreenscreen was to use less than one core on my laptop without losing too much visual (and temporal) quality. I had never used OpenCV or any workflows like MediaPipe, and I hadn’t even used Python 🐍, so this was a big learning experience.
My initial effort used about three cores of my Ryzen 9 4900HS at 30 FPS for 1280x720; by the time I had finished, I was using less than 80% of one core 🥳.
Here’s a quick summary of the important lessons along the way (before I go into more detail):
- The `LIVE_STREAM` (asynchronous) running mode for MediaPipe’s segmentation used less than half the CPU compared to `VIDEO`; i.e. the CPU gets a nice rest 😴 (see the `mp.tasks.vision.RunningMode` documentation).
- MediaPipe’s automatic resizing for segmentation was about 20% more CPU hungry than manually applying OpenCV’s 📉.
- OpenCV’s in-place operations are brilliant, but you must supply contiguous-in-memory arrays.
- numpy can wonderfully elide some operations and recycle memory, but it seems haphazard to me.
With regard to the green screen effect, allocating and releasing storage for images was surprisingly expensive 💵 at higher frame rates. It was well worth the effort to understand those last two items in particular, so let’s discuss them further.
## In-place OpenCV operations
Many OpenCV image-processing operations can write their output in place via the `dst` argument. If a function does not have a `dst` argument then it cannot be performed in place.
📝 Here’s a quick example using `cv2.blur`:

```python
import contextlib
import os

import cv2
import numpy as np

# generate random input
im_shape = (2160, 3840, 3)
x = np.random.randint(low=0, high=255, size=im_shape, dtype=np.uint8)
# allocate space for output
y = np.zeros(im_shape, dtype=np.uint8)
# store the address of the data to check later
address_y = y.__array_interface__['data'][0]
# not necessary to redirect stdout, but keeps output clean for this example
with open(os.devnull, 'w') as devnull:
    with contextlib.redirect_stdout(devnull):
        cv2.blur(src=x, ksize=(1, 1), dst=y)
```
We can check that the address of `y` hasn’t changed after the call to `cv2.blur`:

```python
print(address_y - y.__array_interface__['data'][0])
# 0 ... address is the same before and after
```
The kernel of the blur operator is the identity in this example, so we can also check that `y` has the correct output (i.e. is a copy of `x`):

```python
print(np.sum(x - y))
# 0 ... x and y are now the same, as the (1x1) blur kernel is the identity
```
We can also set the output to be the same as the input:

```python
# now perform an actual blur on `y`
cv2.blur(src=y, ksize=(2, 2), dst=y)
print(address_y - y.__array_interface__['data'][0])
# 0 ... address still the same
print(np.sum(x - y))
# non-zero ... the blur worked!
```
## Painful lesson one: valid arguments
The API places the responsibility on us to ensure valid arguments are passed. Note:

- The array passed as `dst` must have the same shape and type as the usual return value (without `dst`), otherwise OpenCV will silently construct a new array for the output.
- A numpy array view:
  - cannot be used as a destination;
  - may be silently copied if used as a source.
The second item is related to whether the data is contiguous in memory or not. I cannot find an official source, but this answers.opencv.org post suggests:

> OpenCV APIs all need data in contiguous order. If you give it non-contiguous data from Python, the Python-to-OpenCV wrapper layer internally makes a COPY
If we operate on one channel of an image:

```python
z = cv2.blur(src=x[:, :, 0], ksize=(1, 1))
```

then OpenCV will copy and re-arrange the data in the view (created via `src=x[:,:,0]`) into a (new) contiguous block. On the other hand, you can operate on a row just fine:

```python
z = cv2.blur(src=x[0, :, :], ksize=(1, 1))
```
Contiguity can be checked via the `flags` attribute of a numpy array:

```python
x[0, :, :].flags['C_CONTIGUOUS']
# True
x[:, :, 0].flags['C_CONTIGUOUS']
# False
```
## In-place numpy assignment
numpy can do an in-place assignment via `y[:] = x` or `np.copyto(y, x)`. This usually performs significantly faster than `y = x.copy()`.

📝 Here’s an example where `np.copyto` performs faster:
```python
import cProfile

import numpy as np

# generate random input
im_shape = (900, 1600, 3)
foo = np.random.randint(low=0, high=256, size=im_shape, dtype=np.uint8)
# pre-allocated buffer
y_buf = np.ndarray(im_shape, dtype=np.uint8)

def copy_to_local(x, y_buf):
    y = x.copy()

def copy_to_buffer(x, y_buf):
    np.copyto(y_buf, x)

with cProfile.Profile() as pr:
    for j in np.arange(5000):
        # if we remove the .copy(), the performance _within_ the functions
        # above changes drastically
        copy_to_local(foo.copy(), y_buf)
        copy_to_buffer(foo.copy(), y_buf)
pr.print_stats(sort=2)
```
Depending on whether a copy was passed or not, the per-frame time elapsed in each `copy_*` was:

| argument     | `to_local` | `to_buffer` |
|--------------|------------|-------------|
| `foo.copy()` | 1.48ms     | 0.75ms      |
| `foo`        | 0.46ms     | 0.45ms      |
## Painful lesson two: inconsistent results
Sometimes, using a smaller array, e.g. 600x800x3, I observed equal performance in all scenarios if I switched `np.copyto(y_buf, x)` to `y_buf[:] = x`! I prefer `np.copyto` from the perspective of clarity; however, `y[:] =` seems marginally better in some cases, and I don’t understand this result clearly.
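Whichever form you prefer, both really do reuse the destination buffer. Here is a small numpy-only check (my own illustration, not the benchmark above) that neither `y[:] =` nor `np.copyto` reallocates, while rebinding to `x.copy()` does:

```python
import numpy as np

src = np.random.randint(0, 256, size=(600, 800, 3), dtype=np.uint8)
dst = np.empty_like(src)
addr_before = dst.__array_interface__['data'][0]

dst[:] = src            # slice assignment writes into dst's existing memory
np.copyto(dst, src)     # same effect; arguably clearer about the intent
addr_after = dst.__array_interface__['data'][0]

rebound = src.copy()    # plain rebinding allocates a brand-new array
```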
## Painful lesson three: vectorisation
Another quirk is vectorisation and what I call “work in contiguous”. Sometimes it’s faster to copy non-contiguous memory into a contiguous block, perform vectorised in-place operations, then copy back to the non-contiguous destination. numpy’s elision can be used here, too. Here’s a trivial example where we want to add then multiply one slice of an array:
```python
# `foo` and `y_buf` are as defined in the previous example
foo_slice = foo[:, :, 2]
y_slice = y_buf[:, :, 2]
work_array = np.ndarray(foo_slice.shape, dtype=np.uint8)

def work_non_continuous(x, y_slice):
    np.copyto(y_slice, x)
    y_slice += 1
    y_slice *= 2

def work_continuous(x, y_slice, work):
    np.copyto(work, x)
    work += 1
    work *= 2
    np.copyto(y_slice, work)

def elided_work(x, y_slice):
    z = 2 * (x + 1)
    np.copyto(y_slice, z)

with cProfile.Profile() as pr:
    for j in np.arange(5000):
        work_non_continuous(foo_slice, y_slice)
        work_continuous(foo_slice, y_slice, work_array)
        elided_work(foo_slice, y_slice)
pr.print_stats(sort=2)
```
Can you guess which is quicker? These are the times per call for each function on my machine:
| `work_non_continuous` | `work_continuous` | `elided_work` |
|-----------------------|-------------------|---------------|
| 3.85ms                | 1.90ms            | 2.35ms        |
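As an aside, the same add-then-multiply can also be phrased with the ufunc `out=` argument, which writes straight into the pre-allocated contiguous scratch buffer without the augmented-assignment syntax. This variant is my own sketch and wasn’t part of the benchmark above, so measure it on your own data:

```python
import numpy as np

foo = np.random.randint(0, 100, size=(900, 1600, 3), dtype=np.uint8)
y = np.empty_like(foo)
foo_slice = foo[:, :, 2]          # non-contiguous source view
y_slice = y[:, :, 2]              # non-contiguous destination view
work = np.empty(foo_slice.shape, dtype=np.uint8)  # contiguous scratch

# add then multiply entirely within the contiguous scratch buffer,
# then copy the result back to the non-contiguous destination
np.add(foo_slice, 1, out=work)
np.multiply(work, 2, out=work)
np.copyto(y_slice, work)
```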
## Numpy array to virtual video (bonus)
linuxgreenscreen outputs to a virtual v4l2loopback device using `write`, which creates another opportunity for inadvertent copies. In the video examples that follow, we’ll use the interface to `write` provided by the v4l2py Python module, which is also used by linuxgreenscreen.

Firstly, create a virtual device with id `10`, e.g. `modprobe v4l2loopback devices=1 video_nr=10`. To open the device for output using v4l2py:
```python
from v4l2py.device import BufferType, Device, PixelFormat, VideoOutput

(width, height) = (1920, 1080)
sink = Device.from_id(10)
sink.open()
sink.set_format(BufferType.VIDEO_OUTPUT, width, height, PixelFormat.RGB24)
sink_video = VideoOutput(sink)
```
To benchmark, we’ll use a function that repeatedly calls another writing function (`write_fn`) a total of `n` times with a short break between:

```python
import time

import numpy as np

def run_n_frame(n, write_fn):
    for x in np.arange(0, n):
        buffer = np.random.randint(
            0, high=256, size=(height, width, 3), dtype=np.uint8
        )
        write_fn(buffer)
        time.sleep(1 / 60)
```
The first `write_fn` that we’ll pass will convert the numpy array using `tobytes` and pass the result to `sink_video.write()`:

```python
def write_via_tobytes(buffer):
    sink_video.write(buffer.tobytes())
```
Before we look at another way to do this, the `tobytes` documentation tells us that this will create a copy of the data:

> Constructs Python bytes showing a copy of the raw contents of data memory.
We’d like to avoid a copy and pass the underlying array data directly to `write`. One option is to use the `data` attribute, another is to use ctypes as suggested by an answer to ‘How to convert numpy arrays to bytes/BytesIO without copying?’:

```python
import ctypes

def write_via_data_attr(buffer):
    sink_video.write(buffer.data)

def write_via_ctypes(buffer):
    buffer_ctype = ctypes.c_char * (width * height * 3)
    memory_block = buffer_ctype.from_address(buffer.ctypes.data)
    sink_video.write(memory_block)
```
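The reason `.data` avoids the copy is that it is a `memoryview` over the array’s own buffer, while `tobytes()` materialises a brand-new `bytes` object. A numpy-only check of this (my own illustration; no v4l2 device required):

```python
import numpy as np

frame = np.random.randint(0, 256, size=(1080, 1920, 3), dtype=np.uint8)

view = frame.data        # zero-copy memoryview over frame's own buffer
raw = frame.tobytes()    # materialises a full copy as a bytes object

# mutating the array is visible through the view, but not through the copy
frame[0, 0, 0] = 255 - frame[0, 0, 0]
```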
Using `cProfile.run('run_n_frame(200, <each write fn>)')`, the elapsed time for each method (per frame) was:
| `tobytes` | `data` | `ctypes` |
|-----------|--------|----------|
| 5.08ms    | 1.02ms | 1.08ms   |
This confirms that `tobytes` creates more work via a copy. I can’t think of any reason to prefer the `ctypes` approach over using the `data` attribute, so I would recommend the latter as it appears simpler.
## Conclusions
This was hard, and the documentation 📚 was patchy at best. In summary:

- The success of the `LIVE_STREAM` mode in MediaPipe was a fluke 🍀. It is not documented that this mode should use less CPU.
- The OpenCV documentation doesn’t demonstrate successful vs unsuccessful ‘in place’ usage, and the requirements for contiguous memory are not in the OpenCV Python binding guide. Elsewhere, in its optimisation recommendations, it gives the potentially bad advice:

  > Never make copies of an array unless it is necessary. Try to use views instead. Array copying is a costly operation.

  Note: we have shown that using a view as an input argument will create a copy if the view is not contiguous ❗
- numpy and Python’s assignment syntax is a source of amusement. `copyto` can degrade the performance of out-of-place operations in other parts of the code; `y[:] =` seems more reliable.
- numpy appears to take advantage of vectorisation when operations are on contiguous blocks of memory; eliding, or copying to a contiguous block (then copying back, provided everything is re-used), can be faster. None of this is documented clearly.
That’s all I’ve got for now.
Update: I was going to return to linuxgreenscreen to cover using memory-mapped buffers supplied by v4l2 to reduce the number of copy operations further; however, I contacted Tiago Coutinho and we worked out how to incorporate this into linuxpy. I’ll revisit this when I start work on an OBS plugin.
👋