How to use Python and OpenCV with multiprocessing?


Solution 1

After reading some SO posts, I've come up with a way to use OpenCV in Python 3 with multiprocessing. I recommend doing this on Linux, because according to this post, forked processes share memory with their parent as long as the content is not changed (copy-on-write). Here's a minimal example:

import cv2
import multiprocessing as mp
import numpy as np
import psutil

img = cv2.imread('test.tiff', cv2.IMREAD_ANYDEPTH)  # here I'm using an indexed 16-bit tiff as an example.
num_processes = 4
kernel_size = 11
tile_size = img.shape[0] // num_processes  # Assuming img.shape[0] is divisible by num_processes

output = mp.Queue()

def mp_filter(x, output):
    print(psutil.virtual_memory())  # monitor memory usage
    output.put((x, cv2.GaussianBlur(img[img.shape[0] // num_processes * x:img.shape[0] // num_processes * (x + 1), :],
               (kernel_size, kernel_size), kernel_size / 5)))
    # note that you actually have to process a slightly larger block and leave out the border
    # (a padded sketch follows below).

if __name__ == '__main__':
    processes = [mp.Process(target=mp_filter, args=(x, output)) for x in range(num_processes)]

    for p in processes:
        p.start()

    result = []
    for ii in range(num_processes):
        result.append(output.get(True))

    for p in processes:
        p.join()
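
A rough sketch of the border handling mentioned in the comment above (the pad amount and the trimming are my own illustration, not part of the original answer): each worker blurs a block that is extended by half the kernel size on each side and then drops those extra rows again, so the tile seams match a blur of the full image.

def mp_filter_padded(x, output):
    pad = kernel_size // 2
    rows = img.shape[0] // num_processes
    # extend the block by `pad` rows on each side, clamped to the image bounds
    lower = max(rows * x - pad, 0)
    upper = min(rows * (x + 1) + pad, img.shape[0])
    blurred = cv2.GaussianBlur(img[lower:upper, :],
                               (kernel_size, kernel_size), kernel_size / 5)
    # drop the padding so only this worker's own rows go into the queue
    top = rows * x - lower
    output.put((x, blurred[top:top + rows, :]))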

Instead of using a Queue, another way to collect the results from the processes is to create a shared array through the multiprocessing module (this requires importing ctypes):

result = mp.Array(ctypes.c_uint16, img.shape[0]*img.shape[1], lock = False)

Then each process can write to different portions of the array, assuming there is no overlap. Creating a large mp.Array is surprisingly slow, however, which defeats the purpose of speeding up the operation. So use it only when the added time is small compared with the total computation time. This array can be turned into a numpy array by:

result_np = np.frombuffer(result, dtype=ctypes.c_uint16)
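
A rough sketch of that approach (the `shared_filter` helper and the reshape are my own illustration, not from the original answer): each worker views the `result` array created above as a 2D numpy image and fills in its own rows, which works on Linux because the children are forked after `result` exists and therefore inherit it.

rows = img.shape[0] // num_processes

def shared_filter(x):
    # no-copy view of the shared buffer as a 2D uint16 image
    out = np.frombuffer(result, dtype=np.uint16).reshape(img.shape)
    # each worker writes only its own block of rows, so no locking is needed
    out[rows * x:rows * (x + 1), :] = cv2.GaussianBlur(
        img[rows * x:rows * (x + 1), :],
        (kernel_size, kernel_size), kernel_size / 5)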

Solution 2

I don't know what types of filters you need, but if it's reasonably simple, you could consider libvips. It's an image processing system for very large images (larger than the amount of memory you have). It came out of a series of EU-funded scientific art imaging projects, so the focus is on the types of operation you need for image capture and comparison: convolution, rank, morphology, arithmetic, colour analysis, resampling, histograms, and so on.

It's fast (faster than OpenCV, on some benchmarks at least), needs little memory, and there's a high-level Python binding. It works on Linux, OS X and Windows. It handles all the multiprocessing for you automatically.
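
For a sense of what that looks like, a Gaussian blur with the pyvips binding might be roughly this (the file names and the sigma are placeholders); with access='sequential' the image is streamed in strips rather than decoded whole, and libvips spreads the work across cores on its own:

import pyvips

# stream the image instead of loading it fully into memory
image = pyvips.Image.new_from_file('test.tiff', access='sequential')

# gaussblur takes the sigma of the Gaussian; threading is handled by libvips
image.gaussblur(2.0).write_to_file('blurred.tiff')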

Solution 3

This can be done cleanly with Ray, which is a library for parallel and distributed Python. Ray reasons about "tasks" instead of using a fork-join model, which gives some additional flexibility (e.g., you can put values in shared memory even after forking worker processes), the same code runs on multiple machines, you can compose tasks together, etc.

import cv2
import numpy as np
import ray

num_tasks = 4
kernel_size = 11


@ray.remote
def mp_filter(image, i):
    lower = image.shape[0] // num_tasks * i
    upper = image.shape[0] // num_tasks * (i + 1)
    return cv2.GaussianBlur(image[lower:upper, :],
                            (kernel_size, kernel_size), kernel_size // 5)


if __name__ == '__main__':
    ray.init()

    # Load the image and store it once in shared memory.
    image = np.random.normal(size=(1000, 1000))
    image_id = ray.put(image)

    result_ids = [mp_filter.remote(image_id, i) for i in range(num_tasks)]
    results = ray.get(result_ids)

Note that you can store more than just numpy arrays in shared memory; you also benefit if you have Python objects that contain numpy arrays (like dictionaries containing numpy arrays). Under the hood, this uses the Plasma shared-memory object store and the Apache Arrow data layout.
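
As a small illustration (the names below are hypothetical), a dictionary of arrays can be put into the object store once and then read by any number of tasks without copying:

channels = {'red': np.zeros((1000, 1000)), 'green': np.ones((1000, 1000))}
channels_id = ray.put(channels)   # stored once in shared memory

@ray.remote
def channel_sum(channels):
    # the arrays arrive as read-only views backed by the object store
    return channels['red'].sum() + channels['green'].sum()

print(ray.get(channel_sum.remote(channels_id)))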

You can read more in the Ray documentation. Note that I'm one of the Ray developers.


Comments

  • user3667217
    user3667217 over 1 year

    I'm using Python 3.4.3 and OpenCV 3.0.0 to process (applying various filters to) a very large image (80,000 x 60,000) in memory, and I'd like to use multiple CPU cores to improve performance. After some reading, I arrived at two possible methods: 1) use Python's multiprocessing module, let each process deal with a slice of the large image, and join the results after processing is done (and this should probably be performed on a POSIX system?); 2) since NumPy supports OpenMP and OpenCV uses NumPy, I can just leave the multiprocessing to NumPy?

    So my question is:

    Which one will be a better solution? (If they don't seem reasonable, what would be a possible approach?)

    If Option 2 is good, should I build both NumPy and OpenCV with OpenMP? How would I actually make multiprocessing happen? (I couldn't really find useful instructions...)

  • user3667217
    user3667217 over 8 years
    Thanks for the input. In terms of filtering, I perform Gaussian and Laplacian of Gaussian on the image, which I believe VIPS will perform very well. I didn't go with VIPS, though, because I also perform other operations on the image, such as logical operations between images. NumPy makes this easier, and this is the reason I chose OpenCV rather than VIPS.
  • jcupitt
    jcupitt over 8 years
    By logical operations, do you mean pixelwise and/or/eor? libvips has that, eg. a = (a << 8) & b ^ c etc.
  • user3667217
    user3667217 over 8 years
    Yes. Thank you for your comment. I actually tried VIPS. I could only make image convolution work, however. It indeed performs much better. I've started a new question here: stackoverflow.com/questions/33195055/… Also, I managed to make multiprocessing work with OpenCV, and I'll post an answer.
  • jcupitt
    jcupitt over 8 years
    Hi, I posted some stuff on your new question, hope it helps.
  • Eilyre
    Eilyre almost 7 years
    GaussianBlur not GaussianBlue. Can't suggest an edit due to StackOverflow's limit of 6 characters.
  • Mooncrater
    Mooncrater almost 6 years
    @Eilyre A fiddle here and there. Done.
  • ZF007
    ZF007 about 5 years
    Any benchmark tests against the default multiprocessing in Python (3.x)? If so, share a link. (Not interested in Pong.)
  • Robert Nishihara
    Robert Nishihara about 5 years
    I don't have a link, but the biggest difference will be for large numerical data. E.g., if x = np.zeros(10**8) and f is the identity function (or identity remote function in Ray), then on my laptop, %time pool.apply_async(f, args=(x, )).get() takes 15 seconds and %time ray.get(f.remote(x)) takes 1.5 seconds.
  • jtlz2
    jtlz2 over 4 years
    @RobertNishihara Out of left field: Do you know of anyone calling Ray from within Flask?