Can one OpenCL device host multiple users on different threads?

From personal experience, I can say that there are no problems with multiple threads using the same device, with a context shared between the threads. Here are some ideas on it:

Create multiple kernels from a single program - one kernel per thread. Quote from Khronos:

clSetKernelArg is safe to call from any host thread, and is safe to call re-entrantly so long as concurrent calls operate on different cl_kernel objects

However, creating a separate command queue for each thread may not be reasonable: the driver thread will have a hard time handling too many queues, and this can hurt your application's performance badly.

If you need to marshal access to shared GPU data between threads, you can dice up a big shared OpenCL memory object into multiple (possibly overlapping) sub-objects.

Hope it helps you.

Author by sedona2222

Updated on November 28, 2022

Comments

  • sedona2222
    sedona2222 over 1 year

    We're using Intel OpenCL 1.2 inside a large commercial program, running on a single Intel Haswell CPU/GPU. Conceivably, a number of threads may want to use the GPU for different functions at different times.

    So my questions:

    1. Is it a good idea at all to allow multiple users on a single device? What complications might we face?

    2. I was considering setting up a common context against the device and platform for all users. They would then set up their own programs, kernels and queues. But I'm nervous about device behaviour: can we really create non-interacting silos of buffers, programs, queues, kernels and kernel args on one context? At the very least, I see clSetKernelArg is not thread safe.

  • DarkZeros
    DarkZeros almost 9 years
    Experience addition: there is no performance difference between one queue per thread and a single queue shared by all threads. A high number of queues reduces performance, but so do the lock mechanisms in a single queue accessed by many threads. So go for whichever is simplest for you.
  • Dithermaster
    Dithermaster almost 9 years
    I concur with @DarkZeros; for a thread count even in the dozens I'd still use a command queue per thread. It will allow the GPU to overlap data transfers and compute, and will even allow concurrent compute on some GPUs.
  • Roman Arzumanyan
    Roman Arzumanyan almost 9 years
    Imagine what will happen with 40 worker threads and 40 command queues. Four command queues are enough: device_to_host, host_to_device, device_to_device, and one for kernel execution.
  • Lubo Antonov
    Lubo Antonov almost 9 years
    In my work, I launch several processes that use the same GPU. The kernel is compute-intensive. Note that in this case everything is duplicated - kernels, queues, program objects, etc. I don't see any adverse effect from this in the scaling - scaling flattens out once all compute units are fully saturated by wavefronts. The driver seems perfectly capable of handling all the parallelism - which is to be expected.
  • Roman Arzumanyan
    Roman Arzumanyan almost 9 years
    If the CPU is powerful enough, that wouldn't be a problem. However, on platforms with a weak CPU, I've faced this issue a couple of times.