Can one OpenCL device host multiple users on different threads?

From personal experience, I can say that there are no problems with multiple threads using the same device, with a context shared between the threads. Here are some ideas on it:

Create multiple kernels from a single program - one kernel per thread. Quote from Khronos:

clSetKernelArg is safe to call from any host thread, and is safe to call re-entrantly so long as concurrent calls operate on different cl_kernel objects

However, creating a separate command queue for each thread may not be reasonable: the driver thread will have a hard time handling too many queues, and this can hurt your application's performance badly.

If you need to marshal access to shared GPU data between threads, you can dice up a big shared OpenCL memory object into multiple (possibly overlapping) sub-objects.

Hope it helps you.

Author by sedona2222

Updated on November 28, 2022

Comments

  • sedona2222
    sedona2222 over 1 year

    We're using Intel OpenCL 1.2 inside a large commercial program, running on a single Intel Haswell CPU/GPU. Conceivably, a number of threads may want to use the GPU for different functions at different times.

    So my questions:

    1. Is it a good idea at all to allow multiple users on a single device? What complications might we face?

    2. I was considering setting up a common context against the device and platform for all users. They would then set up their own programs, kernels and queues. But I'm nervous about device behaviour: can we really create non-interacting silos of buffers, programs, queues, kernels and kernel args on one context? At the very least, I see clSetKernelArg is not thread safe.

  • DarkZeros
    DarkZeros almost 9 years
    Experience addition: there is no performance difference between one queue per thread and a single queue shared by all threads. A high number of queues reduces performance, but so do the lock mechanisms in a single queue accessed by many threads. So go for whichever is simplest for you.
  • Dithermaster
    Dithermaster almost 9 years
    I concur with @DarkZeros; for a thread count even in the dozens I'd still use a command queue per thread. It will allow the GPU to overlap data transfers and compute, and will even allow concurrent compute on some GPUs.
  • Roman Arzumanyan
    Roman Arzumanyan almost 9 years
    Imagine what will happen with 40 worker threads and 40 command queues. Four command queues are enough: device_to_host, host_to_device, device_to_device, and one for kernel execution.
  • Lubo Antonov
    Lubo Antonov almost 9 years
    In my work, I launch several processes that use the same GPU. The kernel is compute-intensive. Note that in this case everything is duplicated - kernels, queues, program objects, etc. I don't see any adverse effect from this in the scaling - scaling flattens out once all compute units are fully saturated by wavefronts. The driver seems perfectly capable of handling all the parallelism - which is to be expected.
  • Roman Arzumanyan
    Roman Arzumanyan almost 9 years
    If the CPU is powerful enough, that wouldn't be a problem. However, on platforms with a weak CPU, I've faced this issue a couple of times.