OpenCL for loop execution model

Solution 1

1) How exactly are for-loops executed in OpenCL? I know that all work-items run the same code and that work-items within a work group try to execute in parallel. So if I run a for loop in OpenCL, does that mean all work-items run the same loop, or is the loop somehow divided up to run across multiple work items, with each work item executing a part of the loop (i.e. work item 1 processes indices 0 ~ 9, item 2 processes indices 10 ~ 19, etc.)?

You are right: all work items run the same code, but note that they do not necessarily run it at the same pace. They run the same code only logically. In hardware, the work items inside the same wave (AMD terminology) or warp (NVIDIA terminology) execute in lockstep, following exactly the same footprint at the instruction level.

At the assembly level, a loop is nothing more than a few branch instructions. Threads in the same wave execute a branch instruction together. If all work items evaluate the branch condition the same way, they follow the same path and keep running in parallel. If they do not agree on the condition, execution typically diverges. For example, in the code below:

if (condition)
   do_a();
else
   do_b();

Logically, the work items that meet the condition execute do_a(), while the others execute do_b(). In reality, however, the work items in a wave execute in exactly the same step in hardware, so it is impossible for them to run different code in parallel. Instead, while the wave executes do_a(), the work items that took the else path are masked out; when do_a() finishes, the wave moves on to do_b(), and this time the work items that took the if path are masked out. For either function, only a subset of the work items is active.

Back to the loop question: since a loop is a branch operation, if the loop condition is true for only some work items, the situation above occurs, with some work items executing the loop body while the others are masked out. In your code, however:

for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
      pblock[ti] = pos_old[jb*nt+ti]; /* Cache ONE particle position */
      barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in the work-group */

      for(int j=0; j<nt; j++) { /* For ALL cached particle positions ... */

The loop condition does not depend on the work-item IDs, which means all work items evaluate exactly the same condition, so they follow the same execution path and run in parallel the whole time.

2) In this code snippet, how do the outer and inner loops execute? Does OpenCL know that the outer loop is dividing the work among all the work groups and that the inner loop is trying to divide the work among work-items within each work group?

As described in the answer to (1), since the loop conditions of the outer and inner loops are the same for all work items, they always run in parallel.

Workload distribution in OpenCL is entirely up to the developer. OpenCL knows nothing about how to divide the workload among work groups and work items; you partition it yourself by assigning different data and operations based on the global or local work-item ID. For example,

unsigned int gid = get_global_id(0);
buf[gid] = input1[gid] + input2[gid];

this code has each work item fetch two values from consecutive memory locations and store its result back to consecutive memory.
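To make the "you divide it yourself" point concrete, here is a host-side sketch of two common partitioning schemes, with the implicit per-work-item execution written out as an explicit loop (gid stands in for get_global_id(0)). The sizes and function names are illustrative only:

```c
#include <assert.h>

#define N 8
#define ITEMS 4  /* pretend global work size for scheme 2; must divide N */

/* Scheme 1: one element per work item (global size == N),
 * exactly like the buf[gid] = input1[gid] + input2[gid] snippet. */
void one_per_item(const int *in1, const int *in2, int *out) {
    for (int gid = 0; gid < N; gid++)      /* each gid is one work item */
        out[gid] = in1[gid] + in2[gid];
}

/* Scheme 2: each work item owns a contiguous chunk of N/ITEMS elements,
 * i.e. item 0 handles indices 0..1, item 1 handles 2..3, and so on. */
void chunk_per_item(const int *in1, const int *in2, int *out) {
    const int chunk = N / ITEMS;
    for (int gid = 0; gid < ITEMS; gid++)  /* each gid is one work item */
        for (int i = gid * chunk; i < (gid + 1) * chunk; i++)
            out[i] = in1[i] + in2[i];
}
```

Both schemes compute the same sums; they differ only in how the index space maps onto work items, and OpenCL itself never makes that choice for you.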

3) If the inner loop is divided among the work-items (meaning that the code within the for loop is executed in parallel, or at least attempted to be), how does the addition at the end work? It is essentially doing a = a + f*d, and from my understanding of pipelined processors, this has to be executed sequentially.

     float4 d = p2 - p;
     float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
     float f = p2.w*invr*invr*invr;
     a += f*d; /* Accumulate acceleration */

Here, a, f and d are declared in the kernel code without an address-space qualifier, which makes them private to each work item. On a GPU, these variables are first assigned to registers; however, registers are typically a very limited resource, so when they run out, the variables are placed in private memory instead. This is called register spilling. Depending on the hardware it is implemented in different ways; on some platforms private memory is backed by global memory, so any register spilling causes severe performance degradation.

Since these variables are private, all the work items still run in parallel, and each work item maintains and updates its own a, f and d without interfering with the others.
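The point about private accumulators can be sketched on the host. NT is the work-group size, and the float4 math is simplified to a scalar product as a stand-in for a += f*d; all values here are illustrative:

```c
#include <assert.h>

#define NT 4  /* work-group size */

/* Every work item runs the same NT-iteration loop in lockstep, but each
 * accumulates into its OWN private slot a[ti]; no item touches another's. */
void inner_loop(const float *cached, const float *p, float *a) {
    for (int j = 0; j < NT; j++)         /* same trip count in every item */
        for (int ti = 0; ti < NT; ti++)  /* the NT lanes running in step  */
            a[ti] += cached[j] * p[ti];  /* stand-in for a += f*d         */
}
```

Because each a[ti] is independent, the sequential nature of a = a + f*d only serializes the iterations *within* one work item; across work items the accumulations proceed in parallel.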

Solution 2

Heterogeneous programming follows a work-distribution model: each thread gets its portion of the work and operates on it.

1.1) As you know, threads are organized into work-groups (thread blocks), and in your case each thread in a work-group brings one element of data from global memory into local memory:

for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
      pblock[ti] = pos_old[jb*nt+ti];

//I assume pblock is local memory

1.2) Now all threads in the work-group have the data they need in their local storage (no more trips to global memory for it).

1.3) Now comes the processing. Look carefully at the for loop where the processing takes place:

for(int j=0; j<nt; j++) {

which runs nt times, once per cached element (nt is the work-group size, not the number of thread blocks). Every thread scans the same cached tile, but each combines it with its own data, so the threads still produce separate results.
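The whole blocking scheme from the question can be walked through on the host. Here pblock plays the role of the __local buffer, the barriers are marked as comments, and the physics is reduced to a plain sum; NT and NB are illustrative sizes:

```c
#include <assert.h>

#define NT 2          /* work-group size */
#define NB 3          /* number of tiles (nb = n/nt) */
#define N  (NT * NB)  /* total elements  */

/* Process N elements in NB tiles of NT: each outer iteration the whole
 * work-group cooperatively refills the tile, then EVERY work item loops
 * over ALL NT cached entries and folds them into its private sum[ti]. */
void tiled_sum(const float *pos_old, float *sum /* one slot per item */) {
    float pblock[NT];  /* simulated local memory */

    for (int jb = 0; jb < NB; jb++) {
        for (int ti = 0; ti < NT; ti++)          /* cooperative load:    */
            pblock[ti] = pos_old[jb * NT + ti];  /* each item caches ONE */
        /* barrier(CLK_LOCAL_MEM_FENCE) would sit here */
        for (int ti = 0; ti < NT; ti++)          /* every item then reads */
            for (int j = 0; j < NT; j++)         /* the WHOLE tile        */
                sum[ti] += pblock[j];
        /* ...and a second barrier here, before the tile is overwritten */
    }
}
```

Running this, every work item ends up having visited all N elements, which is exactly the point of the tiling: the loads are divided among the threads, but the processing is not.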

1) A for loop is just another C statement to OpenCL, and every thread executes it as written; it is up to you how you divide the work (as in point 1.1). OpenCL does nothing with your loop internally.

2) OpenCL knows nothing about your code; how the loops divide the work is entirely your doing.

3) Same as point 1: the inner loop is not divided among the threads; all threads execute it as written. The only difference is that each thread points at the data it wants to process.

I suspect the confusion comes from jumping into this code before learning about work-groups and local memory. I suggest looking at the initial version of this code, which does not use local memory at all.

Solution 3

How exactly are for-loops executed in OpenCL?

  • They can be unrolled automatically into pages of code, which can make the kernel faster or slower. The scalar ALU (SALU) handles the loop counters, so nesting loops increases SALU pressure, and with more than about 9-10 nested loops this becomes a bottleneck (an intelligent scheme that reuses one counter across loops might help). Mixing some vector ALU (VALU) instructions into the loop body, rather than issuing only SALU work, is therefore a plus.

  • They run in parallel in SIMD fashion, so all threads' loops are locked to each other unless there is branching or a memory operation. If one thread's loop is adding something, every other thread's loop is adding too, and threads that finish early wait for the last one. When they have all finished, they continue to the next instruction together (again, unless there is branching or a memory operation). If there is no local/global memory operation, you don't need synchronization. This is SIMD, not MIMD, so it is inefficient when the loops are not doing the same thing in all threads.

In this code snippet, how does the outer and inner loops execute?

  • nb and nt are constants and are the same for all threads, so all threads do the same amount of work.

If the inner loop is divided among the work-items

Author: tincan

Updated on June 04, 2022

Comments

  • tincan, almost 2 years ago:

    I'm currently learning OpenCL and came across this code snippet:

    int gti = get_global_id(0);
    int ti = get_local_id(0);
    
    int n = get_global_size(0);
    int nt = get_local_size(0);
    int nb = n/nt;
    
    for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
          pblock[ti] = pos_old[jb*nt+ti]; /* Cache ONE particle position */
          barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in the work-group */
    
          for(int j=0; j<nt; j++) { /* For ALL cached particle positions ... */
             float4 p2 = pblock[j]; /* Read a cached particle position */
             float4 d = p2 - p;
             float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
             float f = p2.w*invr*invr*invr;
             a += f*d; /* Accumulate acceleration */
          }
    
          barrier(CLK_LOCAL_MEM_FENCE); /* Wait for others in work-group */
    }
    

    Background info about the code: This is part of an OpenCL kernel in a NBody simulation program. The entirety of the code and tutorial can be found here.

    Here are my questions (mainly to do with the for loops):

    1. How exactly are for-loops executed in OpenCL? I know that all work-items run the same code and that work-items within a work group try to execute in parallel. So if I run a for loop in OpenCL, does that mean all work-items run the same loop, or is the loop somehow divided up to run across multiple work items, with each work item executing a part of the loop (i.e. work item 1 processes indices 0 ~ 9, item 2 processes indices 10 ~ 19, etc.)?

    2. In this code snippet, how do the outer and inner loops execute? Does OpenCL know that the outer loop is dividing the work among all the work groups and that the inner loop is trying to divide the work among work-items within each work group?

    3. If the inner loop is divided among the work-items (meaning that the code within the for loop is executed in parallel, or at least attempted to be), how does the addition at the end work? It is essentially doing a = a + f*d, and from my understanding of pipelined processors, this has to be executed sequentially.

    I hope my questions are clear enough and I appreciate any input.