Reducing Number of Registers Used in CUDA Kernel


Solution 1

Occupancy can be a little misleading, and 100% occupancy should not be your primary target. If you can get fully coalesced accesses to global memory, then on a high-end GPU 50% occupancy will be sufficient to hide the latency to global memory (for floats, even lower for doubles). Check out the Advanced CUDA C presentation from GTC last year for more information on this topic.

In your case, you should measure performance both with and without -maxrregcount set to 16. The latency to local memory should be hidden as a result of having sufficient threads, assuming you don't randomly access into local arrays (which would result in non-coalesced accesses).

To answer your specific question about reducing registers, post the code for more detailed answers! Understanding how compilers work in general may help, but remember that nvcc is an optimising compiler with a large parameter space, so minimising register count has to be balanced with overall performance.
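Besides the global -maxrregcount compiler flag, CUDA also lets you constrain registers per kernel with the __launch_bounds__ qualifier; a minimal sketch (kernel name and bounds are illustrative):

```cuda
// Sketch: capping register usage per kernel via __launch_bounds__ rather
// than the global -maxrregcount flag. The compiler limits registers per
// thread so that 256-thread blocks with at least 4 resident blocks per SM
// remain achievable.
__global__ void
__launch_bounds__(256, 4)   // max threads per block, min blocks per SM
scale(float *out, const float *in, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx] * factor;
}
```

Compiling with `nvcc --ptxas-options=-v` reports the actual register count per kernel, so you can see whether a change really moved you from 17 to 16.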

Solution 2

It's really hard to say; the nvcc compiler is not very smart, in my opinion.
You can try the obvious things: use short instead of int, pass and use variables by reference (e.g. &variable), unroll loops, use templates (as in C++). If you have divisions or transcendental functions applied in sequence, try to restructure them as a loop. Try to get rid of conditionals, possibly replacing them with redundant computations.
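Two of these suggestions can be sketched in a toy kernel (the kernel name, trip counts, and clamp are illustrative, not from the question):

```cuda
// Sketch: forcing loop unrolling and replacing a conditional with a
// branchless computation, two of the register/performance tricks above.
__global__ void example(float *out, const float *in, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float acc = 0.0f;
    #pragma unroll 4              // hint nvcc to unroll this loop 4x
    for (int i = 0; i < 16; ++i)
        acc += in[(idx + i) % n];

    // Branchless clamp instead of an if/else chain:
    out[idx] = fminf(fmaxf(acc, 0.0f), 1.0f);
}
```

Whether any of these actually lowers the register count depends on the compiler version, so check the ptxas output after each change.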

If you post some code, maybe you will get specific answers.

Solution 3

Utilizing shared memory as a cache may lead to less register usage and prevent registers from spilling to local memory.

Suppose the kernel calculates some values, and these calculated values are used by all of the threads:

__global__ void kernel(...) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int id0 = blockDim.x * blockIdx.x;

    int reg = id0 * ...;
    int reg0 = reg * a / x + y;


    ...

    int val =  reg + reg0 + 2 * idx;

    output[idx] = val > 10;
}

So, instead of keeping reg and reg0 in registers and letting them possibly spill out to local memory (global memory), we may use shared memory:

__global__ void kernel(...) {
    __shared__ int cache[10];

    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    if (threadIdx.x == 0) {
      int id0 = blockDim.x * blockIdx.x;

      cache[0] = id0 * ...;
      cache[1] = cache[0] * a / x + y;
    }
    __syncthreads();


    ...

    int val =  cache[0] + cache[1] + 2 * idx;

    output[idx] = val > 10;
}

Take a look at this paper for further information.

Solution 4

It is not generally a good approach to minimize register pressure. The compiler does a good job of optimizing overall projected kernel performance, and it takes lots of factors into account, including register pressure.

Why can reducing registers cause slower speed?

Most probably the compiler had to spill data that no longer fit in registers into "local" memory, which is essentially the same as global memory, and thus very slow.

For optimization purposes I would recommend using keywords like const, volatile and so on where necessary, to help the compiler in its optimization phase.
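As a sketch of this advice, marking pointer arguments const and __restrict__ tells nvcc the buffers don't alias, which can enable read-only cache loads and better instruction scheduling (kernel name is illustrative; this is the standard SAXPY pattern, not code from the question):

```cuda
// Sketch: const + __restrict__ qualifiers give the compiler aliasing
// guarantees it can exploit during optimization.
__global__ void saxpy(float * __restrict__ y,
                      const float * __restrict__ x,
                      float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```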

Anyway, it is usually not tiny issues like register counts that make CUDA kernels run slow. I'd recommend optimizing the work with global memory: the access pattern, caching in texture memory if possible, and transfers over PCIe.

Solution 5

The increase in instruction count when lowering register usage has a simple explanation. The compiler may be using registers to store the results of operations that are used more than once in your code, in order to avoid recalculating those values. When forced to use fewer registers, the compiler instead recalculates the values that would otherwise have been kept in registers.
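As a hypothetical illustration of this trade-off (names and arithmetic are invented for the example, not taken from the question):

```cuda
// Sketch: with enough registers, the compiler keeps the common
// subexpression a[i] * b[i] in one register and reuses it. Under a tight
// register cap it may instead re-issue the multiply at each use site:
// more instructions executed, but one fewer live register.
__global__ void reuse(float *out, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float p = a[i] * b[i];            // candidate for a register...
    out[i] = p + 1.0f / (p + 2.0f);   // ...or recomputed at each use
}
```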

Author: zenna (Electronic Engineer, Biomedical Engineer, C++/CUDA)

Updated on June 05, 2022

Comments

  • zenna
    zenna almost 2 years

I have a kernel which uses 17 registers; reducing it to 16 would bring me 100% occupancy. My question is: are there methods that can be used to reduce the number of registers used, excluding completely rewriting my algorithms in a different manner? I have always kind of assumed the compiler is a lot smarter than I am, so for example I often use extra variables for clarity's sake alone. Am I wrong in this thinking?

    Please note: I do know about the --max_registers (or whatever the syntax is) flag, but the use of local memory would be more detrimental than a 25% lower occupancy (I should test this)

  • phoad
    phoad almost 11 years
Each separate block needs its own cache area, and the first thread of each block should fill it. So each block is independent and needs no inter-block sync. The __syncthreads after the if statement syncs the threads within a block. Though the serial part increases this way, so it might not be a good solution.
  • phoad
    phoad almost 11 years
A thread with threadIdx.x == 6 will not calculate anything. It will get the result of the calculation from the cache, and the cache will already hold the result once the sync point is passed. Won't it?
  • phoad
    phoad almost 11 years
Do you mean the last two lines? Reading from the cache? Is there any way to fix it, __threadfence etc.?
  • personne3000
    personne3000 over 9 years
Since registers are 32-bit, and int is 32 bits on the GPU, wouldn't int and short make no difference?
  • ZeroCool
    ZeroCool over 9 years
How would 50% occupancy be sufficient? Could you please explain in more detail? Thanks a lot.