CUDA Block and Grid size efficiencies


Solution 1

It's usually a case of setting block size for optimal performance, and grid size according to the total amount of work. Most kernels have a "sweet spot" number of warps per MP where they work best, and you should do some benchmarking/profiling to see where that is. You probably still need over-spill logic in the kernel because problem sizes are rarely round multiples of block sizes.
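For illustration, a minimal sketch of that kind of over-spill guard (a hypothetical element-wise kernel, not the one from the example below; the bounds check simply idles the surplus threads in the last block):

// Hypothetical element-wise kernel showing the usual over-spill guard
__global__
void scalekernel(const int n, const double alpha, double *x)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {       // the last block may be only partially filled
        x[i] *= alpha;
    }
}

// Host side: round the grid size up from the problem size
// int threads = 128;                          // profiled "sweet spot" block size
// int blocks  = (n + threads - 1) / threads;  // round up to cover all n elements
// scalekernel<<<blocks, threads>>>(n, alpha, x);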

EDIT: To give a concrete example of how this might be done for a simple kernel (in this case a custom BLAS level 1 dscal type operation done as part of a Cholesky factorization of packed symmetric band matrices):

// Fused square root and dscal operation
__global__ 
void cdivkernel(const int n, double *a)
{
    // Reciprocal square root of the diagonal element, shared by the whole block
    __shared__ double oneondiagv;

    int imin = threadIdx.x + blockDim.x * blockIdx.x;
    int istride = blockDim.x * gridDim.x;

    // One thread computes the value; every thread reads it after the barrier
    if (threadIdx.x == 0) {
        oneondiagv = rsqrt( a[0] );
    }
    __syncthreads();

    // Grid-stride loop: the kernel covers any n, iterating when the
    // amount of work exceeds the launched grid
    for(int i=imin; i<n; i+=istride) {
        a[i] *= oneondiagv;
    }
}

To launch this kernel, the execution parameters are calculated as follows:

  1. We allow up to 4 warps per block (so 128 threads). Normally you would fix this at an optimal number, but in this case the kernel is often called on very small vectors, so having a variable block size made some sense.
  2. We then compute the block count according to the total amount of work, up to 112 total blocks, which is the equivalent of 8 blocks per MP on a 14 MP Fermi Tesla. The kernel will iterate if the amount of work exceeds grid size.

The resulting wrapper function containing the execution parameter calculations and kernel launch look like this:

// Fused the diagonal element root and dscal operation into
// a single "cdiv" operation
void fusedDscal(const int n, double *a)
{
    // The semibandwidth (column length) determines
    // how many warps are required per column of the 
    // matrix.
    const int warpSize = 32;
    const int maxGridSize = 112; // this is 8 blocks per MP for a Tesla C2050

    int warpCount = (n / warpSize) + (((n % warpSize) == 0) ? 0 : 1);
    int warpPerBlock = max(1, min(4, warpCount));

    // For the cdiv kernel, the block size is allowed to grow to
    // four warps per block, and the block count becomes the warp count over four
    // or the GPU "fill" whichever is smaller
    int threadCount = warpSize * warpPerBlock;
    int blockCount = min( maxGridSize, max(1, warpCount/warpPerBlock) );
    dim3 BlockDim = dim3(threadCount, 1, 1);
    dim3 GridDim  = dim3(blockCount, 1, 1);

    cdivkernel<<< GridDim,BlockDim >>>(n,a);
    errchk( cudaPeekAtLastError() );
}

Perhaps this gives some hints about how to design a "universal" scheme for setting execution parameters against input data size.
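One way to make the scheme above less GPU-specific is to query the device at runtime rather than hard-coding the 112-block "fill" figure. A sketch only (computeLaunchParams is a made-up name, and the 8-blocks-per-MP residency target is the same Fermi-era assumption as above, so it would need retuning for other GPUs):

// Sketch: derive the grid cap from the device instead of hard-coding it
void computeLaunchParams(const int n, dim3 &GridDim, dim3 &BlockDim)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int warpSize    = prop.warpSize;                    // normally 32
    const int blocksPerMP = 8;                                // assumed residency target
    const int maxGridSize = blocksPerMP * prop.multiProcessorCount;

    int warpCount    = (n + warpSize - 1) / warpSize;
    int warpPerBlock = max(1, min(4, warpCount));

    BlockDim = dim3(warpSize * warpPerBlock, 1, 1);
    GridDim  = dim3(min(maxGridSize, max(1, warpCount / warpPerBlock)), 1, 1);
}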

Solution 2

Ok I guess we are dealing with two questions here.

1) Good way to assign block sizes (i.e. the number of threads): this usually depends on the kind of data you are dealing with. Are you dealing with vectors? Are you dealing with matrices? The suggested way is to keep the number of threads in multiples of 32. So when dealing with vectors, launching 256 x 1 or 512 x 1 blocks may be fine, and similarly when dealing with matrices, 32 x 8 or 32 x 16.
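For example (a sketch of just the shapes suggested above; the right totals still depend on the kernel):

// Typical 1D block for vector kernels: a multiple of the 32-thread warp
dim3 vecBlock(256, 1, 1);

// Typical 2D block for matrix kernels: 32-wide in x keeps global memory
// accesses coalesced along rows, and 8 (or 16) rows gives 256 (or 512) threads
dim3 matBlock(32, 8, 1);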

2) Good way to assign grid sizes (i.e. the number of blocks): this gets a bit trickier. Just launching 10,000 blocks because we can is not normally the best way to do things. Switching blocks in and out of hardware is costly. Two things to consider are the shared memory used per block and the total number of SPs available; solve for the optimal number from those.

You can find a really good implementation of how to do that from thrust. It may take a while to figure out what's happening inside the code though.
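As a rough sketch of that kind of calculation (not Thrust's code, and using the occupancy API that was added to the CUDA runtime later; boundedGridSize is a made-up helper that caps the block count at what the device can keep resident for a given block size and per-block dynamic shared memory, assuming a grid-stride kernel so the capped grid still covers all the work):

// Sketch: limit the grid to the number of blocks the GPU can keep resident,
// taking block size and dynamic shared memory per block into account.
int boundedGridSize(const void *kernel, int blockSize, size_t sharedBytesPerBlock,
                    int workBlocks)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerMP = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerMP, kernel,
                                                  blockSize, sharedBytesPerBlock);

    int residentBlocks = blocksPerMP * prop.multiProcessorCount;
    return min(workBlocks, residentBlocks);
}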

Solution 3

I think it's usually best to set the block and grid sizes based on the problem set, especially for optimization purposes. Having extra threads that do nothing doesn't really make sense and can worsen the performance of your programs.

Solution 4

If you have dynamically sized data sets then you will likely run into some issues with latency while some threads and blocks wait for others to complete.

This site has some great heuristics. Some general highlights:

Choosing Blocks Per Grid

  • Blocks per grid should be >= number of multiprocessors.
  • The more use of __syncthreads() in your kernels, the more blocks (so that one block can run while another waits to sync)

Choosing Threads Per Block

  • Threads in multiples of warp size (i.e. generally 32)

  • Generally good to choose the number of threads per block so that the hardware's maximum number of resident threads per multiprocessor is a multiple of it. E.g. with a maximum of 768 resident threads, using 256 threads per block will tend to be better than 512, because three 256-thread blocks can then run concurrently on a multiprocessor. (A short device-query sketch follows this list.)
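A short sketch of how to pull the numbers those heuristics depend on from the device itself (property names are from the CUDA runtime API; picking the final block size still comes down to profiling):

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("multiprocessors:       %d\n", prop.multiProcessorCount);      // lower bound on blocks per grid
    printf("warp size:             %d\n", prop.warpSize);                 // block size should be a multiple of this
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max threads per MP:    %d\n", prop.maxThreadsPerMultiProcessor);

    // e.g. pick a block size that is a multiple of warpSize and divides
    // maxThreadsPerMultiProcessor evenly, and launch at least
    // multiProcessorCount blocks.
    return 0;
}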

Author: Bolster

I'm a PhD student of Electronics & Software Engineering at Queen's University Belfast's Institute of Electronics, Communications, and Information Technology (ECIT). I want to get more involved in FOSS (as I've been using FOSS for years now). I try to document my experiments and experiences on my blog, so check it out. Also an active founder of QUESTS (Queen's University Engineering, Science, and Technology Society) and Farset Labs in Belfast, Northern Ireland.

Updated on June 18, 2022

Comments

  • Bolster
    Bolster almost 2 years

    What is the advised way of dealing with dynamically-sized datasets in cuda?

    Is it a case of 'set the block and grid sizes based on the problem set' or is it worthwhile to assign block dimensions as factors of 2 and have some in-kernel logic to deal with the over-spill?

    I can see how this probably matters a lot for the block dimensions, but how much does this matter to the grid dimensions? As I understand it, the actual hardware constraints stop at the block level (i.e. blocks assigned to SMs that have a set number of SPs, and so can handle a particular warp size).

    I've perused Kirk's 'Programming Massively Parallel Processors' but it doesn't really touch on this area.

  • Bolster
    Bolster about 13 years
    Any ideas in the griddim area?
  • Pavan Yalamanchili
    Pavan Yalamanchili about 13 years
    Well, you are partially right. Having 16 threads (half a warp) instead of, say, 14 makes sense, rather than going all the way to 256.
  • Markus Joschko
    Markus Joschko about 13 years
    Pavan: Could you point out where in Thrust this calculation is happening?
  • talonmies
    talonmies about 13 years
    @Ashwin: thrust::detail::backend::cuda::detail::launch_closure contains all the gory details.
  • talonmies
    talonmies about 13 years
    Not that awesome. I sanitised the code a bit before I posted it and introduced an error in the block size calculation in the process. Now fixed, but hopefully you get the idea anyway....
  • Pavan Yalamanchili
    Pavan Yalamanchili about 13 years
    @talonmies, this is beautiful. My one query though is regarding this: "for(int i=imin; i<n; i+=istride)". Looks like this would make the work a bit uneven across blocks, especially if n = 1.5 * stride.
  • talonmies
    talonmies about 13 years
    @Pavan: yes, it does mean that some blocks will finish earlier than others, and the "last" block will have some warp divergence. But overall I have still found it better than the alternatives, like having a "half" GPU worth of blocks at the end of a kernel launch. Keeping the kernel blocks resident helps amortize the "setup" indexing and square root calculations, so their impact on the overall performance is lowered.
  • Pavan Yalamanchili
    Pavan Yalamanchili about 13 years
    @talonmies, I did not mean you needed to launch more blocks, just that the work might have been spread out across the blocks evenly. Like if you had n = 1.5 * stride, instead of retiring half the blocks and performing another step for the rest, you could try retiring half the warps per block and spread the work across all blocks. Just offering a thought, because that is how I usually do things. This (your code) here is slightly new and might be a great idea for a few applications. I need to test it out :)
  • Bolster
    Bolster about 13 years
    @talonmies saw the 'error' but I just assumed it was an implementation-specific modification. Wrote my own based on the general theory and it works grand, with some modifications based on the occupancy characteristics of my kernels.
  • Bolster
    Bolster almost 13 years
    Just came across an interesting corner case for this. If your workload (in this case n) is 65535, i.e. the max grid dimension, this falls over because it tries to do 65536 to satisfy n % warpSize == 0. Other than that this is very handy.
  • talonmies
    talonmies almost 13 years
    @Andrew Bolster: That corner case doesn't exist as the code was written because there is a limit on the maximum number of blocks which the launch can use. The work can ask for an impossibly large block count, but it will always be limited to maxGridSize. The only way that could fail was if you set maxGridSize to be larger than the maximum number of blocks in a 1D grid the hardware allows.
  • Bolster
    Bolster almost 13 years
    Which was exactly the mistake that I found in my application :)
  • syntagma
    syntagma about 10 years
    What is the point of using max() here: max(1, min(4, warpCount))?