memory allocation inside a CUDA kernel


I think the reason introducing malloc() slows your code down is that in-kernel malloc() allocates from a heap that lives in global memory. When you use a fixed-size array, the compiler is likely to put it in the register file, which is much faster.
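To make the contrast concrete, here is a minimal sketch (kernel names and sizes are mine, not from the question). The first kernel pays for a device-heap allocation in global memory on every thread; the second uses a compile-time-sized array and never touches the allocator:

    // Per-thread malloc: allocates from the device heap, which lives in
    // global memory, and every thread pays for the allocator call.
    __global__ void heapVersion(float* out, int n)
    {
        int xid = threadIdx.x + blockDim.x * blockIdx.x;
        float* buf = (float*)malloc(n * sizeof(float)); // global memory
        if (buf == NULL) return; // in-kernel malloc can fail if the heap is exhausted
        for (int i = 0; i < n; i++)
            buf[i] = i * 0.5f;
        out[xid] = buf[n - 1];
        free(buf);
    }

    // Fixed-size alternative: a compile-time-sized array that the compiler
    // can keep in registers (when indexing is resolvable at compile time)
    // or local memory; either way, no allocator call.
    #define MAX_N 32
    __global__ void fixedVersion(float* out, int n)
    {
        int xid = threadIdx.x + blockDim.x * blockIdx.x;
        float buf[MAX_N];
        for (int i = 0; i < n && i < MAX_N; i++)
            buf[i] = i * 0.5f;
        out[xid] = buf[0];
    }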

Having to do a malloc inside your kernel may mean that you're trying to do too much work in a single kernel. If each thread allocates a different amount of memory, then each thread executes the for loop a different number of times, and you get a lot of warp divergence.

If each thread in a warp runs the loop the same number of times, just allocate up front. Even if they run it a different number of times, you can allocate a constant (worst-case) size. Better still, look at how you can refactor your code to remove that loop from your kernel entirely.
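One way to allocate up front is to carve a per-thread slice out of a single buffer that the host allocates once before the launch. A rough sketch, where MAX_COMPONENTS, plainPrealloc, and workspace are illustrative names and the loop body is simplified from the question's code:

    #include <cuda_runtime.h>

    #define MAX_COMPONENTS 64 // assumed worst-case per-thread count

    // Each thread takes its own slice of a workspace that was allocated
    // once on the host, so no thread ever calls malloc().
    __global__ void plainPrealloc(float* probs, int* nComponents, float* workspace)
    {
        int xid = threadIdx.x + (blockDim.x * blockIdx.x);
        float* currentProbs = workspace + xid * MAX_COMPONENTS;

        // cumulative sum, as in the question, minus the per-thread malloc/free
        currentProbs[0] = probs[0];
        for (int k = 1; k < nComponents[0] && k < MAX_COMPONENTS; k++)
            currentProbs[k] = currentProbs[k - 1] + probs[k];
    }

    // Host side, done once before the launch:
    //   float* workspace;
    //   cudaMalloc(&workspace, numThreads * MAX_COMPONENTS * sizeof(float));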


Comments

  • RNs_Ghost, almost 2 years:

    I have the following snippet of a kernel:

    __global__ void plain(int* geneVec, float* probs, int* nComponents, float* randomNumbers, int* nGenes)
    {
        int xid = threadIdx.x + (blockDim.x * blockIdx.x);

        // tmp, start, and prob come from code elided in this snippet
        float* currentProbs = (float*)malloc(sizeof(float) * tmp);

             .....
             .....

        // running sum of the component probabilities
        currentProbs[0] = probs[start];
        for (int k = 1; k < nComponents[0]; k++)
        {
            currentProbs[k] = currentProbs[k-1] + prob;
        }

           ...
           ...
        free(currentProbs);
    }
    

    When currentProbs is static (even with the same sizes) it's very fast, but when it's dynamically allocated (as above) performance is awful.
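    By "static" I mean roughly this, with MAX_COMPONENTS standing in for the compile-time size:

    float currentProbs[MAX_COMPONENTS]; // fixed size, no malloc/free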

    This question said I could do this inside a kernel: CUDA allocate memory in __device__ function

    Here is a related question: Efficiency of Malloc function in CUDA

    Has anyone found a way around this other than the method proposed in the paper? It seems ridiculous that one cannot malloc/free inside a kernel without this sort of penalty.