memory allocation inside a CUDA kernel
I think the reason introducing malloc() slows your code down is that it allocates from the device heap, which lives in global memory. When you use a fixed-size array, the compiler is likely to put it in the register file, which is much faster.
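For illustration, here is a minimal sketch of the two variants. The names (withMalloc, withFixedArray, MAX_COMPONENTS) are made up for the example and the arithmetic is a stand-in for real work:

    #define MAX_COMPONENTS 32

    __global__ void withMalloc(float* out, int n)
    {
        if (n <= 0) return;
        // served from the device heap in global memory; can also fail at runtime
        float* buf = (float*)malloc(sizeof(float) * n);
        if (buf == NULL) return;
        for (int k = 0; k < n; k++) buf[k] = k * 0.5f;
        out[threadIdx.x] = buf[n - 1];
        free(buf);
    }

    __global__ void withFixedArray(float* out, int n)
    {
        // a fixed size lets the compiler keep the array in registers
        // (or spill to local memory if it is too large or dynamically indexed)
        float buf[MAX_COMPONENTS];
        int m = min(max(n, 1), MAX_COMPONENTS);
        for (int k = 0; k < m; k++) buf[k] = k * 0.5f;
        out[threadIdx.x] = buf[m - 1];
    }

Note that on devices of compute capability 2.0 and later, in-kernel malloc() draws from a device heap whose size is set with cudaDeviceSetLimit(cudaLimitMallocHeapSize).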
Having to do a malloc inside your kernel may mean that you're trying to do too much work in a single kernel. If each thread allocates a different amount of memory, then each thread executes the for loop a different number of times, and you get lots of warp divergence.
If every thread in a warp runs the loop the same number of times, just allocate up front. Even if they run different numbers of times, you can use a constant (worst-case) size. Better still, look at how you can refactor your code to remove that loop from your kernel entirely.
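For example, here is a minimal sketch of the allocate-up-front approach, loosely adapted from the snippet in the question: the host makes one cudaMalloc call sized for a worst-case maxComponents per thread, and each thread indexes its own slice. The names (plainPrealloc, scratch, maxComponents) and the loop body are illustrative assumptions, not the asker's actual code:

    __global__ void plainPrealloc(float* scratch, const float* probs,
                                  const int* nComponents, int maxComponents)
    {
        int xid = threadIdx.x + (blockDim.x * blockIdx.x);
        // each thread gets a fixed-size slice of one pre-allocated buffer,
        // so no in-kernel malloc/free is needed
        float* currentProbs = scratch + (size_t)xid * maxComponents;

        currentProbs[0] = probs[0];
        for (int k = 1; k < nComponents[0] && k < maxComponents; k++) {
            currentProbs[k] = currentProbs[k - 1] + probs[k];
        }
    }

    // Host side: one allocation before launch replaces a malloc per thread.
    // float* scratch;
    // cudaMalloc(&scratch, (size_t)totalThreads * maxComponents * sizeof(float));
    // plainPrealloc<<<blocks, threadsPerBlock>>>(scratch, d_probs, d_nComponents, maxComponents);
    // cudaFree(scratch);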
RNs_Ghost
Updated on July 09, 2022

Comments

RNs_Ghost, almost 2 years ago:
I have the following snippet of a kernel:
    __global__ void plain(int* geneVec, float* probs, int* nComponents,
                          float* randomNumbers, int* nGenes)
    {
        int xid = threadIdx.x + (blockDim.x * blockIdx.x);
        float* currentProbs = (float*)malloc(sizeof(float) * tmp);
        .....
        .....
        currentProbs[0] = probs[start];
        for (k = 1; k < nComponents[0]; k++) {
            currentProbs[k] = currentProbs[k - 1] + prob;
        }
        ...
        ...
        free(currentProbs);
    }
When currentProbs is statically sized (even at the same size) the kernel is very fast, but when it is dynamically allocated (as above) performance is awful.
This question said I could do this inside a kernel: CUDA allocate memory in __device__ function
Here is a related question: Efficiency of Malloc function in CUDA
I was wondering whether any other methods have solved this, besides the one proposed in the paper. It seems ridiculous that you cannot malloc/free inside a kernel without this sort of penalty.