CUDA allocating array of arrays

memory-management cuda allocation


Solution 1

Allocate an array of pointers in host memory, then allocate device memory for each array and store its device pointer in that host array. Next, allocate device memory to hold the pointer table, and copy the host array into it. An example is worth a thousand words:

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void multi_array_kernel( int N, void** arrays ){
    // stuff
}


int main(){

    const int N_ARRAYS = 20;

    // Host-side table that holds one device pointer per array
    void **h_array = (void**)malloc(sizeof(void*) * N_ARRAYS);
    for(int i = 0; i < N_ARRAYS; i++){
        cudaMalloc(&h_array[i], i * sizeof(void*));
        //TODO: check error
    }

    // Device-side copy of the pointer table
    void **d_array;
    cudaMalloc((void**)&d_array, sizeof(void*) * N_ARRAYS);

    // Copy the pointer table to device memory
    cudaMemcpy(d_array, h_array, sizeof(void*) * N_ARRAYS, cudaMemcpyHostToDevice);

    multi_array_kernel<<<1,1>>>(N_ARRAYS, d_array);
    cudaDeviceSynchronize();

    for(int i = 0; i < N_ARRAYS; i++){
        cudaFree(h_array[i]); // h_array lives on the host, but each entry is a device pointer
        //TODO: check error
    }
    cudaFree(d_array);
    free(h_array);
    return 0;
}

Solution 2

I don't believe this is supported. cudaMalloc() allocates device memory, but stores the resulting address in a variable on the host. In your for-loop, you are passing it addresses in device memory, which the host cannot write to.

Depending on what you're trying to accomplish, you may want to allocate data with normal host malloc() before calling the for-loop as you currently have it. Or allocate a single big block of device memory and compute offsets into it manually.

Look at Sections 2.4, 3.2.1 and B.2.5 (bottom) of the CUDA Programming Guide for more discussion of this. Specifically, on the bottom of page 108:

The address obtained by taking the address of a __device__, __shared__ or __constant__ variable can only be used in device code.
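The "single big block" alternative mentioned above can be sketched on the host side as follows. This is only an illustration under stated assumptions: `PackedLayout`, `align_up` and `compute_layout` are hypothetical names that do not appear in the original answer, and the 256-byte default alignment is an assumption chosen to keep each sub-array's start suitably aligned. The code computes byte offsets only; it makes no CUDA calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: where each array starts inside one big allocation.
struct PackedLayout {
    std::vector<std::size_t> offsets;  // byte offset of each array in the block
    std::size_t total_bytes = 0;       // size to request from a single allocation
};

// Round n up to the next multiple of alignment (alignment must be a power of two).
inline std::size_t align_up(std::size_t n, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

// Compute the starting offset of every array when they are packed back to back.
// The default alignment of 256 bytes is an assumption, not a requirement.
inline PackedLayout compute_layout(const std::vector<std::size_t>& sizes_in_bytes,
                                   std::size_t alignment = 256) {
    PackedLayout layout;
    layout.offsets.reserve(sizes_in_bytes.size());
    std::size_t cursor = 0;
    for (std::size_t size : sizes_in_bytes) {
        layout.offsets.push_back(cursor);             // array starts at the cursor
        cursor = align_up(cursor + size, alignment);  // next array starts aligned
    }
    layout.total_bytes = cursor;
    return layout;
}
```

With such a layout, one cudaMalloc() of `total_bytes` would replace the per-array calls, and array i would begin at the block's base address plus `offsets[i]`. The offset table (or the pointers derived from it) would still need to be copied to the device if the kernel has to look arrays up by index.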

Solution 3

You cannot use

cudaMalloc(&h_array[i], i * sizeof(void*));

when h_array is declared as void *, because a void * cannot be indexed.

Declare it with a concrete pointer type instead, for example

CUdeviceptr *h_array = (CUdeviceptr *)malloc(sizeof(CUdeviceptr) * N);

or

int **h_array = (int **)malloc(sizeof(int *) * N);

and cast the element's address to void ** in the call:

cudaMalloc((void **)&h_array[i], i * sizeof(void*));

Solution 4

I think in the first loop it should be &h_array[i] not &d_array[i].

Solution 5

I had the same problem and managed to solve it.

FabrizioM's answer was a good starting point and helped me a lot. Nevertheless, I ran into some problems when I tried to transfer the code to my project. Using the additional comments and posts, I was able to write a working example (VS2012, CUDA 7.5). I am posting my code as an additional answer and as a starting point for others.

To understand the naming: I'm using a vector of OpenCV cv::Mat as input, captured from multiple cameras, and I am processing these images in the kernel.

void TransferCameraImageToCuda(const std::vector<cv::Mat*>* Images)
{
    int NumberCams = Images->size();
    int imageSize  = Images->at(0)->cols * Images->at(0)->rows;

    CUdeviceptr* CamArraysAdressOnDevice_H;
    CUdeviceptr* CamArraysAdressOnDevice_D;

    // allocate memory on the host to store the device address of each array
    CamArraysAdressOnDevice_H = new CUdeviceptr[NumberCams];

    // allocate memory on the device and copy each image into it
    for (int i = 0; i < NumberCams; i++){
        cudaMalloc((void**)&(CamArraysAdressOnDevice_H[i]), imageSize * sizeof(unsigned short));
        cudaMemcpy((void*)CamArraysAdressOnDevice_H[i], Images->at(i)->data, imageSize * sizeof(unsigned short), cudaMemcpyHostToDevice);
    }

    // allocate memory on the device to store the device addresses of the arrays
    cudaMalloc((void**)&CamArraysAdressOnDevice_D, sizeof(CUdeviceptr) * NumberCams);

    // copy the address of each device array to the device
    cudaMemcpy(CamArraysAdressOnDevice_D, CamArraysAdressOnDevice_H, sizeof(CUdeviceptr) * NumberCams, cudaMemcpyHostToDevice);
}

In the kernel launch I'm casting the device pointer to the data type pointer (unsigned short**):

DummyKernel<<<gridDim,blockDim>>>(NumberCams, (unsigned short**) CamArraysAdressOnDevice_D);

and the kernel definition is for example:

__global__ void DummyKernel(int NumberImages, unsigned short** CamImages)
{
    int someIndex = 3458;
    printf("Value Image 0 : %d \n", CamImages[0][someIndex]);
    printf("Value Image 1 : %d \n", CamImages[1][someIndex]);
    printf("Value Image 2 : %d \n", CamImages[2][someIndex]);
}


Author by

Admin

Updated on June 06, 2022

Comments

Admin almost 2 years

I have some trouble allocating an array of arrays in CUDA.

void ** data;
cudaMalloc(&data, sizeof(void**)*N); // allocates without problems
for(int i = 0; i < N; i++) {
    cudaMalloc(data + i, getSize(i) * sizeof(void*)); // seg fault is thrown
}

What did I do wrong?