Copying a struct containing pointers to CUDA device

18,662

Solution 1

Edit: CUDA 6 introduces Unified Memory, which makes this "deep copy" problem a lot easier. See this post for more details.


Don't forget that you can pass structures by value to kernels. This code works:

// pass struct by value (may not be efficient for complex structures)
__global__ void kernel2(StructA in)
{
    in.arr[threadIdx.x] *= 2;
}

Doing so means you only have to copy the array to the device, not the structure:

int h_arr[N] = {1,2,3,4,5,6,7,8,9,10};
StructA h_a;
int *d_arr;

// 1. Allocate device array.
cudaMalloc((void**) &(d_arr), sizeof(int)*N);

// 2. Copy array contents from host to device.
cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice);

// 3. Point to device pointer in host struct.
h_a.arr = d_arr;

// 4. Call kernel with host struct as argument
kernel2<<<N,1>>>(h_a);

// 5. Copy pointer from device to host.
cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);

// 6. Point to host pointer in host struct 
//    (or do something else with it if this is not needed)
h_a.arr = h_arr;

Solution 2

As pointed out by Mark Harris, structures can be passed by values to CUDA kernels. However, some care should be devoted to set up a proper destructor since the destructor is called at exit from the kernel.

Consider the following example

#include <stdio.h>

#include "Utilities.cuh"

#define NUMBLOCKS  512
#define NUMTHREADS 512 * 2

/***************/
/* TEST STRUCT */
/***************/
struct Lock {

    int *d_state;

    // --- Constructor
    Lock(void) {
        int h_state = 0;                                        // --- Host side lock state initializer
        gpuErrchk(cudaMalloc((void **)&d_state, sizeof(int)));  // --- Allocate device side lock state
        gpuErrchk(cudaMemcpy(d_state, &h_state, sizeof(int), cudaMemcpyHostToDevice)); // --- Initialize device side lock state
    }

    // --- Destructor (wrong version)
    //~Lock(void) { 
    //  printf("Calling destructor\n");
    //  gpuErrchk(cudaFree(d_state)); 
    //}

    // --- Destructor (correct version)
//  __host__ __device__ ~Lock(void) {
//#if !defined(__CUDACC__)
//      gpuErrchk(cudaFree(d_state));
//#else
//
//#endif
//  }

    // --- Lock function
    __device__ void lock(void) { while (atomicCAS(d_state, 0, 1) != 0); }

    // --- Unlock function
    __device__ void unlock(void) { atomicExch(d_state, 0); }
};

/**********************************/
/* BLOCK COUNTER KERNEL WITH LOCK */
/**********************************/
__global__ void blockCounterLocked(Lock lock, int *nblocks) {

    if (threadIdx.x == 0) {
        lock.lock();
        *nblocks = *nblocks + 1;
        lock.unlock();
    }
}

/********/
/* MAIN */
/********/
int main(){

    int h_counting, *d_counting;
    Lock lock;

    gpuErrchk(cudaMalloc(&d_counting, sizeof(int)));

    // --- Locked case
    h_counting = 0;
    gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));

    blockCounterLocked << <NUMBLOCKS, NUMTHREADS >> >(lock, d_counting);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Counting in the locked case: %i\n", h_counting);

    gpuErrchk(cudaFree(d_counting));
}

with the uncommented destructor (do not pay too much attention on what the code actually does). If you run that code, you will receive the following output

Calling destructor
Counting in the locked case: 512
Calling destructor
GPUassert: invalid device pointer D:/Project/passStructToKernel/passClassToKernel/Utilities.cu 37

There are then two calls to the destructor, once at the kernel exit and once at the main exit. The error message is related to the fact that, if the memory locations pointed to by d_state are freed at the kernel exit, they cannot be freed anymore at the main exit. Accordingly, the destructor must be different for host and device executions. This is accomplished by the commented destructor in the above code.

Share:
18,662
Thorkil Holm-Jacobsen
Author by

Thorkil Holm-Jacobsen

Updated on June 10, 2022

Comments

  • Thorkil Holm-Jacobsen
    Thorkil Holm-Jacobsen almost 2 years

    I'm working on a project where I need my CUDA device to make computations on a struct containing pointers.

    typedef struct StructA {
        int* arr;
    } StructA;
    

    When I allocate memory for the struct and then copy it to the device, it will only copy the struct and not the content of the pointer. Right now I'm working around this by allocating the pointer first, then set the host struct to use that new pointer (which resides on the GPU). The following code sample describes this approach using the struct from above:

    #define N 10
    
    int main() {
    
        int h_arr[N] = {1,2,3,4,5,6,7,8,9,10};
        StructA *h_a = (StructA*)malloc(sizeof(StructA));
        StructA *d_a;
        int *d_arr;
    
        // 1. Allocate device struct.
        cudaMalloc((void**) &d_a, sizeof(StructA));
    
        // 2. Allocate device pointer.
        cudaMalloc((void**) &(d_arr), sizeof(int)*N);
    
        // 3. Copy pointer content from host to device.
        cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice);
    
        // 4. Point to device pointer in host struct.
        h_a->arr = d_arr;
    
        // 5. Copy struct from host to device.
        cudaMemcpy(d_a, h_a, sizeof(StructA), cudaMemcpyHostToDevice);
    
        // 6. Call kernel.
        kernel<<<N,1>>>(d_a);
    
        // 7. Copy struct from device to host.
        cudaMemcpy(h_a, d_a, sizeof(StructA), cudaMemcpyDeviceToHost);
    
        // 8. Copy pointer from device to host.
        cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);
    
        // 9. Point to host pointer in host struct.
        h_a->arr = h_arr;
    }
    

    My question is: Is this the way to do it?

    It seems like an awful lot of work, and I remind you that this is a very simple struct. If my struct contained a lot of pointers or structs with pointers themselves, the code for allocation and copy will be quite extensive and confusing.