cudaMallocHost vs malloc for better performance shows no difference


Solution 1

If you want to see the difference in execution time for the copy operation, time just the copy operation. In many cases you will see approximately a 2x difference in execution time for the copy alone when the underlying host memory is pinned. Also make your copy operation large enough/long enough that you are well above the granularity of whatever timing mechanism you are using. The various profilers, such as the visual profiler and nvprof, can help here.
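As a sketch of that approach (buffer size here is illustrative), the copy alone can be timed with CUDA events rather than a wall-clock timer, which removes the allocation cost from the measurement entirely:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 64 * 1024 * 1024;      // 64 MB: well above timer granularity
    float *h_pinned, *d_buf;
    cudaMallocHost((void **)&h_pinned, size);  // pinned host allocation
    cudaMalloc((void **)&d_buf, size);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time only the host-to-device copy, not the allocations
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pinned, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D copy from pinned memory: %.3f ms (%.2f GB/s)\n",
           ms, (size / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}
```

Swapping the `cudaMallocHost`/`cudaFreeHost` pair for `malloc`/`free` and rerunning should make the pinned-vs-pageable copy difference visible.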

The cudaMallocHost operation under the hood is doing something like a malloc plus additional OS functions to "pin" each page associated with the allocation. These additional OS operations take extra time, as compared to just doing a malloc. And note that as the size of the allocation increases, the registration ("pinning") cost will generally increase as well.

Therefore, for many examples, just timing the overall execution doesn't show much difference, because while the cudaMemcpy operation may be quicker from pinned memory, the cudaMallocHost takes longer than the corresponding malloc.

So what's the point?

  1. You may be interested in using pinned memory (i.e. cudaMallocHost) when you will be doing repeated transfers from a single buffer. You only pay the extra cost to pin it once, but you benefit on each transfer/usage.
  2. Pinned memory is required to overlap data transfer operations (cudaMemcpyAsync) with compute activities (kernel calls). Refer to the programming guide.
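Point 1 in a minimal sketch (size and iteration count are illustrative): the pinning cost is paid once in `cudaMallocHost`, while every copy in the loop benefits from the faster transfer path:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 16 * 1024 * 1024;   // 16 MB, illustrative
    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, size);  // pay the pinning cost once here...
    cudaMalloc((void **)&d_buf, size);

    for (int i = 0; i < 100; ++i) {
        // ...and benefit on every one of these repeated transfers
        cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);
    }

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```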

Solution 2

I too found that just allocating a piece of memory with cudaHostAlloc / cudaMallocHost doesn't do much by itself. To be sure, run nvprof with --print-gpu-trace (e.g. `nvprof --print-gpu-trace ./your_app`) and check whether the throughput for memcpyHtoD or memcpyDtoH is good. For PCIe 2.0, you should get around 6-8 GB/s.

However, pinned memory is a prerequisite for cudaMemcpyAsync. After calling cudaMemcpyAsync, I moved whatever computations I had on the host right after it. In this way you can "layer" the asynchronous memcpys with the host computations.

I was surprised how much time I was able to save this way; it's worth a try.
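A minimal sketch of that pattern (the kernel, sizes, and the host-side work are placeholders): the async copy and kernel are queued on a stream, and independent CPU work runs while the GPU side proceeds:

```cuda
#include <cuda_runtime.h>

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// Placeholder for whatever CPU-side computation you have
static void independent_host_work(void) { }

int main(void)
{
    const int N = 1 << 20;
    const size_t size = N * sizeof(float);
    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, size);  // pinned: required for truly async copies
    cudaMalloc((void **)&d_buf, size);
    for (int i = 0; i < N; ++i) h_buf[i] = (float)i;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The async copy returns control to the host immediately;
    // the kernel in the same stream is ordered after the copy...
    cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, stream);
    square_array<<<(N + 255) / 256, 256, 0, stream>>>(d_buf, N);

    // ...so independent CPU work here overlaps with the transfer/kernel
    independent_host_work();

    cudaStreamSynchronize(stream);  // wait for the GPU before using results
    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```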

Author: user3545251

Updated on June 04, 2022

Comments

  • user3545251 (almost 2 years ago)

    I have gone through this site. From there I got that pinned memory allocated using cudaMallocHost gives better performance than memory from malloc. Then I used two simple programs and compared the execution times.

    using cudaMallocHost

    #include <stdio.h>
    #include <time.h>
    #include <cuda.h>
    
    // Kernel that executes on the CUDA device
    __global__ void square_array(float *a, int N)
    {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N) a[idx] = a[idx] * a[idx];
    }
    
    // main routine that executes on the host
    int main(void)
    {
      clock_t start = clock();
      clock_t finish;
      float *a_h, *a_d;  // Pointers to host & device arrays
      const int N = 100000;  // Number of elements in arrays
      size_t size = N * sizeof(float);
      cudaMallocHost((void **) &a_h, size); // Allocate pinned array on host
      //a_h = (float *)malloc(size);        // (pageable version, for comparison)
      cudaMalloc((void **) &a_d, size);     // Allocate array on device
      // Initialize host array and copy it to CUDA device
      for (int i = 0; i < N; i++) a_h[i] = (float)i;
      cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
      // Do calculation on device:
      int block_size = 4;
      int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
      square_array <<< n_blocks, block_size >>> (a_d, N);
      // Retrieve result from device and store it in host array
      cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
      // Print results
      for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
      // Cleanup
      cudaFreeHost(a_h);
      cudaFree(a_d);
      finish = clock() - start;
      double interval = finish / (double)CLOCKS_PER_SEC;
      printf("%f seconds elapsed\n", interval);
      return 0;
    }
    
    

    using malloc

    #include <stdio.h>
    #include <time.h>
    #include <cuda.h>
    
    // Kernel that executes on the CUDA device
    __global__ void square_array(float *a, int N)
    {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N) a[idx] = a[idx] * a[idx];
    }
    
    // main routine that executes on the host
    int main(void)
    {
      clock_t start = clock();
      clock_t finish;
      float *a_h, *a_d;  // Pointers to host & device arrays
      const int N = 100000;  // Number of elements in arrays
      size_t size = N * sizeof(float);
      a_h = (float *)malloc(size);        // Allocate pageable array on host
      cudaMalloc((void **) &a_d, size);   // Allocate array on device
      // Initialize host array and copy it to CUDA device
      for (int i = 0; i < N; i++) a_h[i] = (float)i;
      cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
      // Do calculation on device:
      int block_size = 4;
      int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
      square_array <<< n_blocks, block_size >>> (a_d, N);
      // Retrieve result from device and store it in host array
      cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
      // Print results
      for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
      // Cleanup
      free(a_h);
      cudaFree(a_d);
      finish = clock() - start;
      double interval = finish / (double)CLOCKS_PER_SEC;
      printf("%f seconds elapsed\n", interval);
      return 0;
    }
    
    

    During execution, the times of both programs were almost the same. Is there anything wrong in the implementation? What is the exact difference in execution between malloc and cudaMallocHost?

    Also, with each run the execution time decreases.