CUDA kernel doesn't launch

Solution 1

Are you sure that your CUDA device supports the SM_20 architecture?

Remove the -arch= option from your nvcc command line and rebuild everything. With no architecture specified, nvcc compiles for the 1.0 CUDA architecture, which is supported on all CUDA devices. If it still doesn't run, do a clean build and make sure there are no stale object files left anywhere. Then rebuild and run.

Also, -arch= refers to the virtual architecture, which should be something like compute_10. sm_20 is a real architecture, and I believe it belongs with the -code= switch rather than -arch= (although nvcc does accept -arch=sm_XX as a shorthand that sets both).
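One way to answer the "does my device support it?" question is to ask the runtime directly. The following is a minimal sketch, not a definitive check: deviceSupports is a hypothetical helper name, and it assumes the CUDA runtime API (cudaGetDeviceProperties) is available and that comparing the reported major/minor compute capability is enough for your purposes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): returns true if `device`
// reports a compute capability of at least requiredMajor.requiredMinor.
// Returns false if the query itself fails (e.g. no such device).
bool deviceSupports(int device, int requiredMajor, int requiredMinor)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return false;
    }

    printf("Device %d: %s, compute capability %d.%d\n",
           device, prop.name, prop.major, prop.minor);

    // Compare major versions first, then minor versions on a tie.
    if (prop.major != requiredMajor)
        return prop.major > requiredMajor;
    return prop.minor >= requiredMinor;
}
```

Calling deviceSupports(0, 2, 0) from main() before launching anything would tell you whether a binary built for sm_20 has any chance of running on device 0.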

Solution 2

The reason it is not printing when using printf is that kernel launches are asynchronous and your program is exiting before the printf buffer gets flushed. Section B.16 of the CUDA (5.0) C Programming Guide explains this.

The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:

  • Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),
  • Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),
  • Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
  • Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
  • Context destruction via cudaDeviceReset() or cuCtxDestroy().

For this reason, this program prints nothing:

#include <stdio.h>

__global__ void myKernel() 
{ 
  printf("Hello, world from the device!\n"); 
} 

int main() 
{ 
  myKernel<<<1,10>>>(); 
} 

But this program prints "Hello, world from the device!\n" ten times.

#include <stdio.h>

__global__ void myKernel() 
{ 
  printf("Hello, world from the device!\n"); 
} 

int main() 
{ 
  myKernel<<<1,10>>>(); 
  cudaDeviceSynchronize();
} 
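
A kernel that silently "doesn't launch" can also be the result of an error that nobody checked for. As a hedged sketch (helloKernel and launchAndCheck are hypothetical names introduced here, and the error handling is deliberately minimal), the launch can be wrapped so that both launch-time and run-time failures are reported:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void helloKernel()
{
    printf("Hello, world from the device!\n");
}

// Hypothetical helper: launches helloKernel and returns the first error seen.
cudaError_t launchAndCheck(int blocks, int threadsPerBlock)
{
    helloKernel<<<blocks, threadsPerBlock>>>();

    // Errors detected at launch time, such as an invalid configuration or
    // a binary that contains no code for this device's architecture.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors raised while the kernel runs. Synchronizing here also flushes
    // the device-side printf buffer.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

Calling launchAndCheck(1, 10) from main() mirrors the program above. If the binary was built for an architecture the device cannot run, the message printed after cudaGetLastError() should say so instead of the program exiting quietly.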
Author: Tarek

Updated on June 06, 2022

Comments

  • Tarek
    Tarek almost 2 years

    My problem is very much like this one. I run the simplest CUDA program, but the kernel doesn't launch. However, I am sure that my CUDA installation is OK, since I can run complicated CUDA projects consisting of several files (which I took from someone else) with no problems. In those projects, compilation and linking are done through makefiles with a lot of flags. I think the problem lies in choosing the correct flags when compiling. I simply use a command like this: nvcc -arch=sm_20 -lcudart test.cu with a program such as the following (to run on a Linux machine):

     __global__ void myKernel() 
    { 
    
        cuPrintf("Hello, world from the device!\n"); 
    
    
    } 
    int main() 
    { 
        cudaPrintfInit(); 
        myKernel<<<1,10>>>(); 
        cudaPrintfDisplay(stdout, true);    
        cudaPrintfEnd(); 
    } 
    

    The program compiles correctly. When I add cudaMemcpy() operations, they return no error. Any suggestions as to why the kernel doesn't launch?

  • Tarek
    Tarek over 11 years
    Thanks. I removed it and the kernel finally printed using cuPrintf.
  • Tarek
    Tarek over 11 years
    I now remember that I had to use '-arch=sm_20' in the first place because I perform atomicAdd operations on float variables, which can't be done with sm_10. Is there any alternative?
  • dthorpe
    dthorpe over 11 years
    Find out what your hardware is capable of. It's difficult to run code that your hardware doesn't support. ;>
  • talonmies
    talonmies over 11 years
    cudaPrintfDisplay implicitly synchronizes the context, so that isn't the problem in the original code.
  • harrism
    harrism over 11 years
    Thanks, I removed the last line from my answer so it no longer indicates otherwise.