What does "RuntimeError: CUDA error: device-side assert triggered" in PyTorch mean?

python gpu pytorch

21,858

Solution 1

When a device-side error is detected while CUDA device code is running, that error is reported via the usual CUDA runtime API error reporting mechanism. The usual detected error in device code would be something like an illegal address (e.g. attempt to dereference an invalid pointer) but another type is a device-side assert. This type of error is generated whenever a C/C++ assert() occurs in device code, and the assert condition is false.

Such an error is caused by one specific kernel. Because kernel launches in CUDA are asynchronous, the error may only be reported at a later, unrelated API call, but there are at least 3 possible methods to start debugging it:

1. Modify the source code to effectively convert asynchronous kernel launches to synchronous kernel launches, and do rigorous error checking after each kernel launch. This will identify the specific kernel that caused the error. At that point it may be sufficient simply to look at the various asserts in that kernel code, but you could also use method 2 or 3 below.
2. Run your code with cuda-memcheck. This is a tool something like "valgrind for device code". When you run your code with cuda-memcheck, it will tend to run much more slowly, but the runtime error reporting will be enhanced. It is also usually preferable to compile your code with -lineinfo. In that scenario, when a device-side assert is triggered, cuda-memcheck will report the source code line number where the assert is, and also the assert itself and the condition that was false. You can see here for a walkthrough of using it (albeit with an illegal address error instead of assert(), but the process with assert() will be similar).
3. It should also be possible to use a debugger. If you use a debugger such as cuda-gdb (e.g. on Linux), the debugger will provide a backtrace that indicates which line the assert is on when it is hit.

Both cuda-memcheck and the debugger can be used if the CUDA code is launched from a python script.
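For method 1, when the kernels are launched from Python rather than from your own CUDA source, one way to get synchronous launches without editing any kernel code is the `CUDA_LAUNCH_BLOCKING=1` environment variable, which has to be set before CUDA is initialized. The following is a minimal sketch of that idea, not part of the original answer: the helper names `blocking_env` and `run_synchronously` are made up for illustration, and the script path is whatever script raises the error.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous kernel launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # With this set, each kernel launch blocks until the kernel finishes,
    # so the Python traceback points at the call that actually failed.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_synchronously(script_path, *script_args):
    """Run a Python script in a child process with CUDA_LAUNCH_BLOCKING=1."""
    cmd = [sys.executable, script_path, *script_args]
    return subprocess.run(cmd, env=blocking_env())


# Hypothetical usage ("train.py" is a placeholder):
# result = run_synchronously("train.py")
# print(result.returncode)
```

Running the script in a child process keeps the variable from leaking into the current session; the same effect can be had from a shell by prefixing the command with `CUDA_LAUNCH_BLOCKING=1`. Expect the program to run more slowly while it is set.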

At this point you have discovered what the assert is and where in the source code it is. Why it is there cannot be answered generically. This will depend on the developer's intention, and if it is not commented or otherwise obvious, you will need to work it out from the surrounding code. The question of "how to work backwards" is also a general debugging question, not specific to CUDA. You can use printf in CUDA kernel code, and also a debugger like cuda-gdb, to assist with this (for example, set a breakpoint prior to the assert, and inspect machine state, e.g. variables, when the assert is about to be hit).

Solution 2

When I shifted my code to work on CPU instead of GPU, I got the following error:

IndexError: index 128 is out of bounds for dimension 0 with size 128

So, perhaps there might be a mistake in the code which for some strange reason comes out as a CUDA error.
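An out-of-range index like the one in that IndexError can also be caught before anything reaches the GPU. The sketch below is an illustration only, with a hypothetical helper name `find_out_of_range`. It assumes the indices are class labels or lookup indices, where negative values are also invalid, and it works on plain Python integers so that it runs without PyTorch.

```python
def find_out_of_range(indices, size):
    """Return the values in `indices` that are not valid for a dimension of length `size`."""
    return [i for i in indices if not 0 <= i < size]


# Hypothetical labels for a dimension of size 128; valid values are 0..127.
# With tensors, pass something like `labels.tolist()` instead of a list.
labels = [0, 5, 128, -1]
bad = find_out_of_range(labels, 128)
if bad:
    print(f"out-of-range indices for size 128: {bad}")
```

A check like this is cheap enough to leave in place while debugging, and it reports the offending values directly instead of a generic device-side assert.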

Solution 3

In my case, this error was caused because my loss function only accepts values in the range [0, 1], and I was passing other values.

Normalizing the input to my loss function solved it:

    # Shift each row so that its minimum becomes 0
    saida_G -= saida_G.min(1, keepdim=True)[0]
    # Scale each row so that its maximum becomes 1
    saida_G /= saida_G.max(1, keepdim=True)[0]

Read this: link
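The same row-wise min-max normalization can be written out in plain Python to make each step explicit. This is a sketch only, with a hypothetical function name `min_max_rows`. The guard for rows whose values are all equal is an addition that is not in the answer above; without it, such a row would be divided by zero.

```python
def min_max_rows(rows, eps=1e-12):
    """Rescale each row to the range [0, 1]; constant rows become all zeros."""
    out = []
    for row in rows:
        lo, hi = min(row), max(row)
        span = hi - lo
        if span < eps:
            # Every value in the row is (almost) the same, so avoid dividing by zero.
            out.append([0.0 for _ in row])
        else:
            out.append([(v - lo) / span for v in row])
    return out


# Hypothetical model outputs: one ordinary row and one constant row.
scaled = min_max_rows([[2.0, 4.0, 6.0], [3.0, 3.0]])
print(scaled)
```

Whether rescaling is the right fix depends on the loss. If the values are meant to be probabilities, applying a sigmoid to the model output may be the more appropriate choice.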


Author by

Joseph Konan

Updated on July 29, 2022

Comments

Joseph Konan almost 2 years
I have seen a lot of posts about particular case-specific problems, but no fundamental motivating explanation. What does this error:
```
RuntimeError: CUDA error: device-side assert triggered
```
mean? Specifically, what is the assert that is being triggered, why is the assert there, and how do we work backwards to debug the problem?

As-is, this error message is nearly useless for diagnosing any problem, because it is so general: it seems to say only that "some code somewhere that touches the GPU" has a problem. The CUDA documentation also does not seem helpful in this regard, though I could be wrong. https://docs.nvidia.com/cuda/cuda-gdb/index.html
Soren over 3 years

How do you use cuda-memcheck?
Robert Crovella over 3 years

Added a link to the answer. Also note where I say "You can see here for a walkthrough of using it". If you click on the word "here" at that point, you can find an example of using it.
decadenza almost 3 years

This was the fastest debug strategy!