CUDA runtime error (59) : device-side assert triggered


Solution 1

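In general, when you encounter CUDA runtime errors, rerun your program with the environment variable CUDA_LAUNCH_BLOCKING=1 set to obtain an accurate stack trace. CUDA kernels are launched asynchronously, so without it the error is often reported at a later, unrelated call.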
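From a shell, the variable can be set for a single run with `CUDA_LAUNCH_BLOCKING=1 python main.py`. As a minimal sketch, it can also be set from Python, assuming this code sits at the very top of the entry script and nothing has imported torch yet:

```python
import os

# Assumption: this runs before torch is imported anywhere in the process,
# so the CUDA runtime sees the variable when it initializes.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch (and anything that imports it) only after this point
```

Synchronous launches slow training down, so this is best treated as a debugging setting and removed once the failing line is found.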

In your specific case, the target labels in your data fell outside the valid range for the specified number of classes (too high or too low). Valid class indices run from 0 to num_classes - 1.
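One way to confirm this before the data reaches the GPU is to scan the targets for values outside the valid range. The helper below is a hypothetical sketch (the name `find_invalid_labels` and the sample labels are not from the original post) and works on plain Python integers:

```python
def find_invalid_labels(labels, num_classes):
    """Return the sorted distinct label values outside [0, num_classes)."""
    return sorted({int(y) for y in labels if not 0 <= int(y) < num_classes})


# Hypothetical labels for a 10-class problem such as CIFAR10.
labels = [0, 3, 9, 10, -1, 10]
bad = find_invalid_labels(labels, num_classes=10)
print(bad)  # [-1, 10]
```

For a PyTorch target tensor, passing `target.tolist()` should work, since the helper only needs an iterable of integers.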

Solution 2

I have encountered this problem several times, and each time it turned out to be an indexing issue.

For example, if your ground truth labels start at 1, as in target = [1,2,3,4,5], subtract 1 from every label so that it becomes [0,1,2,3,4].

This solves my problem every time.
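For labels that simply start at 1, `[t - 1 for t in target]` is enough. A slightly more general sketch, assuming the labels are arbitrary distinct values that should become contiguous indices (the helper `remap_labels` is hypothetical), could look like this:

```python
def remap_labels(labels):
    """Map the distinct label values to contiguous indices 0..n-1."""
    mapping = {value: index for index, value in enumerate(sorted(set(labels)))}
    return [mapping[value] for value in labels], mapping


target = [1, 2, 3, 4, 5]
zero_based, mapping = remap_labels(target)
print(zero_based)  # [0, 1, 2, 3, 4]
```

Note that this builds the mapping from the values actually present, so a class that never appears in the data gets no index. Keep the returned mapping if predictions need to be translated back to the original labels.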

Solution 3

I encountered this error when running a model loaded with BertModel.from_pretrained('bert-base-uncased'). Moving to the CPU changed the error message to 'IndexError: index out of range in self', which led me to this post. The solution was to truncate sentences to a length of 512 tokens, the maximum number of positions this model supports.
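Hugging Face tokenizers can do this directly when called with `truncation=True, max_length=512`. To illustrate the idea without any library, here is a minimal sketch that truncates a list of token ids by hand; the helper `truncate_ids` and the sample ids are hypothetical:

```python
MAX_LEN = 512  # bert-base-uncased has 512 position embeddings


def truncate_ids(token_ids, max_len=MAX_LEN):
    """Cut a token-id list to max_len, keeping the final token (typically [SEP])."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    return list(token_ids[: max_len - 1]) + [token_ids[-1]]


# 101 is [CLS] and 102 is [SEP] in the bert-base-uncased vocabulary;
# 2000 is just a placeholder id repeated to build an over-long sequence.
ids = [101] + [2000] * 600 + [102]
short = truncate_ids(ids)
print(len(short), short[0], short[-1])  # 512 101 102
```

Keeping the last token is a design choice made here so that the sequence still ends with [SEP]; plain slicing with `ids[:512]` would drop it.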

Solution 4

One way to trigger the "CUDA error: device-side assert triggered" RuntimeError is to index into a GPU torch.Tensor with a list that contains out-of-range indices.

This snippet, which uses a plain integer index, raises an IndexError with the message "IndexError: index 3 is out of bounds for dimension 0 with size 3", not the CUDA error:

import torch
data = torch.randn((3, 10), device=torch.device("cuda"))
data[3, :]  # integer index 3 is out of range for dimension 0 (size 3)

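whereas this one, which uses a list of indices, raises the CUDA "device-side assert triggered" RuntimeError:

import torch
data = torch.randn((3, 10), device=torch.device("cuda"))
indices = [1, 3]  # 3 is out of range for dimension 0 (size 3)
data[indices, :]  # the bounds check happens on the device, hence the assert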

For class labels, as in the answer by @Rainy, this could mean that when the labels start from 1 rather than 0, it is the final class label (i.e. label == num_classes) that causes the error.

Also, when the device is "cpu", the error thrown is an IndexError like the one from the first snippet.
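Because the device-side assert gives no hint about which index was wrong, it can help to check an index list on the host before using it on a GPU tensor. A minimal sketch, with a hypothetical helper `out_of_range` that is not part of PyTorch:

```python
def out_of_range(indices, size):
    """Return the indices that are invalid for a dimension of the given size."""
    # Negative indices down to -size are valid, as in Python and PyTorch indexing.
    return [i for i in indices if not -size <= i < size]


indices = [1, 3]
bad = out_of_range(indices, size=3)  # size 3 matches data.shape[0] in the snippets above
print(bad)  # [3]
```

In real code the size would come from the tensor itself, for example `data.shape[0]`.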

Solution 5

I got this error when one of my labels had an invalid value.

Author: saichand

Enthusiastic programmer. Keen to learn new fields in computer science.

Updated on July 08, 2022

Comments

  • saichand
    saichand almost 2 years

    I have access to Tesla K20c, I am running ResNet50 on CIFAR10 dataset... Then I get the error as:

    THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu line=265 error=59 : device-side assert triggered
    Traceback (most recent call last):
      File "main.py", line 109, in <module>
        train(loader_train, model, criterion, optimizer)
      File "main.py", line 54, in train
        optimizer.step()
      File "/usr/local/anaconda35/lib/python3.6/site-packages/torch/optim/sgd.py", line 93, in step
        d_p.add_(weight_decay, p.data)
    RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:265
    

    How to resolve this error?

  • Christian
    Christian about 5 years
    I can confirm, this was also the cause of error in my case. For example, valid text labels have been converted to 0..n-1 (n being the number of classes). However, NaN values were converted to -1, which sent it off the rails.
  • Kunj Mehta
    Kunj Mehta over 4 years
    @Rainy can you elaborate on "ground truth label starts at 1". What do you mean by that? I gather that the labels are 1 to 5 and to overcome the error the first value in the error should be zero. Am I right?
  • Chandra
    Chandra over 4 years
    @KunjMehta, Not just first value should be zero. Class index should start from zero. e.g. for 6 classes, index values should be from 0 to 5.
  • Eric Wiener
    Eric Wiener over 3 years
    To add to this, once you get a more accurate stack trace and locate where the issue is, you can move your tensors to CPU. Moving the tensors to CPU will give much more detailed errors. Combining CUDA_LAUNCH_BLOCKING=1 with moving the tensors to CPU was the only way I was able to solve a problem I spent 3 days on.
  • Nihat
    Nihat over 3 years
    I get the error even though I have the setup you offer
  • Oras
    Oras over 2 years
    saved my day! Thank you.
  • Vinod Kumar Chauhan
    Vinod Kumar Chauhan about 2 years
    Even in my case, the issue was with the invalid value of labels as I forgot to put activation in the last layer. Thanks!