How to Fix "RuntimeError: CUDA error: device-side assert triggered" in Pytorch

17,212

Solution 1

I also had this issue while training resnet model on google colab. So in my case I was training the model on 7 classes but the last layer of my network was set to output 3 classes.

Hence I changed

self.classifier = torch.nn.Sequential(torch.nn.BatchNorm1d(512), torch.nn.Linear(512, 3))

to this

self.classifier = torch.nn.Sequential(torch.nn.BatchNorm1d(512), torch.nn.Linear(512, 7))

After doing that also, i was still getting the error, because I didn't restart the google colab.

Remember whenever you get this error, check two things:

  1. class_labels should start from 0 i.e. in my case [0,1,2,3,4,5,6] for 7 classes.
  2. Check whether the final output layer outputs the exact number of classes.

And after that,

Refresh the notebook to flush all the cuda asserts.

After any of the CUDA error, restart the notebook, otherwise you will keep on getting the CUDA error because the earlier assertion hasn't been flushed out. By restarting the notebook, you will flush out all the cuda assertions.

Solution 2

Usually when you get mysterious CUDA errors, you should switch to CPU and see if you get more meaningful error messages there.

Alternatively, set CUDA_LAUNCH_BLOCKING=1 to get a more informative stack trace (see this answer for details).

See this answer for more details.


I guess, in your case, it seems like your division by 2 creates fractions where pytorch looks for integers. Try

b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] // 2, box1[:, 0] + box1[:, 2] // 2
Share:
17,212

Related videos on Youtube

almonzer
Author by

almonzer

Updated on June 04, 2022

Comments

  • almonzer
    almonzer almost 2 years

    I am trying to train the yolo-v3 model from this repo https://github.com/eriklindernoren/PyTorch-YOLOv3 on my custom dataset of shapes, but I keep getting the error "RuntimeError: CUDA error: device-side assert triggered"

    I have tried to lookup the solution and tried several things suggested in different answers (like fixing the indexing of the classes in the annotations) but the error persists.

    I am following the description in the readme of the repo to train on a custom dataset, and have adjusted custom.data and the data/custom/ accordingly.

    I keep receiving this output.

    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [32,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [33,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [34,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [35,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [36,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [37,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [38,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [39,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [40,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [41,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [42,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [43,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [44,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [45,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [4,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [5,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [6,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [7,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [12,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [13,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [14,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [15,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [20,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [21,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [22,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [23,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [24,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [25,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [26,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [27,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [28,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [29,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    C:/w/1/s/windows/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: block: [0,0,0], thread: [31,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    Traceback (most recent call last):
      File "train.py", line 105, in <module>
        loss, outputs = model(imgs, targets)
      File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
        result = self.forward(*input, **kwargs)
      File "D:\Documents\GP\Code\TorchYolo\PyTorch-YOLOv3\models.py", line 259, in forward
        x, layer_loss = module[0](x, targets, img_dim)
      File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
        result = self.forward(*input, **kwargs)
      File "D:\Documents\GP\Code\TorchYolo\PyTorch-YOLOv3\models.py", line 188, in forward
        ignore_thres=self.ignore_thres,
      File "D:\Documents\GP\Code\TorchYolo\PyTorch-YOLOv3\utils\utils.py", line 318, in build_targets
        iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)
      File "D:\Documents\GP\Code\TorchYolo\PyTorch-YOLOv3\utils\utils.py", line 199, in bbox_iou
        b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
    RuntimeError: CUDA error: device-side assert triggered
    
    

    with the only thing changing being the "2" in the array index when messing around with the train.jpg class label index

    b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
    
  • almonzer
    almonzer over 4 years
    when using the CUDA_LAUNCH_BLOCKING=1 (CUDA_LAUNCH_BLOCKING=1 python train.py --model_def config/yolov3-custom.cfg --data_config config/custom.data) I get This Error: ''' CUDA_LAUNCH_BLOCKING=1 : The term 'CUDA_LAUNCH_BLOCKING=1' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again. '''
  • Karthik Sunil
    Karthik Sunil over 3 years
    I did both your suggestion and it worked. Thanks