CUDA kernel failed : no kernel image is available for execution on the device, Error when running PyTorch model inside Google Compute VM


I resolved this in the end by manually deleting every folder except "src" in the directory containing setup.py.

Then I rebuilt the Docker image.

During the image build I ran TORCH_CUDA_ARCH_LIST="6.1" python setup.py install, so that the CUDA extensions are compiled for the compute capability of the GPU on the VM (the Tesla P4 is compute capability 6.1),

and it worked!

My guess is that running setup.py again without first deleting the folders left by the previous install doesn't fully overwrite the extension, so the old binaries were still being used.
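The two steps above (clearing out the leftover folders, then rebuilding with an explicit architecture list) could be scripted roughly as below. This is only a sketch: the function names are made up for illustration, and the "pointnet2" path in the usage note is taken from the Dockerfile in the question. Only sub-directories are deleted, so setup.py and other plain files stay in place.

```python
import os
import shutil
import subprocess
import sys
from pathlib import Path


def remove_stale_build_dirs(ext_dir, keep=("src",)):
    """Delete every sub-directory of ext_dir except those named in keep.

    Plain files (for example setup.py) are left untouched.
    Returns the names of the directories that were removed.
    """
    removed = []
    for entry in sorted(Path(ext_dir).iterdir()):
        if entry.is_dir() and entry.name not in keep:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed


def rebuild_extension(ext_dir, arch_list="6.1"):
    """Run `python setup.py install` in ext_dir with TORCH_CUDA_ARCH_LIST set.

    "6.1" is the compute capability of the Tesla P4; change it for other GPUs.
    """
    env = dict(os.environ, TORCH_CUDA_ARCH_LIST=arch_list)
    subprocess.run(
        [sys.executable, "setup.py", "install"],
        cwd=ext_dir,
        env=env,
        check=True,  # raise if the build fails instead of continuing silently
    )
```

Typical use would be remove_stale_build_dirs("pointnet2") followed by rebuild_extension("pointnet2"), run before (or as part of) the image build.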

Author by

user3882675

Updated on June 04, 2022

Comments

  • user3882675
    user3882675 almost 2 years

    I have a Docker image of a PyTorch model that returns this error when run inside a Google Compute Engine VM (Debian, Tesla P4 GPU, Google Deep Learning image):

    CUDA kernel failed : no kernel image is available for execution on the device
    

    This occurs on the line where my model is called. The PyTorch model includes custom C++ extensions; I'm using this model: https://github.com/daveredrum/Pointnet2.ScanNet

    My image installs these at runtime

    The image runs fine on my local system. Both VM and my system have these versions:

    Cuda compilation tools 10.1, V10.1.243

    torch 1.4.0

    torchvision 0.5.0

    The main difference is the GPU as far as I'm aware

    Local:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 960M    Off  | 00000000:01:00.0 Off |                  N/A |
    | N/A   36C    P8    N/A /  N/A |    361MiB /  2004MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    

    VM:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
    | N/A   42C    P0    23W /  75W |      0MiB /  7611MiB |      3%      Default |
    +-------------------------------+----------------------+----------------------+
    

    If I ssh into the VM, torch.cuda.is_available() returns True.

    Therefore I suspect the problem is in how the extensions are compiled.

    This is the relevant part of my docker file:

    ENV CUDA_HOME "/usr/local/cuda-10.1"
    ENV PATH /usr/local/nvidia/bin:/usr/local/cuda-10.1/bin:${PATH}
    ENV NVIDIA_VISIBLE_DEVICES all
    ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
    ENV NVIDIA_REQUIRE_CUDA "cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419"
    ENV FORCE_CUDA=1
    
    # CUDA 10.1-specific steps
    RUN conda install -c open3d-admin open3d
    RUN conda install -y -c pytorch \
        cudatoolkit=10.1 \
        "pytorch=1.4.0=py3.6_cuda10.1.243_cudnn7.6.3_0" \
        "torchvision=0.5.0=py36_cu101" \
     && conda clean -ya
    RUN pip install -r requirements.txt
    RUN pip install flask
    RUN pip install plyfile
    RUN pip install scipy
    
    
    # Install OpenCV3 Python bindings
    RUN sudo apt-get update && sudo apt-get install -y --no-install-recommends \
        libgtk2.0-0 \
        libcanberra-gtk-module \
        libgl1-mesa-glx \
     && sudo rm -rf /var/lib/apt/lists/*
    
    RUN dir
    RUN cd pointnet2 && python setup.py install
    RUN cd ..
    

    I have already tried re-running this line over ssh in the VM:

    TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0" python setup.py install
    

    Which I think targets the installation to the Tesla P4 compute capability?

    Is there some other setting or troubleshooting step I can try?

    I didn't know anything about Docker, VMs, or PyTorch extensions until a couple of days ago, so I'm somewhat shooting in the dark. Also, this is my first Stack Overflow post, so apologies if I'm not following some etiquette; feel free to point it out.
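On the question of whether "6.0 6.1 7.0" covers the Tesla P4: the P4 is compute capability 6.1 and the GTX 960M is 5.0, so a small check like the one below can show which builds would contain a usable kernel image. It is a sketch under stated assumptions. The function names are hypothetical. The parser handles only numeric entries with an optional "+PTX" suffix. The compatibility rules are the general CUDA ones: compiled binary code runs on the same major version with an equal or higher minor version, and embedded PTX can be JIT-compiled for any equal or higher capability. That a local build would target only 5.0 is a guess, not something stated in the post.

```python
def parse_arch_list(value):
    """Parse a TORCH_CUDA_ARCH_LIST-style string such as "6.0 6.1 7.0+PTX".

    Entries may be separated by spaces or semicolons.
    Returns a list of ((major, minor), has_ptx) tuples.
    """
    targets = []
    for item in value.replace(";", " ").split():
        has_ptx = item.endswith("+PTX")
        version = item[: -len("+PTX")] if has_ptx else item
        major, minor = version.split(".")
        targets.append(((int(major), int(minor)), has_ptx))
    return targets


def has_kernel_image(device_cc, arch_list):
    """Return True if a build for arch_list can run on a GPU of capability device_cc."""
    for built_cc, has_ptx in parse_arch_list(arch_list):
        # Binary (SASS) code: same major version, built minor <= device minor.
        binary_ok = built_cc[0] == device_cc[0] and built_cc[1] <= device_cc[1]
        # PTX: can be JIT-compiled for any device at or above the built capability.
        ptx_ok = has_ptx and built_cc <= device_cc
        if binary_ok or ptx_ok:
            return True
    return False


TESLA_P4 = (6, 1)   # GPU on the VM
GTX_960M = (5, 0)   # local GPU

print(has_kernel_image(TESLA_P4, "5.0"))          # False
print(has_kernel_image(TESLA_P4, "6.0 6.1 7.0"))  # True
print(has_kernel_image(GTX_960M, "6.1"))          # False
```

By this check, the arch list in the question is already right for the P4. That fits the eventual fix: the target was correct, but the rebuilt extension was not replacing the previously built one.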

    • Alexandre
      Alexandre about 4 years
      I would need to know how you are running docker, is it on a single instance, or a cluster? Did you get the Image from the Docker repository? I also need to know if you got your CUDA driver from the NVIDIA dev site? Or was the driver included in the image?