Tensorflow complains that no CUDA-capable device is detected
I finally had the idea to look for any files with 390.77 in the name.
$ locate 390.77
/usr/lib/i386-linux-gnu/libcuda.so.390.77
/usr/lib/i386-linux-gnu/libnvcuvid.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-compiler.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-encode.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-fatbinaryloader.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-ml.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-opencl.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.390.77
/usr/lib/i386-linux-gnu/vdpau/libvdpau_nvidia.so.390.77
/usr/lib/x86_64-linux-gnu/libcuda.so.390.77
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.390.77
/usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.390.77
So there they are! A closer look shows that I must have installed the newer version at some point.
$ ls /usr/lib/i386-linux-gnu/libcuda* -l
lrwxrwxrwx 1 root root 12 Nov 8 13:58 /usr/lib/i386-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 17 Nov 12 14:04 /usr/lib/i386-linux-gnu/libcuda.so.1 -> libcuda.so.390.77
-rw-r--r-- 1 root root 9179124 Jan 31 2018 /usr/lib/i386-linux-gnu/libcuda.so.390.30
-rw-r--r-- 1 root root 9179796 Jul 10 2018 /usr/lib/i386-linux-gnu/libcuda.so.390.77
Where did they come from?
$ dpkg -S /usr/lib/i386-linux-gnu/libcuda.so.390.30
libcuda1-390: /usr/lib/i386-linux-gnu/libcuda.so.390.30
$ dpkg -S /usr/lib/i386-linux-gnu/libcuda.so.390.77
dpkg-query: no path found matching pattern /usr/lib/i386-linux-gnu/libcuda.so.390.77
So the 390.77 no longer belongs to any package. Perhaps I installed the old version and had to force it to overwrite the links.
My plan is to delete the files, then reinstall the packages to set up the links to the correct version. So which packages will I need to reinstall?
$ locate 390.77|sed -e 's/390.77/390.30/'|xargs dpkg -S
Some of the files don't match anything, but the ones that do match are from these packages:
- libcuda1-390
- nvidia-opencl-icd-390
Crossing my fingers, I delete the version 390.77 files.
locate 390.77|sudo xargs rm
Then I reinstall the packages.
sudo apt-get install --reinstall libcuda1-390 nvidia-opencl-icd-390
Finally, it works!
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.Session()"
2019-02-06 22:13:59.460822: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-06 22:13:59.665756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-06 22:13:59.666205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.81GiB
2019-02-06 22:13:59.666226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-06 22:17:21.254445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-06 22:17:21.254489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-06 22:17:21.254496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-06 22:17:21.290992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3539 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
nvidia-smi
also works now.
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 nvidia-smi
Wed Feb 6 22:19:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960M Off | 00000000:01:00.0 Off | N/A |
| N/A 45C P8 N/A / N/A | 113MiB / 4046MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3212 G /usr/lib/xorg/Xorg 113MiB |
+-----------------------------------------------------------------------------+
I rebooted, and the video drivers continued to work. Hurrah!
Related videos on Youtube
Don Kirkby
Python, Java, and C# developer working in AIDS research. Hobbies include designing board games and puzzles, as well as learning Chinese. If you just want to see the codez, check out GitHub. To contact me, use Twitter or e-mail [email protected] .
Updated on June 04, 2022Comments
-
Don Kirkby almost 2 years
I'm trying to run some Tensorflow code, and I get what seems to be a common problem:
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.Session()" 2019-02-06 20:36:15.903204: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2019-02-06 20:36:15.908809: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected 2019-02-06 20:36:15.908858: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: tigris 2019-02-06 20:36:15.908868: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: tigris 2019-02-06 20:36:15.908942: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 390.77.0 2019-02-06 20:36:15.908985: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 390.30.0 2019-02-06 20:36:15.909006: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:308] kernel version 390.30.0 does not match DSO version 390.77.0 -- cannot find working devices in this configuration $
The key pieces of that error message seem to be:
[...] libcuda reported version is: 390.77.0 [...] kernel reported version is: 390.30.0 [...] kernel version 390.30.0 does not match DSO version 390.77.0 -- cannot find working devices in this configuration
How can I install compatible versions? Where is that libcuda version coming from?
Background
A few months ago, I tried installing Tensorflow with GPU support, but the versions either broke my display or wouldn't work with Tensorflow. Finally, I got it working by following a tutorial on how to install multiple versions of the CUDA libraries on the same machine. That worked at the time, but when I came back to the project after a few months, it has stopped working. I assume that some driver got upgraded during that time.
Investigation
The first thing I tried was to see what versions I have of the nvidia drivers and libcuda package.
$ dpkg --list|grep libcuda ii libcuda1-390 390.30-0ubuntu1 amd64 NVIDIA CUDA runtime library
Looks like it's 390.30. Why does the error message say that libcuda reported 390.77?
$ dpkg --list|grep nvidia ii libnvidia-container-tools 1.0.1-1 amd64 NVIDIA container runtime library (command-line tools) ii libnvidia-container1:amd64 1.0.1-1 amd64 NVIDIA container runtime library rc nvidia-384 384.130-0ubuntu0.16.04.1 amd64 NVIDIA binary driver - version 384.130 ii nvidia-390 390.30-0ubuntu1 amd64 NVIDIA binary driver - version 390.30 ii nvidia-390-dev 390.30-0ubuntu1 amd64 NVIDIA binary Xorg driver development files rc nvidia-396 396.44-0ubuntu1 amd64 NVIDIA binary driver - version 396.44 ii nvidia-container-runtime 2.0.0+docker18.09.1-1 amd64 NVIDIA container runtime ii nvidia-container-runtime-hook 1.4.0-1 amd64 NVIDIA container runtime hook ii nvidia-docker2 2.0.3+docker18.09.1-1 all nvidia-docker CLI wrapper ii nvidia-modprobe 390.30-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files rc nvidia-opencl-icd-384 384.130-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL ICD ii nvidia-opencl-icd-390 390.30-0ubuntu1 amd64 NVIDIA OpenCL ICD rc nvidia-opencl-icd-396 396.44-0ubuntu1 amd64 NVIDIA OpenCL ICD ii nvidia-prime 0.8.8.2 all Tools to enable NVIDIA's Prime ii nvidia-settings 396.44-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
Again, everything looks like it's 390.30. There were some packages that had version 390.77, but they were in the
rc
status. I guess I installed that version and later removed it, so the configuration files were left behind. I purged the configuration files with commands like this:sudo apt-get remove --purge nvidia-kernel-common-390
Now, there are no packages at all with version 390.77.
$ dpkg --list|grep 390.77 $
I tried reinstalling CUDA, to see if it had been compiled with the wrong version.
$ sudo sh cuda_9.0.176_384.81_linux.run --silent --toolkit --toolkitpath=/usr/local/cuda-9.0 --override
That didn't make any difference.
Finally, I tried running nvidia-smi.
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 nvidia-smi Failed to initialize NVML: Driver/library version mismatch $
All of this is running on Ubuntu 18.04 with Python 3.6.7, and my graphics card is NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2).