NVRM: Xid: 79, GPU has fallen off the bus
According to the Xid errors list (check PDF), Xid 79 (GPU has fallen off the bus) can have a variety of causes, such as a driver or hardware issue, system memory corruption, a bus error, or a thermal issue (overheating).
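To see which Xid codes the driver has reported, the kernel log can be scanned for NVRM lines. The following is a minimal sketch, assuming the log lines look like the ones quoted in the question further down; the function names `find_xid_events` and `read_kernel_log` are made up for illustration, and reading `dmesg` may require root on some systems.

```python
import re
import subprocess

# Matches kernel log lines such as:
#   NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)(?:, (.*))?")


def find_xid_events(log_text):
    """Return (pci_address, xid_code, message) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            pci, code, message = match.groups()
            events.append((pci, int(code), (message or "").strip()))
    return events


def read_kernel_log():
    """Return the current kernel log as text (hypothetical helper)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `find_xid_events(read_kernel_log())` on an affected machine should list every Xid event since boot, which helps tell a one-off Xid 79 apart from a series of different errors.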
Run the NVIDIA X Server Settings app (which comes with the drivers) and check the temperature, graphics clock, performance level and GPU utilization.
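On a headless machine the same values can be read with `nvidia-smi`. This is a rough sketch, assuming `nvidia-smi` is on the PATH and that the driver supports the `--query-gpu` fields listed below (`nvidia-smi --help-query-gpu` prints the supported names); `parse_gpu_stats` and `query_gpu_stats` are illustrative names, not part of any NVIDIA tool.

```python
import subprocess

# Field names passed to `nvidia-smi --query-gpu`.
FIELDS = ["temperature.gpu", "clocks.gr", "pstate", "utilization.gpu", "power.draw"]


def parse_gpu_stats(csv_line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [value.strip() for value in csv_line.split(",")]
    return dict(zip(FIELDS, values))


def query_gpu_stats():
    """Query every GPU once and return a list of dicts, one per GPU."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_stats(line) for line in output.splitlines() if line.strip()]
```

Running `query_gpu_stats()` in a loop while the workload starts up shows whether temperature or power draw climbs just before the GPU drops off the bus.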
The following post (based on this original thread) suggests disabling PCIe ASPM (Active State Power Management) by adding the boot parameter pcie_aspm=off, which forcibly disables PCIe ASPM.
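On Ubuntu the parameter is usually added to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by `sudo update-grub` and a reboot. Whether it took effect can then be checked from the running system. The sketch below assumes a Linux system that exposes `/proc/cmdline` and, on kernels built with ASPM support, `/sys/module/pcie_aspm/parameters/policy`; the function names are hypothetical.

```python
def aspm_disabled_on_cmdline(cmdline):
    """Return True if pcie_aspm=off appears as a kernel boot parameter."""
    return "pcie_aspm=off" in cmdline.split()


def read_text(path):
    """Read a small text file and strip surrounding whitespace."""
    with open(path) as handle:
        return handle.read().strip()


def current_aspm_state():
    """Return (disabled_on_cmdline, policy); policy is None if unavailable."""
    cmdline = read_text("/proc/cmdline")
    try:
        # The active policy is shown in brackets, e.g. "[default] performance ...".
        policy = read_text("/sys/module/pcie_aspm/parameters/policy")
    except OSError:
        policy = None
    return aspm_disabled_on_cmdline(cmdline), policy
```

If `current_aspm_state()` returns `False` as its first element after the reboot, the boot parameter did not reach the kernel and the GRUB configuration is worth rechecking.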
Related bug report: GPU has fallen off the bus.
Mitar
Updated on September 18, 2022
Comments
-
Mitar over 1 year
I am trying to do some deep-learning on my GeForce GTX 980 Ti GPU. I have a 658W power supply, but when I start running TensorFlow, I get the following error in dmesg:
[ 158.598263] ata2: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
[ 158.598268] ata2: irq_stat 0x00400040, connection status changed
[ 158.598271] ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }
[ 158.598277] ata2: hard resetting link
[ 159.602605] NVRM: GPU at PCI:0000:01:00: GPU-e29ec6c5-5146-95c4-f09c-68b96546640b
[ 159.602609] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 159.602613] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 159.602623] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 164.230199] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 164.237244] ata2.00: configured for UDMA/133
[ 164.237248] ata2: EH complete
It seems like a small power surge that knocks out both my hard drive and my graphics card. So I wonder whether I could ramp up my GPU slowly, so that it draws more power gradually and does not create this surge?
I use Ubuntu 16.04.1 with the 4.8.0-34-generic kernel and version 375.26 of the NVIDIA driver.
nvidia-smi
Tue Feb  7 15:02:47 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 0000:01:00.0     Off |                  N/A |
|  0%   42C    P0    56W / 275W |      0MiB /  6077MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I tried connecting the GPU to its own power supply (an older 750W unit which I cannot use directly with this motherboard), but a similar thing happens:
[ 81.865432] NVRM: GPU at PCI:0000:01:00: GPU-e29ec6c5-5146-95c4-f09c-68b96546640b
[ 81.865437] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 81.865474] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 81.865484] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
And the extra power supply turns off. So it seems that neither power supply likes it when the GPU gets activated.
-
guest about 7 years
Same problem. The hard drive comes back quickly, but the graphics card stays dead.
-
sawdust about 7 years
"It seems like a small power surge..." -- Seems like you're making a WAG without any corroboration. Hence you're asking an XY question. Did you even try collecting the crash dump for a bug report? The odds are that this issue has nothing to do with power. I.e. see cyberciti.biz/faq/…
-
Mitar about 7 years
I do have the newest driver and a pretty recent kernel, and I tried persistence mode. There are also ATA issues at the same time, which is why I am guessing that it is a power surge: I have tried almost everything else I could imagine. But feel free to propose other things to try.
-
Mitar about 7 years
And yes, I collected the crash dump, but I can only send it to Nvidia. It is not really useful for me. It seems they encode/encrypt it in some way.
-
Reuben Morais about 7 years
I'm running into the same problem with Titan X Pascal cards. Did you find a solution?
-
Mitar about 7 years
No. I gave up on it for now.
-
-
kensai almost 4 years
This didn't work in my case, but the system was more stable. I started X via startx, and after the 3 main windows appeared I switched to another terminal console (Ctrl+Alt+Fx) and logged in. It stayed up for about 10-15 minutes (for the first time), but when I switched back to X (Ctrl+Alt+F7), the crash appeared.