NVRM: Xid: 79, GPU has fallen off the bus

According to NVIDIA's Xid error list (see the PDF), Xid 79 (GPU has fallen off the bus) can have a variety of causes, such as a driver or hardware issue, system memory corruption, a bus error, or a thermal issue (overheating).
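
To see which Xid codes the driver has actually reported, the kernel log can be scanned for `NVRM: Xid` lines. The following is a minimal sketch, assuming the log line format shown in the dmesg output quoted in the question; the names `XID_RE` and `find_xid_events` are hypothetical helpers, not part of any NVIDIA tool, and other driver versions may format these lines differently.

```python
import re

# Matches lines such as (format taken from the dmesg output in the question):
#   [  159.602609] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]+)\): (?P<code>\d+),\s*(?P<msg>.*)")


def find_xid_events(log_text):
    """Return a list of (device, xid_code, message) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("dev"), int(match.group("code")), match.group("msg").strip())
            )
    return events
```

The function takes the log as a string, so it can be fed the saved output of `dmesg` (reading the kernel log may require root on some systems).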

Run the NVIDIA X Server Settings app (which ships with the drivers) and check the temperature, graphics clock, performance level, and GPU utilization.
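
The same readings can also be collected from a terminal, which helps when X is not running or when you want to log values right up to the crash. This is a rough sketch, assuming `nvidia-smi` is on the PATH and supports the `--query-gpu` fields listed in `FIELDS` (they are listed by `nvidia-smi --help-query-gpu`); `parse_gpu_csv` and `query_gpu` are hypothetical helper names.

```python
import subprocess

# Query fields passed to `nvidia-smi --query-gpu`.
FIELDS = ["temperature.gpu", "clocks.gr", "pstate", "utilization.gpu", "power.draw"]


def parse_gpu_csv(line):
    """Turn one CSV line of nvidia-smi output into a {field: value} dict."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(FIELDS, values))


def query_gpu(index=0):
    """Run nvidia-smi once for the given GPU index and return the parsed readings."""
    cmd = [
        "nvidia-smi",
        f"--id={index}",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out.splitlines()[0])
```

Calling `query_gpu()` in a loop with a short sleep and appending the result to a file gives a record of temperature and power draw leading up to the failure. All values are kept as strings, since some fields can report `[Not Supported]` on some cards.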

The following post (based on this original thread) suggests disabling PCIe ASPM (Active State Power Management) by adding the boot parameter pcie_aspm=off, which forcibly disables PCIe ASPM.
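
On a GRUB-based Ubuntu system the parameter is typically added to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by `sudo update-grub` and a reboot; this is an assumption about the setup, not something the linked post is quoted as saying. After rebooting, it is worth confirming that the running kernel actually received the parameter. A minimal sketch, with hypothetical helper names `aspm_disabled` and `read_cmdline`:

```python
def aspm_disabled(cmdline):
    """Return True if the kernel command line contains pcie_aspm=off."""
    return "pcie_aspm=off" in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()
```

On a Linux machine, `aspm_disabled(read_cmdline())` should return `True` once the change has taken effect.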

Related bug report: GPU has fallen off the bus.

Author: Mitar

Updated on September 18, 2022

Comments

  • Mitar
    Mitar over 1 year

    I am trying to do some deep learning on my GeForce GTX 980 Ti GPU. I have a 658W power supply, but when I start running TensorFlow, I get the following error in dmesg:

    [  158.598263] ata2: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
    [  158.598268] ata2: irq_stat 0x00400040, connection status changed
    [  158.598271] ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }
    [  158.598277] ata2: hard resetting link
    [  159.602605] NVRM: GPU at PCI:0000:01:00: GPU-e29ec6c5-5146-95c4-f09c-68b96546640b
    [  159.602609] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
    
    [  159.602613] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
    [  159.602623] NVRM: A GPU crash dump has been created. If possible, please run
                   NVRM: nvidia-bug-report.sh as root to collect this data before
                   NVRM: the NVIDIA kernel module is unloaded.
    [  164.230199] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [  164.237244] ata2.00: configured for UDMA/133
    [  164.237248] ata2: EH complete
    

    It looks like a small power surge that knocks out both my hard drive and my graphics card. So I wonder whether I could ramp up the GPU gradually, so that its power draw increases slowly and does not create this surge?

    I use Ubuntu 16.04.1 with the 4.8.0-34-generic kernel and version 375.26 of the NVIDIA driver.

    nvidia-smi 
    Tue Feb  7 15:02:47 2017       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 980 Ti  Off  | 0000:01:00.0     Off |                  N/A |
    |  0%   42C    P0    56W / 275W |      0MiB /  6077MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    

    I tried connecting the GPU to its own power supply (an older 750W unit that I cannot use directly with this motherboard), but a similar thing happens:

    [   81.865432] NVRM: GPU at PCI:0000:01:00: GPU-e29ec6c5-5146-95c4-f09c-68b96546640b
    [   81.865437] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
    
    [   81.865474] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
    [   81.865484] NVRM: A GPU crash dump has been created. If possible, please run
                   NVRM: nvidia-bug-report.sh as root to collect this data before
                   NVRM: the NVIDIA kernel module is unloaded.
    

    And the extra power supply turns off. So it seems they really do not like it when the GPU gets activated.

    • guest
      guest about 7 years
      same problem. hard drive comes back quickly, graphics card stays dead
    • sawdust
      sawdust about 7 years
      "It seems like a small power surge..." -- Seems like you're making a WAG without any corroboration. Hence you're asking an XY question. Did you even try collecting the crash dump for a bug report? The odds are that this issue has nothing to do with power. I.E. see cyberciti.biz/faq/…
    • Mitar
      Mitar about 7 years
      I do have the newest driver and a pretty recent kernel, and I tried persistence mode. There are also ATA issues at the same time. That is why I am guessing it is a power surge: I have tried almost everything else I could think of. But feel free to propose other things to try.
    • Mitar
      Mitar about 7 years
      And yes, I collected the crash dump, but I can only send it to Nvidia. It is not really useful for me. It seems they encode/encrypt it in some way.
    • Reuben Morais
      Reuben Morais about 7 years
      I'm running into the same problem with Titan X Pascal cards. Did you find a solution?
    • Mitar
      Mitar about 7 years
      No. I gave up for now on it.
  • kensai
    kensai almost 4 years
    This didn't work in my case, but the system was more stable. I only started X via startx, and after those 3 main windows appeared I switched to another terminal console (Ctrl+Alt+Fx) and logged in. It was not stable for about 10-15 minutes (for the first time), but when I switched back to X (Ctrl+Alt+F7), the crash appeared.