Random freezes on Ubuntu 18.04 with error: "watchdog: BUG: soft lockup - CPU#11 stuck for 22s" followed by "NMI watchdog hard LOCKUP"


Solution 1

From the comments...

You might check with AMD, as they had some Ryzen processor recalls due to Linux problems. See this bug report: bugzilla.kernel.org/show_bug.cgi?id=196683

I looked at that lengthy bug report, and it looks like software solutions are all over the place... some with luck... some without. I'd contact AMD and see if they'll replace your processor.

Solution 2

If your motherboard BIOS offers it (I have an Asus Prime X370 Pro motherboard), try switching the C6 power functionality from automatic to manual and disabling it.

There are kernel options if you don't have the option in your BIOS/UEFI. You need to check whether they apply to your kernel version, though, because later kernel versions have removed some or all of this functionality. CONFIG_RCU_NOCB_CPU_ALL is reportedly gone, but its behaviour may be available again through the kernel debugging option RCU_NOCB_CPU. That option should be enabled, as should the RCU_EXPERT kernel option, which enables RCU_NOCB_CPU. Without these options there can be no software workaround.
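One way to check whether a running kernel was built with these options is to look at its build configuration. The sketch below assumes the configuration is stored at /boot/config-<kernel release>, as it usually is on Ubuntu; the helper name `kernel_option_enabled` is made up for this illustration, and it only looks for options set to `y`.

```python
import platform
from pathlib import Path


def kernel_option_enabled(config_text: str, option: str) -> bool:
    """Return True if CONFIG_<option>=y appears in the kernel config text."""
    wanted = f"CONFIG_{option}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


if __name__ == "__main__":
    # Assumption: the distribution ships the build config under /boot.
    config_path = Path(f"/boot/config-{platform.release()}")
    try:
        text = config_path.read_text()
    except OSError as exc:
        print(f"Could not read {config_path}: {exc}")
    else:
        for opt in ("RCU_EXPERT", "RCU_NOCB_CPU"):
            state = "enabled" if kernel_option_enabled(text, opt) else "not set"
            print(f"CONFIG_{opt}: {state}")
```

If either option shows as not set, the rcu_nocbs workaround is unlikely to do anything on that kernel, and the BIOS/UEFI setting is the better route.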

If you cannot disable the C6 functionality in the BIOS / UEFI, add the following to the kernel command line. For 12 thread CPUs:

rcu_nocbs=0-11

For 16 thread CPUs:

rcu_nocbs=0-15
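The range simply covers every hardware thread, numbered from 0. As a rough sketch, the value for any thread count can be built like this (the function name `rcu_nocbs_param` is hypothetical, and `os.cpu_count()` is assumed to report the number of logical CPUs):

```python
import os


def rcu_nocbs_param(threads: int) -> str:
    """Build an rcu_nocbs= kernel parameter covering CPUs 0..threads-1."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if threads == 1:
        return "rcu_nocbs=0"
    return f"rcu_nocbs=0-{threads - 1}"


if __name__ == "__main__":
    # os.cpu_count() may return None, so fall back to a single CPU.
    print(rcu_nocbs_param(os.cpu_count() or 1))
```

On Ubuntu the parameter would normally go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by running update-grub and rebooting.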

Essentially, as I understand it, Linux requests that the system reduce the CPU voltage too far, and the motherboard allows it, which results in Linux locking up.

Symptoms include: unresponsive keyboard / mouse input, whatever is on the screen freezes there, and the system is unresponsive to ssh, although it will still respond to ping. If there is sound playing, the last of the audio buffer will play out, repeat 2-3 times and then stop. There is nothing in /var/log/messages. This may happen once or twice a month, at totally unpredictable times, normally when I am surfing the net.

If you can, try to disable this in your motherboard BIOS / UEFI, as the hardware shouldn't allow the system to drop the power this low. The software kernel option is complicated, since it depends on kernel changes.

This issue has been bothering me for years, but I was too busy and it hasn't been frequent enough for me to spend time resolving it. This week, after a 2.5 hour fsck, I'd had enough. Since disabling the features in the UEFI, the problem hasn't recurred.

Author: smooshie

Updated on September 18, 2022

Comments

  • smooshie
    smooshie over 1 year

    About once a week, my PC will completely freeze up. I can't ssh into it, the mouse will work for a few seconds and then stop, REISUB doesn't work, the only solution is a hard reboot.

    I can't find anything significant present in any logs, but if I happen to be in a virtual terminal when the freeze occurs, the following messages pop up:

    [image: screenshot of the console messages]

    I've searched for that error, but most people reporting it seem to be getting it on boot or install, mine just randomly happens.

    I'm running a dual-boot system: Windows 10 & Ubuntu 18.04. AMD Ryzen 7 CPU, NVIDIA 1060 6 GB GPU.

    • Boris Hamanov
      Boris Hamanov over 5 years
      You might check with AMD, as they had some Ryzen processor recalls due to Linux problems.
    • smooshie
      smooshie over 5 years
      Aha, that's gotta be it! bugzilla.kernel.org/show_bug.cgi?id=196683 matches my complaints exactly: the lack of logging, the random hanging. It even explains why it tends to hang when I've got nothing significant running. The recommended workaround seems to be disabling a "C6 state" or "Typical Current Idle" setting. I'll try disabling the C6 setting and see what happens. In the meantime, what should I do with this question?
    • Boris Hamanov
      Boris Hamanov over 5 years
      I looked at that lengthy bug report, and it looks like software solutions are all over the place... some with luck... some without. I'd contact AMD and see if they'll replace your processor. I've briefly summarized our comments in an answer. If you believe that I've led you down the correct path, please accept the answer. Thanks!
  • smooshie
    smooshie over 5 years
    Aight, I read through the thread and turned off the C6 state on the CPU using zenstates.py. Will see if that works; if not, time to call AMD. Thanks for the direction, I would not have suspected a bug like that!
  • smooshie
    smooshie almost 5 years
    Noticed this question is getting a lot of views. As an update, disabling the C6 state did indeed work; I have not gotten any further freezes to date.