Regular freezing on Ryzen-based system, 16.04 LTS and newer kernel

16.04 kernel freeze crash amd-processor

7,209

Solution 1

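I had the same problem. Setting the CPU frequency governor to performance solved it for me.

Set the governor for the current session (cpufreq-set comes from the cpufrequtils package, so install that first if the command is missing):

sudo cpufreq-set -r -g performance

Make the setting persistent across reboots: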

sudo apt-get install cpufrequtils
echo 'GOVERNOR="performance"' | sudo tee /etc/default/cpufrequtils
sudo systemctl disable ondemand
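
To confirm that the governor really changed on every core, you can read it back from sysfs. This is only a minimal sketch, assuming the standard cpufreq sysfs layout under /sys/devices/system/cpu; the function name show_governors and its optional base-directory argument are made up for illustration.

```shell
#!/bin/sh
# show_governors [BASE]
# Hypothetical helper: print the active cpufreq governor for every core.
# BASE defaults to the real sysfs tree; overriding it is only useful for testing.
show_governors() {
    base="${1:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip the unexpanded glob when no core exposes cpufreq.
        [ -r "$f" ] || continue
        # f is BASE/cpuN/cpufreq/scaling_governor; extract "cpuN".
        cpu="$(basename "$(dirname "$(dirname "$f")")")"
        printf '%s: %s\n' "$cpu" "$(cat "$f")"
    done
}

show_governors
```

If every line ends in performance after a reboot, the boot-time setting took effect. If some cores still report ondemand, the ondemand service is probably still overriding it.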

Solution 2

I had much the same problem as you, on a Ryzen 1800X.

I suggest that you:

Re-enable SMT - No need to disable it.

Go back to the stock kernel for Ubuntu 16.04, which is currently 4.4.0-93.

Disable all "power saving" Global C-State options in the BIOS.

Disable the Cool'n'Quiet option as well.

Increase the SoC voltage to 1.1 V for stability. This is recommended in this video: https://www.hardocp.com/news/2017/05/01/how_to_stabilize_your_amd_ryzen_memory_cpu_overclocking_attempts

The above recommendation applies whether you are stressing the CPU or idling.

Download the latest AMD drivers for your card from the AMD website. You can also try the latest open-source drivers via "Additional Drivers" under "Software & Updates". I recommend this option first.

Before doing any of the above, reset the BIOS to its defaults and check whether a newer version is available.
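
After changing these settings, it can help to check from the running system what actually took effect. This is a rough sketch, not a definitive test: list_idle_states is a hypothetical helper name, and whether the BIOS C-state options change the list of idle states the kernel reports depends on the firmware.

```shell
#!/bin/sh
# list_idle_states [BASE]
# Hypothetical helper: print the idle states (C-states) the kernel exposes
# for cpu0. BASE defaults to the real sysfs tree; overriding it is only
# useful for testing.
list_idle_states() {
    base="${1:-/sys/devices/system/cpu}"
    for d in "$base"/cpu0/cpuidle/state[0-9]*; do
        [ -r "$d/name" ] || continue
        printf '%s: %s\n' "$(basename "$d")" "$(cat "$d/name")"
    done
}

# Running kernel release, to confirm which kernel you booted.
uname -r

# Threads per core: on a Ryzen CPU, 2 means SMT is on and 1 means it is off.
if command -v lscpu >/dev/null 2>&1; then
    lscpu | grep 'Thread(s) per core' || echo "lscpu did not report threads per core"
fi

list_idle_states
```

Comparing the output before and after the BIOS changes shows whether they are visible to the kernel.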

ankit7540

Updated on September 18, 2022

Comments

ankit7540 over 1 year

I am running a Ryzen 1700X CPU and doing computations. Every now and then the system crashes while running 16.04 LTS (kernel 4.10). The system does not reboot. There is no signal on the display, the keyboard and mouse do not respond, and I cannot connect via SSH.

I saved the kern.log and syslog files while running 16.04 LTS.

After reading several posts about issues with the new architecture, I decided to try a more recent kernel and moved to 4.12.8 (dated 16 Aug 2017) from here. I used this post on AskUbuntu to update the kernel. The system booted fine and my application ran fine for ~10 hours.

After ~11 hours the system crashed again, with the same messages in the syslog as seen with kernel 4.10 on 16.04 LTS, given below. {Kernel and syslog files with the 4.12 kernel: kern.log with new kernel and syslog with new kernel}

Aug 18 17:27:13 vriksha systemd[1]: Starting Cleanup of Temporary Directories...
Aug 18 17:27:13 vriksha systemd-tmpfiles[4661]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
Aug 18 17:27:13 vriksha systemd[1]: Started Cleanup of Temporary Directories.
Aug 18 17:28:25 vriksha ntpd[1516]: 209.242.224.117 local addr 192.168.2.15 -> <null>
Aug 18 17:35:01 vriksha CRON[4821]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 17:35:40 vriksha systemd[1]: Started Session 5 of user vani.
Aug 18 17:42:18 vriksha sensord: Chip: amdgpu-pci-2700
Aug 18 17:42:18 vriksha sensord: Adapter: PCI adapter
Aug 18 17:42:18 vriksha sensord:   fan1: 1423 RPM
Aug 18 17:42:18 vriksha sensord:   temp1: 43.0 C
Aug 18 17:42:18 vriksha sensord: Chip: asus-isa-0000
Aug 18 17:42:18 vriksha sensord: Adapter: ISA adapter
Aug 18 17:42:18 vriksha sensord:   cpu_fan: 0 RPM
Aug 18 17:45:01 vriksha CRON[6142]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 17:55:01 vriksha CRON[6431]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 18:05:01 vriksha CRON[6607]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 18:09:52 vriksha kernel: [ 3459.913711] perf: interrupt took too long (2529 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Aug 18 18:12:18 vriksha sensord: Chip: amdgpu-pci-2700
Aug 18 18:12:18 vriksha sensord: Adapter: PCI adapter
Aug 18 18:12:18 vriksha sensord:   fan1: 1431 RPM
Aug 18 18:12:18 vriksha sensord:   temp1: 40.0 C
Aug 18 18:12:18 vriksha sensord: Chip: asus-isa-0000
Aug 18 18:12:18 vriksha sensord: Adapter: ISA adapter
Aug 18 18:12:18 vriksha sensord:   cpu_fan: 0 RPM
Aug 18 18:15:01 vriksha CRON[6785]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Aug 18 18:17:01 vriksha CRON[6825]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 18 18:25:01 vriksha CRON[6967]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

After the last line in the above message (in syslog) the system froze. I had to reset to reboot again. This happened again with the new kernel.
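
Most of the excerpt above is periodic output from cron and sensord, which makes it easy to miss a kernel message logged shortly before a freeze. The following is a small sketch for trimming that noise. last_relevant_lines is a hypothetical helper name, and the pattern only covers the two noisy sources visible in this excerpt.

```shell
#!/bin/sh
# last_relevant_lines FILE [COUNT]
# Hypothetical helper: drop CRON and sensord entries from a syslog-style
# file and print the last COUNT (default 20) remaining lines.
last_relevant_lines() {
    file="$1"
    count="${2:-20}"
    grep -v -E ' (CRON\[[0-9]+\]|sensord):' "$file" | tail -n "$count"
}

# Example: inspect what was logged shortly before the hang.
# last_relevant_lines /var/log/syslog 20
```

If nothing but routine entries remains, as in the excerpt above, the freeze most likely happened without the kernel getting a chance to log anything.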

System details:

CPU-1700X Ryzen, No SMT, BIOS version- 3401 dated 12/08/2017 (AGESA 1071)
RAM 32 GB
AMD RX 470 GPU 
Lubuntu 16.04 LTS, LXDE with Openbox

Can somebody help me out?

Updates

The application I am running is not using gcc, g++.

lspci output is here.
dmesg | egrep 'drm|radeon' output is here
(root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) is related to the sysstat package which I removed. The problem still exists.

glxinfo | grep -i open output for AMD RX 470 GPU is given below

glxinfo | grep -i open 
OpenGL vendor string: X.Org
OpenGL renderer string: Gallium 0.4 on AMD POLARIS10 (DRM 3.15.0 / 4.12.8-041208-generic, LLVM 4.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 17.0.7
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 3.0 Mesa 17.0.7
OpenGL shading language version string: 1.30
OpenGL context flags: (none)
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.1 Mesa 17.0.7
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.10
OpenGL ES profile extensions:

I have connected only one display to this computer. The crashes happen only when running CPU-intensive tasks for long durations. (I leave the system with its display off, controlling and checking it over an SSH connection. After 5-6 hours or so, the SSH connection becomes unavailable. When I come back to the machine, moving the mouse or pressing keys does nothing to bring the display back. A hard reset is required.)

To check whether the GPU is the cause, I switched to an nVidia GTX 1080, for which I installed the proprietary driver. Under similar load the system still freezes. I changed back to the AMD GPU and the problem persists there too. I therefore rule out the GPU type as the cause of this behavior. For the nVidia card the glxinfo | grep -i open output is the following:
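
Because the machine runs unattended when it freezes, a simple heartbeat file can narrow down when the hang occurred. This is only a sketch: log_heartbeat and the heartbeat.log file name are hypothetical, and it assumes /proc/loadavg is available, as it is on any Linux system.

```shell
#!/bin/sh
# log_heartbeat FILE
# Hypothetical helper: append one line with the current time and the load
# averages, then flush it to disk so it survives a hard reset.
log_heartbeat() {
    file="$1"
    load="$(cut -d' ' -f1-3 /proc/loadavg 2>/dev/null || echo 'n/a')"
    printf '%s load=%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$load" >> "$file"
    sync
}

# Example: one heartbeat per minute, started before the long computation.
# while true; do log_heartbeat "$HOME/heartbeat.log"; sleep 60; done
```

After the hard reset, the last line in the file bounds the time of the freeze to within one interval.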

OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: GeForce GTX 1080/PCIe/SSE2
OpenGL core profile version string: 4.5.0 NVIDIA 384.81
OpenGL core profile shading language version string: 4.50 NVIDIA
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 4.5.0 NVIDIA 384.81
OpenGL shading language version string: 4.50 NVIDIA
OpenGL context flags: (none)
OpenGL profile mask: (none)
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 384.81
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
OpenGL ES profile extensions:

Updated the BIOS to version 3401 (12/08/2017, AGESA 1071) and the problem persists.

ankit7540 over 6 years

I disabled SMT intentionally, since the application(s) I use may suffer from cache misses, which in turn may affect the numerical accuracy of the results. This scenario arises in high-performance computing when parallel computations run for long durations.

ankit7540 over 5 years

I tried this. After running sudo systemctl disable ondemand, I received:

ondemand.service is not a native service, redirecting to systemd-sysv-install
Executing /lib/systemd/systemd-sysv-install disable ondemand
insserv: warning: current start runlevel(s) (empty) of script ondemand overrides LSB defaults (2 3 4 5).
insserv: warning: current stop runlevel(s) (2 3 4 5) of script ondemand overrides LSB defaults (empty).

Is this normal?