How to fix "BUG: soft lockup - CPU#0 stuck for 17163091968s"?
Solution 1
Thanks to all commenters. I think I found the answer. There seems to be a timekeeping bug in at least Ubuntu's kernel version 2.6.32-30-server. The bug sometimes (?) kills machines when they reach an uptime of about 200-210 days. The halt does not happen immediately when the limit is reached, but is triggered by some later operation (in my case: "apt-get install ...").
NB: 200 days is roughly 2^32 times 1/250 second, and 250 is the default value of CONFIG_HZ (the kernel tick rate).
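To sanity-check that arithmetic (a quick sketch of my own, not from the original post): with HZ=250, a 32-bit tick counter runs out after 2^32 ticks of 1/250 second each.

```python
# Back-of-the-envelope check: with CONFIG_HZ=250, a 32-bit tick counter
# wraps after 2**32 ticks of 1/250 s each.
HZ = 250
wrap_seconds = 2**32 / HZ          # ~17.18 million seconds
wrap_days = wrap_seconds / 86400   # convert seconds to days

print(round(wrap_days, 1))         # -> 198.8, matching the ~200-day uptime
```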
For now, I haven't found data on whether the problem has been fixed in more recent kernels. I do know that it doesn't seem to affect an older kernel (2.6.32-26-server). From all this I presume that if it's not fixed yet, it can be avoided by:
- rebooting the machines every 190 days or so (a good idea for kernel upgrades anyway), or
- setting CONFIG_HZ to 100, which pushes the limit out to about 497 days. However, this might have quite unexpected side effects, especially in virtual environments, and it only postpones the problem rather than solving it.
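To see which tick rate your own kernel was built with, and how far away the wrap point is, something like this should work (a sketch; the config file path is the Debian/Ubuntu convention and varies by distribution):

```shell
#!/bin/sh
# Read CONFIG_HZ from the running kernel's build config; fall back to the
# common default of 250 if the config file isn't available.
HZ=$(grep -m1 '^CONFIG_HZ=' "/boot/config-$(uname -r)" 2>/dev/null | cut -d= -f2)
HZ=${HZ:-250}

# Seconds until a 32-bit tick counter wraps at this HZ (2^32 / HZ).
WRAP=$(( 4294967296 / HZ ))

# Current uptime in whole seconds, for comparison.
UPTIME=$(cut -d. -f1 /proc/uptime 2>/dev/null || echo 0)

echo "HZ=$HZ, wrap after ${WRAP}s (~$(( WRAP / 86400 )) days), uptime ${UPTIME}s"
```

With HZ=100 the same arithmetic gives roughly 497 days, matching the figure above.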
Here's a bug report for Ubuntu.
Solution 2
This is actually a kernel bug that was fixed by a later kernel commit.
You can search LKML for the following thread title (I cannot post more than 2 links): "[stable] 2.6.32.21 - uptime related crashes?"
And this is the Launchpad bug that tracks the kernel fix:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/902317
Upgrading to the latest kernel in lucid-updates should fix this issue for good.
HTH
Solution 3
Could it be that the virtualisation host has some power-saving features ("Green IT") enabled that send unused cores into a low-power/sleep mode, causing interesting disruptions in the VMs using that core? I've heard this used to be a problem mainly in Hyper-V environments, but it may be something to look into.
Solution 4
In case someone else finds this: a kernel upgrade fixed a similar problem for me. I had a JBOD attached to the system through a SAS3 controller that was throwing these CPU soft lockup errors on boot.
I was on Ubuntu 14.04.2 with kernel version 3.16.0-30; running "apt -y upgrade" brought me to kernel 3.16.0-49, and that solved the problem.
tuomassalo
Updated on September 18, 2022
Comments
-
tuomassalo almost 2 years
UPDATE: I updated the title of the message, because I've recently seen more of these problems with this exact time amount of 17163091968s. This should help people investigating the symptoms to find this page. See my (self-)accepted answer below.
I have a bunch of 64-bit Ubuntu 10.04 LTS VMs in a VMware vSphere datacenter. VMware Tools is installed (vSphere Client says "OK").
I've seen some of the VMs hang a few times with the following error in syslog. When I checked the situation from vSphere, the console was black and the "Reboot guest" command didn't do anything, so I had to power cycle the VM.
Dec 1 11:44:15 s0 kernel: [18446744060.007150] BUG: soft lockup - CPU#0 stuck for 17163091988s! [jed:26674]
Dec 1 11:44:15 s0 kernel: [18446744060.026854] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs xt_tcpudp iptable_filter ip_tables x_tables acpiphp fbcon tileblit font bitblit softcursor ppdev vga16fb psmouse parport_pc shpchp vgastate i2c_piix4 lp parport serio_raw intel_agp floppy mptspi mptscsih vmw_pvscsi e1000 mptbase
Dec 1 11:44:15 s0 kernel: [18446744060.026899] CPU 0:
Dec 1 11:44:15 s0 kernel: [18446744060.026900] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs xt_tcpudp iptable_filter ip_tables x_tables acpiphp fbcon tileblit font bitblit softcursor ppdev vga16fb psmouse parport_pc shpchp vgastate i2c_piix4 lp parport serio_raw intel_agp floppy mptspi mptscsih vmw_pvscsi e1000 mptbase
Dec 1 11:44:15 s0 kernel: [18446744060.026920] Pid: 26674, comm: jed Not tainted 2.6.32-30-server #59-Ubuntu VMware Virtual Platform
Dec 1 11:44:15 s0 kernel: [18446744060.026922] RIP: 0033:[<00007f92e03d2ce6>] [<00007f92e03d2ce6>] 0x7f92e03d2ce6
Dec 1 11:44:15 s0 kernel: [18446744060.026930] RSP: 002b:00007fff6069b770 EFLAGS: 00000202
Dec 1 11:44:15 s0 kernel: [18446744060.026932] RAX: 00007f92e27e7e10 RBX: 00007f92e06d5e40 RCX: 0000000000020000
Dec 1 11:44:15 s0 kernel: [18446744060.026933] RDX: 00007f92e27e7e10 RSI: 0000000000020209 RDI: 0000000000000002
Dec 1 11:44:15 s0 kernel: [18446744060.026934] RBP: ffffffff81013cae R08: 0000000000000001 R09: 0000000000000000
Dec 1 11:44:15 s0 kernel: [18446744060.026935] R10: 00007f92e06d6398 R11: 0000000000000870 R12: 00000000000000c0
Dec 1 11:44:15 s0 kernel: [18446744060.026937] R13: 00007f92e299dca0 R14: 0000000000000020 R15: 00007f92e06d5e40
Dec 1 11:44:15 s0 kernel: [18446744060.026939] FS: 00007f92e105b700(0000) GS:ffff880009c00000(0000) knlGS:0000000000000000
Dec 1 11:44:15 s0 kernel: [18446744060.026940] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 1 11:44:15 s0 kernel: [18446744060.026941] CR2: 00007ff12ea15000 CR3: 0000000267067000 CR4: 00000000000006f0
Dec 1 11:44:15 s0 kernel: [18446744060.026968] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 1 11:44:15 s0 kernel: [18446744060.026989] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 1 11:44:15 s0 kernel: [18446744060.026991] Call Trace:
(There is no trace - that's the last line.)
I don't seem to have the other error dumps anymore, but I'm quite sure the process mentioned above ("jed") was different in the other dumps.
What could cause this problem? How can I prevent it from happening?
Some extra info:
- The value 17163091988 is a bit (pun intended) suspicious - it's 1111111111000000000000000000010100 in binary. Maybe the error was trying to say 20 seconds (10100 in binary)?
- I'm not sure whether the problem persists with the latest 10.04 kernel (2.6.32-35).
- I've also seen "task ... blocked for more than 120 seconds" problems - maybe they could be related?
- The vSphere client shows no alerts or migration tasks for the VM.
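The binary observation in the question checks out: the reported value decomposes into ten high 1-bits (a wrapped tick counter) plus the 20 seconds the watchdog presumably meant to report. A quick check (my own sketch, not from the original post):

```python
value = 17163091988  # seconds reported by the soft lockup message

# Ten 1-bits starting at bit 24, plus 20 (0b10100) in the low bits.
assert value == (0b1111111111 << 24) | 20

# The value in the question title is the same number minus the 20 seconds.
assert value - 20 == 17163091968

print(bin(value))    # 0b1111111111000000000000000000010100
print(value & 0x1F)  # 20 -> the "20 seconds" hiding in the low bits
```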
-
SaveTheRbtz over 12 years: Maybe something is wrong with the timekeeping? You can play with the "clocksource". Also, the CPUs' C-states are a good guess.
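For reference, and assuming a Linux guest with sysfs mounted, the clocksource the commenter mentions can be inspected through sysfs like this (my own sketch, not from the comment):

```shell
#!/bin/sh
# sysfs entries exposed by the Linux clocksource framework.
SRC=/sys/devices/system/clocksource/clocksource0

# Which source the kernel is currently using, e.g. "tsc" or "kvm-clock".
cat "$SRC/current_clocksource" 2>/dev/null

# All sources this kernel could use instead, e.g. "tsc hpet acpi_pm".
cat "$SRC/available_clocksource" 2>/dev/null

# Switching at runtime (as root) would look like, for example:
# echo acpi_pm > "$SRC/current_clocksource"
```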
-
thinice over 12 years: Good find - I wonder if it trickles down to Debian.
-
James about 12 years: I've just seen something that looks related in Debian: a 2.6.32-5-amd64 kernel showing two soft lockups and performing "oddly".