watchdog: BUG: soft lockup - CPU#6 stuck for 23s

Solution 1

Possible swap/memory problem.

BIOS

You have BIOS version S1200SP.86B.03.01.0042.013020190050 dated 01/30/2019.

There's a newer BIOS available, dated June 2020, and it can be downloaded here.

Note: Have good backups before updating the BIOS.

Memtest

Go to https://www.memtest86.com/ and download/run their free memtest to test your memory. Get at least one complete pass of all the 4/4 tests to confirm good memory. This may take many hours to complete.

Update #1:

As I previously thought... you have swap problems.

You have THREE swap locations, as seen in /etc/fstab!

UUID="X-X-X-X-X" swap swap defaults 0 0
UUID="X-X-X-X-X" swap swap defaults 0 0
/swapfile swap swap defaults 0 0

Do sudo swapoff -a # turn off swap

Then comment out ALL three of the above lines in /etc/fstab.
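If you'd rather not edit the file by hand, here is a minimal sketch of one way to do it. The function name comment_out_swap and the .bak backup suffix are my own choices, not anything standard, and it assumes GNU sed (the default on Ubuntu). It treats any uncommented line containing a whitespace-delimited "swap" field as a swap entry, so check the result before rebooting.

```shell
# Comment out every active swap entry in an fstab-style file.
# Hypothetical helper: the name and the .bak suffix are illustrative only.
# Defaults to /etc/fstab; pass a copy of the file to rehearse first.
comment_out_swap() {
    fstab="${1:-/etc/fstab}"

    # Keep a backup next to the original before touching it.
    cp -p -- "$fstab" "$fstab.bak"

    # Skip lines that are already comments; on the rest, prefix '#'
    # when a whitespace-delimited "swap" field is present.
    sed -i -E '/^[[:space:]]*#/!{/[[:space:]]swap[[:space:]]/s/^/#/;}' "$fstab"
}
```

To try it safely: cp /etc/fstab /tmp/fstab.test, then comment_out_swap /tmp/fstab.test and compare the two files with diff. Running it against /etc/fstab itself needs a root shell.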

It's never ok to completely disable swap. It's also not appropriate to have too small a swap. You have both problems.

Let's create an appropriate /swapfile for your system.

Note: Incorrect use of the dd command can cause data loss. Suggest copy/paste.

sudo swapoff -a           # turn off swap
sudo rm -i /swapfile      # remove old /swapfile

sudo dd if=/dev/zero of=/swapfile bs=1M count=4096

sudo chmod 600 /swapfile  # set proper file protections
sudo mkswap /swapfile     # init /swapfile
sudo swapon /swapfile     # turn on swap
free -h                   # confirm 32G RAM and 4G swap

Add this line to /etc/fstab...

/swapfile    none    swap    sw      0   0
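A small sketch for adding that line without ending up with duplicates if you run it twice. The function name add_swapfile_entry is hypothetical; the entry text is the one shown above, and the check only looks for an uncommented line that starts with /swapfile.

```shell
# Append the /swapfile entry to an fstab-style file unless an
# uncommented /swapfile line is already there.
# Hypothetical helper; defaults to /etc/fstab.
add_swapfile_entry() {
    fstab="${1:-/etc/fstab}"
    entry='/swapfile    none    swap    sw      0   0'

    if grep -Eq '^[[:space:]]*/swapfile[[:space:]]' "$fstab"; then
        echo "swapfile entry already present in $fstab"
    else
        printf '%s\n' "$entry" >> "$fstab"
    fi
}
```

As with any edit to /etc/fstab, this needs a root shell when pointed at the real file, and it is worth testing on a copy first.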

Then reboot the system and verify operation.
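One way to verify after the reboot is to look at /proc/swaps, which lists the active swap areas with their size in KiB. The sketch below (swap_summary is a made-up name) prints each area and a total in MiB. On this setup you would expect a single /swapfile line of roughly 4096 MiB and nothing else.

```shell
# Print each active swap area and the total size in MiB.
# Hypothetical helper; reads /proc/swaps by default, where the first
# line is a header and the third column (Size) is in KiB.
swap_summary() {
    awk 'NR > 1 { printf "%s\t%d MiB\n", $1, $3 / 1024; total += $3 }
         END    { printf "total\t%d MiB\n", total / 1024 }' "${1:-/proc/swaps}"
}
```

If one of the old partitions still shows up here, its line in /etc/fstab is probably still active.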

If it all works, you can use gparted to delete the two disk partitions with the UUIDs shown in the commented-out lines in /etc/fstab. Be careful here, and make sure that you've got the correct partitions to delete. Then delete those three commented-out lines in /etc/fstab.

Solution 2

I had this error on a VM in a locally run VM farm whose disks were full. The hypervisor was not able to allocate more space to "thin" disk partitions (these have physical space allocated on demand, and the farm was oversubscribed). Note that the hypervisor requires a certain overhead to run (perhaps 10%), and will reserve that space.

It turned out that one of the physical machines had had a problem and wasn't reporting freed-up disk space, which led to the VM farm hallucinating that the disks were full. When that machine was rebooted, the problem went away. We're doing an OS and hypervisor update; hopefully that will prevent the issue in the future.

Solution 3

Although the question seems answered, to anyone who finds themselves here with the same CPU error: in addition to heynnema's answer, check the PCIe power cable connections to any graphics cards you have installed.

I had the same errors, and the problems stopped after I disconnected a graphics card which I later realised had a faulty (and charred) 6-pin connection. Replacing the cable returned the system to normal.

I would also recommend checking that the CPU/memory timings are not too aggressive and that the CPU cooler is attached correctly (tightly).

Author: Dr.Blamo

Updated on September 18, 2022

Comments

  • Dr.Blamo over 1 year

    I tried every solution I found on Google, but I can't find out why my server is crashing.

    Aug  5 17:11:08  kernel: [ 2300.084576] watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [VM Thread:4054]
    Aug  5 17:11:08  kernel: [ 2300.084578] Modules linked in: veth nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo br_netfilter bridge stp llc rpcsec_gss_krb5 auth_rpcgss aufs nfsv4 nfs lockd grace fscache overlay isofs xt_nat xt_MASQUERADE xt_addrtype iptable_nat nf_nat xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ppdev kvm_intel kvm ipmi_si input_leds joydev ipmi_devintf ipmi_msghandler video parport_pc parport acpi_pad sch_fq_codel drm sunrpc ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hid_generic aesni_intel crypto_simd cryptd glue_helper usbhid igb hid nvme dca ahci i2c_algo_bit nvme_core libahci
    Aug  5 17:11:08  kernel: [ 2300.084616] CPU: 6 PID: 4054 Comm: VM Thread Not tainted 5.4.0-42-generic #46-Ubuntu
    Aug  5 17:11:08  kernel: [ 2300.084616] Hardware name: Intel Corporation S1200SP/S1200SP, BIOS S1200SP.86B.03.01.0042.013020190050 01/30/2019
    Aug  5 17:11:08  kernel: [ 2300.084620] RIP: 0010:_raw_spin_lock+0x10/0x30
    Aug  5 17:11:08  kernel: [ 2300.084621] Code: ff 01 00 00 75 07 4c 89 e0 41 5c 5d c3 e8 f8 f9 62 ff 4c 89 e0 41 5c 5d c3 90 0f 1f 44 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 55 89 c6 48 89 e5 e8 c2 e1 62 ff 66 90 5d c3 66 66 2e 0f
    Aug  5 17:11:08  kernel: [ 2300.084621] RSP: 0000:ffffa592c1bef760 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
    Aug  5 17:11:08  kernel: [ 2300.084622] RAX: 0000000000000000 RBX: 0000000000000100 RCX: ffff95314b79bc00
    Aug  5 17:11:08  kernel: [ 2300.084622] RDX: 0000000000000001 RSI: 0000000000000588 RDI: ffff953145c1aeac
    Aug  5 17:11:08  kernel: [ 2300.084623] RBP: ffffa592c1bef7b8 R08: ffff95314a5520f0 R09: 0000000000000000
    Aug  5 17:11:08  kernel: [ 2300.084623] R10: 0000000000000000 R11: ffffffffffffffb8 R12: 0000000000000000
    Aug  5 17:11:08  kernel: [ 2300.084623] R13: ffff953145c1ae00 R14: ffff95314b79bc00 R15: ffff953145c1aeac
    Aug  5 17:11:08  kernel: [ 2300.084624] FS:  00007fa0e4151700(0000) GS:ffff953151580000(0000) knlGS:0000000000000000
    Aug  5 17:11:08  kernel: [ 2300.084624] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Aug  5 17:11:08  kernel: [ 2300.084625] CR2: 0000000594832008 CR3: 000000045bf00003 CR4: 00000000003606e0
    Aug  5 17:11:08  kernel: [ 2300.084625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    Aug  5 17:11:08  kernel: [ 2300.084625] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Aug  5 17:11:08  kernel: [ 2300.084626] Call Trace:
    Aug  5 17:11:08  kernel: [ 2300.084628]  ? scan_swap_map_slots+0x3cd/0x510
    Aug  5 17:11:08  kernel: [ 2300.084629]  get_swap_pages+0x207/0x380
    Aug  5 17:11:08  kernel: [ 2300.084630]  ? rmap_walk_anon+0x16f/0x260
    Aug  5 17:11:08  kernel: [ 2300.084632]  get_swap_page+0xe3/0x210
    Aug  5 17:11:08  kernel: [ 2300.084633]  add_to_swap+0x1a/0x70
    Aug  5 17:11:08  kernel: [ 2300.084634]  shrink_page_list+0x4b3/0xbb0
    Aug  5 17:11:08  kernel: [ 2300.084648]  shrink_inactive_list+0x201/0x3e0
    Aug  5 17:11:08  kernel: [ 2300.084649]  shrink_node_memcg+0x137/0x370
    Aug  5 17:11:08  kernel: [ 2300.084650]  shrink_node+0xbd/0x400
    Aug  5 17:11:08  kernel: [ 2300.084650]  do_try_to_free_pages+0xd7/0x3a0
    Aug  5 17:11:08  kernel: [ 2300.084651]  try_to_free_mem_cgroup_pages+0xf4/0x210
    Aug  5 17:11:08  kernel: [ 2300.084653]  try_charge+0x2eb/0x810
    Aug  5 17:11:08  kernel: [ 2300.084654]  ? find_get_entry+0xaf/0x170
    Aug  5 17:11:08  kernel: [ 2300.084655]  mem_cgroup_try_charge+0x71/0x190
    Aug  5 17:11:08  kernel: [ 2300.084656]  ? pagecache_get_page+0x2d/0x300
    Aug  5 17:11:08  kernel: [ 2300.084657]  mem_cgroup_try_charge_delay+0x22/0x50
    Aug  5 17:11:08  kernel: [ 2300.084658]  do_swap_page+0x220/0x9f0
    Aug  5 17:11:08  kernel: [ 2300.084659]  __handle_mm_fault+0x73b/0x7a0
    Aug  5 17:11:08  kernel: [ 2300.084659]  handle_mm_fault+0xca/0x200
    Aug  5 17:11:08  kernel: [ 2300.084661]  do_user_addr_fault+0x1f9/0x450
    Aug  5 17:11:08  kernel: [ 2300.084662]  __do_page_fault+0x58/0x90
    Aug  5 17:11:08  kernel: [ 2300.084663]  do_page_fault+0x2c/0xe0
    Aug  5 17:11:08  kernel: [ 2300.084664]  page_fault+0x34/0x40
    Aug  5 17:11:08  kernel: [ 2300.084665] RIP: 0033:0x7fa168646be3
    Aug  5 17:11:08  kernel: [ 2300.084666] Code: 4c 89 6d b8 49 89 5d 00 49 c7 45 08 00 00 00 00 4c 3b 6d b0 0f 83 1d 01 00 00 4c 89 6d b0 49 89 dd 4d 39 fd 0f 83 bd 00 00 00 <49> 8b 45 00 4c 89 eb 83 e0 03 48 83 f8 03 0f 84 09 01 00 00 42 0f
    Aug  5 17:11:08  kernel: [ 2300.084666] RSP: 002b:00007fa0e41501b0 EFLAGS: 00010283
    Aug  5 17:11:08  kernel: [ 2300.084667] RAX: 00000005237c2908 RBX: 0000000000000004 RCX: 00007fa0e41504b0
    Aug  5 17:11:08  kernel: [ 2300.084667] RDX: 0000000000000004 RSI: 0000000594831fe8 RDI: 00007fa160745850
    Aug  5 17:11:08  kernel: [ 2300.084668] RBP: 00007fa0e4150230 R08: 00000005237c28e8 R09: 00007fa1607458f0
    Aug  5 17:11:08  kernel: [ 2300.084668] R10: 00007fa168f52d99 R11: 000000014b7bf600 R12: 00007fa1609924d0
    Aug  5 17:11:08  kernel: [ 2300.084668] R13: 0000000594832008 R14: 0000000000000240 R15: 0000000595000000
    

    I replaced all my hardware except my disks.

    When I start a docker container (pterodactyl) with Minecraft, it will sometimes freeze with the error above. I can't find any relevant logs.

    uname -a : Linux X-X-X 5.4.0-42-generic #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

    free -h :

                  total        used        free      shared  buff/cache   available
    Mem:           31Gi       594Mi        29Gi       4.0Mi       1.1Gi        30Gi
    Swap:         1.0Gi          0B       1.0Gi

    sysctl vm.swappiness : vm.swappiness = 60

    sudo lshw -C memory :

      *-firmware
           description: BIOS
           vendor: Intel Corporation
           physical id: 6
           version: S1200SP.86B.03.01.0042.013020190050
           date: 01/30/2019
           size: 64KiB
           capacity: 16MiB
           capabilities: pci pnp upgrade shadowing cdboot bootselect edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer int10video acpi usb ls120boot zipboot biosbootspecification netboot uefi
      *-cache:0
           description: L1 cache
           physical id: 1a
           slot: L1 Cache
           size: 128KiB
           capacity: 128KiB
           capabilities: synchronous internal write-through instruction
           configuration: level=1
      *-cache:1
           description: L2 cache
           physical id: 1b
           slot: L2 Cache
           size: 1MiB
           capacity: 1MiB
           capabilities: synchronous internal write-through unified
           configuration: level=2
      *-cache:2
           description: L3 cache
           physical id: 1c
           slot: L3 Cache
           size: 8MiB
           capacity: 8MiB
           capabilities: synchronous internal write-back unified
           configuration: level=3
      *-cache
           description: L1 cache
           physical id: 19
           slot: L1 Cache
           size: 128KiB
           capacity: 128KiB
           capabilities: synchronous internal write-through data
           configuration: level=1
      *-memory
           description: System Memory
           physical id: 1e
           slot: System board or motherboard
           size: 32GiB
         *-bank:0
              description: [empty]
              vendor: Empty/NO DIMM
              physical id: 0
              slot: DIMM_A1
         *-bank:1
              description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
              product: KHX2400C15/16G
              vendor: Kingston
              physical id: 1
              serial: A800F9241
              slot: DIMM_A2
              size: 16GiB
              width: 64 bits
              clock: 2400MHz (0.4ns)
         *-bank:2
              description: [empty]
              vendor: Empty/NO DIMM
              physical id: 2
              slot: DIMM_B1
         *-bank:3
              description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
              product: KHX2400C15/16G
              vendor: Kingston
              physical id: 3
              serial: BE305496
              slot: DIMM_B2
              size: 16GiB
              width: 64 bits
              clock: 2400MHz (0.4ns)
      *-memory UNCLAIMED
           description: Memory controller
           product: 100 Series/C230 Series Chipset Family Power Management Controller
           vendor: Intel Corporation
           physical id: 1f.2
           bus info: pci@0000:00:1f.2
           version: 31
           width: 32 bits
           clock: 33MHz (30.3ns)
           capabilities: bus_master
           configuration: latency=0
           resources: memory:a2f10000-a2f13fff
    

    grep -i swap /etc/fstab :

    UUID="X-X-X-X-X" swap swap defaults 0 0
    UUID="X-X-X-X-X" swap swap defaults 0 0
    /swapfile swap swap defaults 0 0
    

    Any ideas?

    • Boris Hamanov almost 4 years
      Edit your question and show me free -h and sysctl vm.swappiness and sudo lshw -C memory. Start comments to me with @heynnema or I'll miss them.
    • Dr.Blamo almost 4 years
      @heynnema done.
    • Boris Hamanov almost 4 years
      Your swap is too small. Edit your question and show me grep -i swap /etc/fstab and I'll update my answer.
    • Dr.Blamo almost 4 years
      I fixed the issue by removing a swapfile. I disabled the swap with the command swapoff.
    • Dr.Blamo almost 4 years
      @heynnema I edited the question with the output of grep -i swap /etc/fstab
    • Boris Hamanov almost 4 years
      @Dr.Blamo Please see Update #1 in my answer. Report back.
    • Boris Hamanov almost 4 years
      Status please...
    • Dr.Blamo almost 4 years
      Sorry for the delay, I will check this in a few hours. Thanks for the update, I will try this ASAP. I will reply after the tests.
    • Boris Hamanov almost 4 years
      Status please...