Crash during startup on a recent corporate computer

The problem

It turns out that my problem is a known issue between the latest Intel microcode on (some?) Skylake CPUs and recent Linux kernels, which is mainly triggered by sssd. See Ubuntu bug #1759920 “intel-microcode 3.20180312.0 causes lockup at login screen(w/ linux-image-4.13.0-37-generic)”, and also a number of other bugs which turn out to be about the same issue, such as Ubuntu bug #1746806 “sssd appears to crash AWS c5 and m5 instances, cause 100% CPU” and Ubuntu bug #1746418 “System freezes when starting Xorg after installing linux-image-4.13.0-32-generic”. You are likely to encounter this bug if:

  • You have a very recent Intel CPU. As far as I can tell, this bug only arises on Skylake CPUs.
  • You have the intel-microcode package installed. Reverting to an earlier, tested kernel didn't work for me because I'd only run that kernel with an earlier microcode.
  • Your computer is connected to a corporate network (typically LDAP or Active Directory) for user authentication. Although there are other ways to trigger the bug, running sssd seems to be the most common culprit. There are also reports of Xorg crashing.
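The three conditions above can be checked from a root shell. What follows is a minimal sketch rather than a definitive test: the helper name cpuinfo_field is made up for illustration, and it assumes the service unit is called sssd. It only reports the values; deciding whether the CPU is a Skylake is left to the reader.

```shell
#!/bin/sh
# Print the first value of a field from a cpuinfo-style file.
# $1 = field name, $2 = file to read (defaults to /proc/cpuinfo).
# cpuinfo_field is an illustrative helper, not a standard tool.
cpuinfo_field () {
  sed -n "s/^$1[[:space:]]*:[[:space:]]*//p" "${2:-/proc/cpuinfo}" 2>/dev/null | head -n 1
}

echo "CPU model:          $(cpuinfo_field 'model name')"
echo "Microcode revision: $(cpuinfo_field microcode)"

# Is the microcode package fully installed (not just known to dpkg)?
if dpkg-query -W -f='${Status}' intel-microcode 2>/dev/null | grep -q 'install ok installed'; then
  echo "intel-microcode:    installed"
else
  echo "intel-microcode:    not installed"
fi

# Is sssd set to start at boot? (Assumes the unit is named sssd.)
if systemctl is-enabled sssd >/dev/null 2>&1; then
  echo "sssd:               enabled"
else
  echo "sssd:               not enabled"
fi
```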

The bug is due to mitigations for the Spectre security issue that was published in January 2018. There's an incompatibility between some kernel code and some processor microcode that causes a lock-up in certain circumstances.

How to repair

  1. If you can't boot normally, you'll need to edit the kernel command line at the Grub prompt. See the question for explanations and possible ways to get a root shell.
  2. A workaround for this specific bug is to add the noibpb parameter to the kernel command line (1746418/14, 1759920/56). This should let you boot normally and perform some repairs.
    This disables the vulnerability mitigation that causes the problem, which means that your computer is now vulnerable to some attacks. They're local attacks, i.e. the attacker needs to run code on your machine, but these attacks may potentially be carried out e.g. through JavaScript in a web browser.
    If you have no other option, you can make the workaround persistent by adding noibpb to the default kernel command line in the Grub configuration. Remove it again once you have a fixed kernel.
  3. In Ubuntu, the fix is expected on the week of 23 April 2018, in what will presumably be kernel 4.4.0-117 and 4.13.0-39. In the meantime, Tyler Hicks has published test kernels for 4.4 and 4.13.
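To keep noibpb across reboots on Ubuntu, the parameter can be appended to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. The function below is a sketch under assumptions: the name add_kernel_param is hypothetical, and it expects the variable to already exist on a single double-quoted line, as in the stock Ubuntu file. It leaves a .bak copy of the file next to the original.

```shell
#!/bin/sh
# Append a parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is already there.
# $1 = Grub defaults file, $2 = parameter to add.
# add_kernel_param is an illustrative name, not part of any Ubuntu tool.
add_kernel_param () {
  file=$1
  param=$2
  # Current value of the variable, without the surrounding quotes.
  current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
  case " $current " in
    *" $param "*) return 0 ;;   # already present, nothing to do
  esac
  # Rewrite the line in place, keeping a backup as "$file.bak".
  sed -i.bak "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"\$/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"/" "$file"
}
```

As root, this would be called as add_kernel_param /etc/default/grub noibpb, followed by update-grub to regenerate the boot menu. To undo the workaround later, delete noibpb from the same line and run update-grub again.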

How I diagnosed the issue

I tried several things (see the question) and determined that the bug was triggered somewhere between reaching basic.target and reaching multi-user.target. So I set the default systemd target to basic.target (systemctl set-default basic.target) and enabled the debug-shell service (systemctl enable debug-shell) to get a root shell.

I ran systemctl list-dependencies multi-user.target and manually started the listed dependencies one by one. This did not trigger the crash.

Not all services are managed directly by systemd: some are Upstart jobs and some are SysVinit scripts. The shell script below starts all of them one at a time and logs each step to disk. Note: I only ran it once, and that run ended in the crash it was meant to provoke.

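#!/bin/sh
# Units wanted by multi-user.target, one per line, sorted.
wants=$(systemctl show -p Wants multi-user.target | sed 's/^Wants=//' | tr ' ' '\n' | sort)
# The log goes to disk so that it survives the lockup.
log=/var/tmp/multi-user-steps-$(date +%Y%m%d-%H%M%S)

# Run a command, logging it before and after. The sync calls flush the log
# to disk, so the last line identifies the step that hung.
log () {
  echo "$* ..." | tee -a "$log"
  sync
  "$@"
  ret=$?
  echo "$* -> $ret" | tee -a "$log"
  sync
  return $ret
}

# systemd services
for service in $wants; do
  log systemctl start "$service"
  sleep 2
done

# upstart services
for conf in /etc/init/*.conf; do
  [ -e "$conf" ] || continue   # the glob matched nothing
  service=${conf##*/}; service=${service%.conf}
  log service "$service" start
  sleep 2
done

# sysvinit services
for service in /etc/rc3.d/S*; do
  [ -x "$service" ] || continue   # the glob matched nothing, or not executable
  log "$service" start
  sleep 2
done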

My computer crashed after starting sssd. From there, a web search on “sssd linux kernel hang” led me to https://bugs.launchpad.net/cloud-images/+bug/1746806 and to the diagnosis and solution.
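After the forced reboot, the culprit is the last step that was logged as started but never logged as finished. Here is a small sketch for reading that off the log, assuming the log format written by the script above (a "command ..." line before each step, then a "command -> status" line after it). The function name last_unfinished_step is made up for illustration.

```shell
#!/bin/sh
# Print the command that was started but never completed, if there is one.
# $1 = log file written by the step-by-step script.
# last_unfinished_step is an illustrative name.
last_unfinished_step () {
  last=$(tail -n 1 "$1")
  case $last in
    *" ...") printf '%s\n' "${last% ...}" ;;   # started, no result line followed
    *) return 1 ;;                              # last step finished (or log is empty)
  esac
}

# When a log file is given as an argument, report on it.
if [ "$#" -gt 0 ]; then
  last_unfinished_step "$1" || echo "every logged step completed"
fi
```

In my case this would have printed systemctl start followed by the sssd unit name.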

Author: Gilles 'SO- stop being evil'

Updated on September 18, 2022

Comments

  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' over 1 year

    After some recent updates, my computer no longer boots! Here's what I could determine:

    • This is a very recent computer that was provided to me by corporate IT. It has a recent Intel CPU (Skylake generation).
    • The computer runs Ubuntu 16.04.
    • The computer last booted correctly some time in March. The problem is presumably due to a software update or a hardware bug.
    • I have another computer running 16.04 with pretty much the same software installed (I used apt-clone), and it works just fine. It has different hardware (also amd64, but different CPU, different GPU, etc.).
    • The kernel does start, the initrd works correctly. When I boot with a splash screen in graphics mode, I get prompted for the password for my dm-crypt volume, and the last thing I see is that it's mounted successfully.
    • The hang occurs before I get a login prompt. When the computer hangs, it's a hard hang. Even Alt+SysRq doesn't respond. The CPU is evidently pegged at 100% since the fans turn on at full blast.
    • I still have the kernel I was running before rebooting. When I select this kernel in the Grub menu, I get the same lockup. So it looks like this is a pre-existing kernel bug which gets triggered by something else — but what?
    • If I switch off the splash screen (remove splash from the linux command line in Grub), I see a number of services starting, then it locks up.
    • I can get a root shell by adding init=/bin/sh to the linux command line in Grub. I can even get further by adding

      systemd.unit=basic.target systemd.debug-shell
      

      This starts a number of services and runs a root shell on tty9.

    • If I run systemctl start multi-user.target from that root shell, the computer locks up. So presumably the problem is triggered by one of these services.
    • I ran systemctl list-dependencies multi-user.target to see what services get started. I manually started the listed dependencies one by one, and everything started just fine.

    So this looks like a hardware bug (since it occurs on one computer but not on the other one) that gets triggered by some software. But what software? Since the computer locks up so hard, I can't get any logs. I can't even get any useful console output.


    Useful debugging techniques:

    • Alt+SysRq: magic SysRq key, which lets you do things such as an emergency reboot. It accesses the kernel at a very low level, so it works in all but the worst crashes. In my case, Alt+SysRq doesn't respond, which shows how deep the crash goes.
    • To modify the boot parameters, press and hold Shift a few seconds after switching the power on. You need to press it after the BIOS has initialized the keyboard, but before the operating system boots. This makes the Grub menu appear.
    • At the Grub menu, press e to edit the command line for a menu entry. To change the Linux boot parameters, navigate to the line that starts with linux. On a modern Ubuntu, you'll find old kernels under “Advanced options for Ubuntu”. Once you've made the desired changes to the command line, press Ctrl+x to boot. Any changes you make here are for this boot only; they aren't saved to disk.
    • Some useful options on the linux command line:
      • quiet splash hides almost all boot messages. Remove both to get messages on the console during boot, which is necessary to have any chance of diagnosing problems.
      • recovery gives you a root shell with almost no services. You'll need to know the root password. The “recovery mode” menu entry uses this.
      • init=/bin/sh gives you a root shell with no services at all. To resume normal boot, run exec init. You can pass systemd options at this point, e.g. exec init --unit=basic.target to start init and a few services (note that this does not start any way to log in, so you'd better have a shell running on another console). Note that the root filesystem is mounted read-only; run mount -o remount,rw / to be able to write to it.
      • systemd.unit=basic.target starts a very basic set of services. Note that this does not include any way to log in! You can make this the default by running systemctl set-default basic.target at a root prompt. To restore the original default target, run systemctl set-default graphical.target (or systemctl set-default multi-user.target for a server with no GUI).
      • systemd.debug-shell starts a root shell on tty9. You can enable this for every boot by running systemctl enable debug-shell at a root prompt. Don't forget to disable this after you've solved the problem with systemctl disable debug-shell. Press Alt+F9 to switch to tty9.
      • See also Fedora systemd tips, Arch Linux boot problem tips.
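Once the system is up, it can be useful to confirm which of these options the running kernel was actually booted with, since edits made at the Grub menu are easy to get wrong. The following is a minimal sketch that reads /proc/cmdline; the helper name has_kernel_param is hypothetical, and the list of parameters checked is just the ones discussed above.

```shell
#!/bin/sh
# Succeed if the given parameter appears as a whole word on the kernel
# command line. $1 = parameter, $2 = file to read (default /proc/cmdline).
# has_kernel_param is an illustrative name.
has_kernel_param () {
  cmdline=$(cat "${2:-/proc/cmdline}" 2>/dev/null)
  case " $cmdline " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

for param in quiet splash systemd.unit=basic.target systemd.debug-shell; do
  if has_kernel_param "$param"; then
    echo "$param: present"
  else
    echo "$param: absent"
  fi
done
```

The match is on whole space-separated words, so a parameter such as nosplash is not mistaken for splash.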
  • Tonny
    Tonny about 6 years
    I ran into this one as well. I removed the intel-microcode package and blacklisted it in apt to prevent it from being reinstalled. The microcode that causes the issue isn't added permanently to the CPU; it is reloaded on every boot, so not loading it also works as a workaround. noibpb isn't needed in that case, and you still get the mitigations. In my case that is a necessity, as this system is directly Internet-facing most of the time, without the added protection of the corporate proxy servers.
  • Tonny
    Tonny about 6 years
    I know and I agree. But the new kernels aren't here yet, and for the time being I prefer a working system with most mitigations (except the microcode) to a system with the microcode but no software mitigations at all (and those cover more than the microcode does). Regarding the microcode updates: for these new Skylakes, the Spectre/Meltdown fixes seem to be the only microcode updates so far, so we don't seem to miss out on much without them. For older CPUs it is another matter: lots of CPU errata are fixed by microcode updates, and I really would be loath to go without those.