How to investigate cause of total hang?

Solution 1

Frederik's answer involving magic SysRq and kernel dumps will work if the kernel is still running, and not truly hung. The kernel might just be busy-looping for some reason.

The fact that it doesn't respond to Ctrl-Alt-Del tells me that probably isn't the case, and that the machine is locking up hard. That means hardware failure, or something closely related, like a bad driver.

Your memory check test is good, if you let it run long enough. You should also stress the system in other ways, for example with StressLinux. Long-running benchmarks are good, too.

Another thing to try is booting the system with an Ubuntu live CD and trying to use the system as normal. If returning to Ubuntu temporarily like that doesn't cause the problem to recur, there's a good chance it's not actually broken hardware, but one of the related things like a bad driver or incorrectly configured kernel. It is quite possible that a more popular distribution like Ubuntu could have a more stable kernel configuration than one like Arch, simply due to the greater number of machines it's been tried on during the distro's test phase.

Solution 2

Regarding the freeze, there are a few options:

  • Use a serial port, if your box has one, to capture the dump: add console=ttyS0 to the boot options, as described here. You need a second machine with a serial port and a null modem cable to catch the output.

  • Use netconsole to send the dump over the network, see here.

  • Use kexec/kdump; this way you get a local dump, see here.
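
For the netconsole option, the second machine only needs to capture UDP datagrams. Below is a minimal sketch of such a receiver, not a definitive tool. It assumes the hung machine has been configured to send to UDP port 6666 (the default target port in the kernel's netconsole documentation); the function names and the log file name are made up for illustration.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket where netconsole messages are expected to arrive."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def save_packets(sock, log_path, max_packets=None):
    """Append each received datagram to log_path; run forever if max_packets is None."""
    received = 0
    with open(log_path, "ab") as log:
        while max_packets is None or received < max_packets:
            data, _sender = sock.recvfrom(65535)
            log.write(data)
            log.flush()  # keep the file current in case the listener is interrupted
            received += 1


if __name__ == "__main__":
    # Hypothetical file name; stop with Ctrl-C once the other machine has hung.
    save_packets(open_listener(), "netconsole.log")
```

Whether anything arrives after a hard lock-up depends on the kernel still being able to drive the network card, so an empty log is itself a hint that the hang is at the hardware level.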

Regarding the clean power-off problem, I suggest you use the magic SysRq key to 'S'ync the disks, 'U'nmount them (in practice, remount them read-only), and then re'B'oot the box (the letters are the ones you type while holding Alt-SysRq).
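
Some distributions restrict which SysRq functions are enabled, so it is worth checking before the next hang. The sketch below decodes the value in /proc/sys/kernel/sysrq for the three keys mentioned above. The bit values (16 for sync, 32 for remount read-only, 128 for reboot) are taken from the kernel's sysrq documentation; the function names are my own.

```python
# Bits of kernel.sysrq that gate each key, per the kernel's sysrq documentation.
SYSRQ_BITS = {"s": 16, "u": 32, "b": 128}


def allowed_keys(mask):
    """Return which of S, U and B the given kernel.sysrq value permits."""
    if mask == 1:  # 1 is a special value: every SysRq function is enabled
        return {key: True for key in SYSRQ_BITS}
    return {key: bool(mask & bit) for key, bit in SYSRQ_BITS.items()}


def read_mask(path="/proc/sys/kernel/sysrq"):
    """Read the current SysRq setting from procfs."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    print(allowed_keys(read_mask()))
```

If any of the three comes back False, writing 1 to /proc/sys/kernel/sysrq as root enables all functions until the next reboot.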

Edit: If you post the oops/trace to the lkml, you should use a recent (preferably the latest) version of the kernel and no proprietary modules.

Author by

DarenW

Updated on September 17, 2022

Comments

  • DarenW
    DarenW almost 2 years

    My Arch machine sometimes hangs, suddenly not responding in any way to the mouse or the keyboard. The cursor is frozen. Ctrl-Alt-Backsp won't stop X11, and ctrl-alt-del does exactly nothing. The cpu, network, and disk activity plots in conky and icewm stop updating. In a few minutes the fan turns on. The only way to make the computer do anything at all is to turn off power.

    When it boots up, the CPU temperature monitors show 70 to 80C. Before the hang, I was usually doing low-intensity activity like web surfing getting around 50C.

    The logs show nothing special compared to a normal shutdown. Memory checker runs fine with zero defects.

    How can I investigate why it hung up? Is there extra information I can find for a clue? Is there anything less drastic than power-off to get some kind of action, if only some limited shell or just beeps, but might give a clue?

    The machine is a Gateway P6860 17" laptop (bulky but powerful) and it's running Arch 64bit, up to date (as of March 2011). I had Arch for a long time w/o this problem, switched to Ubuntu for about a week then retreated back to a fresh install of Arch. That's when the hangings started.

    UPDATE: Yeah, for sure it's overheating. At one temperature, the mouse and keyboard stop working, sometimes becoming functional after several minutes of cooling off. At a higher temperature, worse things happen, like total nonresponsiveness including ignoring SysRq. This condition is shortly followed by a sudden power-off. I have solved the problem by buying a new computer 8D

  • jpc
    jpc over 13 years
    I believe that Ctrl-Alt-Delete is handled by init, so it may not work even if the kernel still does. OTOH, AFAIR the kernel does not wait for SysRq keys after a panic.
  • Warren Young
    Warren Young over 13 years
    That's possible. To distinguish the cases, put ctrlaltdel hard in your /etc/rc.local file. When the system locks up, try Ctrl-Alt-Del. If it still does nothing, you know for sure that the kernel is no longer running; you have a hardware or driver failure.
  • jsbillings
    jsbillings over 13 years
    I've had kernels respond to Magic SysRq keys even after they had panicked. Proper setup of the kdump service should ensure that a completely wedged system boots into the kdump kernel, so it should eventually be back.
  • Warren Young
    Warren Young over 13 years
    After a quick poke through the kernel keyboard handling code, it looks to me like Ctrl-Alt-Del and magic SysRq are handled at the same level: if one works, the other will. The init(1)/SIGINT issue is separate, and is dealt with by setting the Ctrl-Alt-Del handling to do a hard reboot, as mentioned in my other comment.
  • DarenW
    DarenW over 13 years
    I can imagine a lot of young voices saying "What's a serial port, grandpa?" In fact, I don't think this machine even has one.
  • DarenW
    DarenW over 13 years
    I remember reading something about SysRq a few years ago. If only I could google it when the machine is dead! Guess I'd better get busy setting up a second machine...
  • John
    John about 4 years
    @WarrenYoung Could you give some hints on the related code snippet? Quote from your reply: "After a quick poke through the kernel keyboard handling code, it looks to me like Ctrl-Alt-Del and magic SysRq are handled at the same level: if one works, the other will."
  • John
    John about 4 years
    @Frederik Deweerdt Do you think netconsole would still work under such conditions?
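
Since the asker's update traces the hangs to overheating, one low-tech way to catch that kind of cause is to log temperatures to disk so the last reading survives a forced power-off. This is only a sketch: it assumes the machine exposes its sensors under /sys/class/thermal (values in millidegrees Celsius), and the log file name and 5-second interval are arbitrary choices.

```python
import glob
import os
import time

DEFAULT_PATTERN = "/sys/class/thermal/thermal_zone*/temp"


def read_temps(pattern=DEFAULT_PATTERN):
    """Return {sysfs_path: degrees_C} for every readable thermal zone."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                temps[path] = int(f.read().strip()) / 1000.0  # millidegrees to degrees
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
    return temps


def log_once(log_path, pattern=DEFAULT_PATTERN):
    """Append one timestamped line of readings and force it to disk."""
    temps = read_temps(pattern)
    readings = " ".join("%.1f" % t for t in temps.values())
    with open(log_path, "a") as log:
        log.write("%d %s\n" % (time.time(), readings))
        log.flush()
        os.fsync(log.fileno())  # so the line is on disk before a hard power-off
    return temps


if __name__ == "__main__":
    while True:
        log_once("temps.log")  # hypothetical file name
        time.sleep(5)
```

After the next hang, the last few lines of the log show whether the temperature was climbing just before the machine stopped responding.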