Determining cause of Linux kernel panic

kernel kernel-modules kernel-panic crash

142,943

Solution 1

I have two suggestions to start.

The first you're not going to like. No matter how stable you think your overclocked system is, it would be my first suspect. And any developer you report the problem to will say the same thing. Your stable test workload isn't necessarily using the same instructions, stressing the memory subsystem as much, whatever. Stop overclocking. If you want people to believe the problem's not overclocking, then make it happen when not overclocking so you can get a clean bug report. This will make a huge difference in how much effort other people will invest in solving this problem. Having bug-free software is a point of pride, but reports from people with particularly questionable hardware setups are frustrating time-sinks that probably don't involve a real bug at all.

The second is to get the oops data, which as you've noticed doesn't go to any of the places you've mentioned. If the crash only happens while running X11, I think local console is pretty much out (it's a pain anyway), so you need to do this over a serial console, over the network, or by saving to local disk (which is trickier than it may sound because you don't want an untrustworthy kernel to corrupt your filesystem). Here are some ways to do this:

use netdump to save to a server over the network. I haven't done this in years, so I'm not sure this software is still around and working with modern kernels, but it's easy enough that it's worth a shot.
boot using a serial console; you'll need a serial port free on both machines (whether an old-school one or a USB serial adapter) and a null modem cable; you'd configure the other machine to save the output.
kdump seems to be what the cool kids use nowadays, and seems quite flexible, although it wouldn't be my preference because it looks complex to set up. In short, it involves booting a different kernel that can do anything and inspect the former kernel's memory contents, but you have to essentially build the whole process and I don't see a lot of canned options out there. Update: there are some nice distro packages, actually; on Ubuntu, installing linux-crashdump sets this up and puts crashes in /var/crash.
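
Before waiting for the next crash, it may be worth checking that the kdump route is actually armed. The sketch below assumes the usual arrangement: memory reserved with a crashkernel= boot parameter, and a capture kernel loaded. The helper name has_crashkernel is made up for illustration. /proc/cmdline and /sys/kernel/kexec_crash_loaded are standard kernel interfaces, but how your distro populates them is an assumption here.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a kernel command line reserves memory
# for a crash (capture) kernel via a crashkernel= parameter.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the running system, if the usual kernel interfaces are present.
if [ -r /proc/cmdline ] && has_crashkernel "$(cat /proc/cmdline)"; then
    echo "crashkernel= is set on the running kernel"
else
    echo "no crashkernel= reservation found; kdump cannot capture a dump"
fi
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    # A value of 1 means a capture kernel is loaded and ready.
    echo "kexec_crash_loaded: $(cat /sys/kernel/kexec_crash_loaded)"
fi
```

If either check fails after installing the distro package, a reboot is usually needed so the new boot parameter takes effect.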

Once you get the debug info, there's a tool called ksymoops that you can use to turn the addresses into symbol names and start getting an idea how your kernel crashed. (On 2.6 and later kernels the oops usually already contains symbol names, so you may be able to skip this step.) And if the symbolized dump doesn't mean anything to you, at least this is something helpful to report here or perhaps on your Linux distribution's mailing list or bug tracker.

Once you have your crashdump open in crash, you can try typing log and bt to get a bit more information (things logged during the panic and a stack backtrace). Your Fatal Machine check seems to be coming from here, though. From skimming the code, your processor has reported a Machine Check Exception, which means a hardware problem. Again, my first bet would be overclocking. There may be a more specific message in the log output that could tell you more.

Also from that code, it looks like if you boot with the mce=3 kernel parameter, it will stop crashing...but I wouldn't really recommend this except as a diagnostic step. If the Linux kernel thinks this error is worth crashing over, it's probably right.
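
To get a little more out of the log without any extra tools, the status word printed after "Bank 3:" in the question's log (be00000000800400) can be split into its flag bits. This is only a sketch: the bit positions (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57) are the architectural layout of the machine-check status register as I understand it from Intel's manuals, the helper name decode_mce_status is made up, and the input is assumed to be exactly 16 hex digits. Interpreting the error code itself is better left to mcelog.

```shell
#!/bin/sh
# Hypothetical helper: print the architectural flag bits of a 64-bit
# machine-check status value given as exactly 16 hex digits (no 0x prefix).
decode_mce_status() {
    status=$1
    hi=${status%????????}   # upper 32 bits (first 8 hex digits)
    lo=${status#????????}   # lower 32 bits (last 8 hex digits)
    hi=$((0x$hi))
    lo=$((0x$lo))
    flags=""
    # Bit N of the full register is bit N-32 of the upper half.
    for spec in 31:VAL 30:OVER 29:UC 28:EN 27:MISCV 26:ADDRV 25:PCC; do
        bit=${spec%%:*}
        name=${spec#*:}
        if [ $(( (hi >> bit) & 1 )) -eq 1 ]; then
            flags="$flags $name"
        fi
    done
    echo "flags:${flags}"
    # Low 16 bits: MCA error code; next 16 bits: model-specific code.
    printf 'mca_error_code: 0x%04x\n' $(( lo & 0xffff ))
    printf 'model_specific_code: 0x%04x\n' $(( (lo >> 16) & 0xffff ))
}

# The value reported for Bank 3 in the question's log.
decode_mce_status be00000000800400
```

For this value the UC (uncorrected) and PCC (processor context corrupt) bits come out set, which lines up with the "Processor context corrupt" message the kernel printed just before the panic.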

Solution 2

a) Check whether kernel messages are being logged to a file by the rsyslog daemon

vi /etc/rsyslog.conf

And add the following

kern.*                 /var/log/kernel.log

Restart the rsyslog service.

/etc/init.d/rsyslog restart

b) Take a note of the loaded modules

`lsmod > /your/home/dir/lsmod.txt`

c) As the panic is not reproducible, wait for it to happen

d) Once the panic has occurred, boot the system using a live or emergency CD

e) Mount the filesystems of the affected system (usually / will suffice if /var and /home are not separate file systems). If the affected system uses LVM, bring up the logical volumes first: pvs, vgs and lvs list them, and vgchange -ay activates them.

mount -t ext4 /dev/sdXN /mnt

f) Go to /mnt/var/log/ directory and check the kernel.log file. This should give you enough information to figure out if the panic is happening for a particular module or something else.
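
As a rough sketch of step f), the log can be filtered for the lines that usually mark a crash rather than read top to bottom. The helper name find_crash_lines and the pattern list are assumptions for illustration, not an exhaustive set of kernel messages; the path is the kernel.log file configured in step a), seen through the live-CD mount point.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) the lines of a kernel log
# that usually mark a crash. The pattern list is an assumption, not complete.
find_crash_lines() {
    grep -n -E 'Kernel panic|Oops|BUG:|Machine Check|Hardware Error|Call Trace' "$1"
}

# Example: the log written by rsyslog, relative to the live-CD mount point.
log=/mnt/var/log/kernel.log
if [ -r "$log" ]; then
    find_crash_lines "$log" || echo "no crash markers found in $log"
fi
```

An empty result does not prove there was no panic; the last messages before a crash often never reach the disk.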

Solution 3

Is your processor overclocked? I had this same issue today when I was playing with the multiplier in the over-clocking menu in my BIOS; various multipliers around 20x would cause this to happen. I reduced it down to 18.5x (3.7GHz) and the problem went away; I think it was a motherboard/power issue.

Solution 4

Most definitely a processor issue. Notice the lines that say:

TSC 539b174dead ADDR 3fe98d264ebd MISC 1
[ 1561.519950] [Hardware Error]: PROCESSOR 0:206a7 TIME 1357862746 SOCKET 0 APIC 1 microcode 28

In PROCESSOR 0:206a7, the 0 is the CPU vendor ID (Intel) and 206a7 is the CPUID signature; SOCKET 0 is the offending processor package (though I assume you only have one). Either the CPU is bad or, as you noted, the overclock is causing faults. I know you said you ran it through Prime95, but since I do not have more information on how old the system is, I am grasping at a few straws: how does your thermal paste look, and have you checked that the LGA socket (under the CPU) looks all right? I am thinking of bent pins or some paste in the socket. Again, just root-causing here.

If that fails to fix the issue, there is a little trick to find where the panic hit. Another line (TSC 539b174de9d ADDR 3fe98d264ebd MISC 1) carries the address the error was reported on, and mcelog can look that address up in your SMBIOS (DMI) tables. When your machine is up, run echo "TSC 539b174de9d ADDR 3fe98d264ebd MISC 1" | sudo mcelog --ascii --dmi on the command line. The output will tell you it is a hardware error and even which DIMM it was processing on, which can point to a faulty DIMM or bus path. If the failing DIMM jumps around with every crash, however, this points to the CPU.
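
Rather than retyping a single line, the whole machine-check record can be pulled out of a saved log and passed to mcelog in one go. A minimal sketch, assuming the log uses the "[ timestamp] [Hardware Error]: " prefix seen in the question; the helper name extract_mce_lines and the file name panic.log are made up. Stripping the prefix is probably not required by mcelog, but it keeps the input unambiguous.

```shell
#!/bin/sh
# Hypothetical helper: keep only the "[Hardware Error]" lines of a kernel log
# and strip the "[ timestamp] [Hardware Error]: " prefix from each.
# Reads the named files, or standard input when no file is given.
extract_mce_lines() {
    sed -n 's/^\[ *[0-9.]*] \[Hardware Error]: //p' "$@"
}

# Usage (assumes mcelog is installed and panic.log holds the saved log):
#   extract_mce_lines panic.log | sudo mcelog --ascii --dmi
```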

Author by

Naftuli Kay

Updated on September 18, 2022

Comments

Naftuli Kay almost 2 years

I'm running an Ubuntu 12.04 derivative (amd64) and I've been having really strange issues recently. Out of the blue, seemingly, X will freeze completely for a while (1-3 minutes?) and then the system will reboot. This system is overclocked, but very stable as verified in Windows, which leads me to believe I'm having a kernel panic or an issue with one of my modules. Even in Linux, I can run LINPACK and won't see a crash despite putting ridiculous load on the CPU. Crashes seem to happen at random times, even when the machine is sitting idle.

How can I debug what's crashing the system?

On a hunch that it might be the proprietary NVIDIA driver, I reverted all the way down to the stable version of the driver, version 304 and I still experience the crash.

Can anyone walk me through a good debugging procedure for after a crash? I'd be more than happy to boot into a thumb drive and post all of my post-crash configuration files, I'm just not sure what they would be. How can I find out what's crashing my system?

Here are a bunch of logs, the usual culprits.

.xsession-errors: http://pastebin.com/EEDtVkVm

/var/log/Xorg.0.log: http://pastebin.com/ftsG5VAn

/var/log/kern.log: http://pastebin.com/Hsy7jcHZ

/var/log/syslog: http://pastebin.com/9Fkp3FMz

I can't even seem to find a record of the crash at all.

Triggering the crash is not so simple; it seems to happen when the GPU is trying to draw multiple things at once. If I put on a YouTube video in full screen and let it repeat for a while, or scroll through a ton of GIFs and a Skype notification pops up, sometimes it'll crash. Totally scratching my head on this one.

The CPU is overclocked to 4.8GHz, but it's completely stable and has survived huge LINPACK runs and 9 hours of Prime95 yesterday without a single crash.

Update

I've installed kdump, crash, and linux-crashdump, as well as the kernel debug symbols for my kernel version 3.2.0-35. When I run apport-unpack on the crashed kernel file and then crash on the VmCore crash dump, here's what I see:

      KERNEL: /usr/lib/debug/boot/vmlinux-3.2.0-35-generic
    DUMPFILE: Downloads/crash/VmCore
        CPUS: 8
        DATE: Thu Jan 10 16:05:55 2013
      UPTIME: 00:26:04
LOAD AVERAGE: 2.20, 0.84, 0.49
       TASKS: 614
    NODENAME: mightymoose
     RELEASE: 3.2.0-35-generic
     VERSION: #55-Ubuntu SMP Wed Dec 5 17:42:16 UTC 2012
     MACHINE: x86_64  (3499 Mhz)
      MEMORY: 8 GB
       PANIC: "[ 1561.519960] Kernel panic - not syncing: Fatal Machine check"
         PID: 0
     COMMAND: "swapper/5"
        TASK: ffff880211251700  (1 of 8)  [THREAD_INFO: ffff880211260000]
         CPU: 5
       STATE: TASK_RUNNING (PANIC)

When I run log from the crash utility, I see this at the bottom of the log:

[ 1561.519943] [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000800400
[ 1561.519946] [Hardware Error]: RIP !INEXACT! 33:<00007fe99ae93e54> 
[ 1561.519948] [Hardware Error]: TSC 539b174dead ADDR 3fe98d264ebd MISC 1 
[ 1561.519950] [Hardware Error]: PROCESSOR 0:206a7 TIME 1357862746 SOCKET 0 APIC 1 microcode 28
[ 1561.519951] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1561.519953] [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 3: be00000000800400
[ 1561.519955] [Hardware Error]: TSC 539b174de9d ADDR 3fe98d264ebd MISC 1 
[ 1561.519957] [Hardware Error]: PROCESSOR 0:206a7 TIME 1357862746 SOCKET 0 APIC 0 microcode 28
[ 1561.519958] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1561.519959] [Hardware Error]: Machine check: Processor context corrupt
[ 1561.519960] Kernel panic - not syncing: Fatal Machine check
[ 1561.519962] Pid: 0, comm: swapper/5 Tainted: P   M     C O 3.2.0-35-generic #55-Ubuntu
[ 1561.519963] Call Trace:
[ 1561.519964]  <#MC>  [<ffffffff81644340>] panic+0x91/0x1a4
[ 1561.519971]  [<ffffffff8102abeb>] mce_panic.part.14+0x18b/0x1c0
[ 1561.519973]  [<ffffffff8102ac80>] mce_panic+0x60/0xb0
[ 1561.519975]  [<ffffffff8102aec4>] mce_reign+0x1f4/0x200
[ 1561.519977]  [<ffffffff8102b175>] mce_end+0xf5/0x100
[ 1561.519979]  [<ffffffff8102b92c>] do_machine_check+0x3fc/0x600
[ 1561.519982]  [<ffffffff8136d48f>] ? intel_idle+0xbf/0x150
[ 1561.519984]  [<ffffffff8165d78c>] machine_check+0x1c/0x30
[ 1561.519986]  [<ffffffff8136d48f>] ? intel_idle+0xbf/0x150
[ 1561.519987]  <<EOE>>  [<ffffffff81509697>] ? menu_select+0xe7/0x2c0
[ 1561.519991]  [<ffffffff815082d1>] cpuidle_idle_call+0xc1/0x280
[ 1561.519994]  [<ffffffff8101322a>] cpu_idle+0xca/0x120
[ 1561.519996]  [<ffffffff8163aa9a>] start_secondary+0xd9/0xdb

bt outputs the backtrace:

PID: 0      TASK: ffff880211251700  CPU: 5   COMMAND: "swapper/5"
 #0 [ffff88021ed4aba0] machine_kexec at ffffffff8103947a
 #1 [ffff88021ed4ac10] crash_kexec at ffffffff810b52c8
 #2 [ffff88021ed4ace0] panic at ffffffff81644347
 #3 [ffff88021ed4ad60] mce_panic.part.14 at ffffffff8102abeb
 #4 [ffff88021ed4adb0] mce_panic at ffffffff8102ac80
 #5 [ffff88021ed4ade0] mce_reign at ffffffff8102aec4
 #6 [ffff88021ed4ae40] mce_end at ffffffff8102b175
 #7 [ffff88021ed4ae70] do_machine_check at ffffffff8102b92c
 #8 [ffff88021ed4af50] machine_check at ffffffff8165d78c
    [exception RIP: intel_idle+191]
    RIP: ffffffff8136d48f  RSP: ffff880211261e38  RFLAGS: 00000046
    RAX: 0000000000000020  RBX: 0000000000000008  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff880211261fd8  RDI: ffffffff81c12f00
    RBP: ffff880211261e98   R8: 00000000fffffffc   R9: 0000000000000f9f
    R10: 0000000000001e95  R11: 0000000000000000  R12: 0000000000000003
    R13: ffff88021ed5ac70  R14: 0000000000000020  R15: 12d818fb42cfe42b
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <MCE exception stack> ---
 #9 [ffff880211261e38] intel_idle at ffffffff8136d48f
#10 [ffff880211261ea0] cpuidle_idle_call at ffffffff815082d1
#11 [ffff880211261f00] cpu_idle at ffffffff8101322a

Any ideas?

Naftuli Kay over 11 years

If the overclock is the problem, I'll be able to see a clock cycle get missed in crash logs, so at the end of the day, I'll know what the problem is. That's my goal: to figure out what's going wrong. If it's my overclock, then fine, I'd just like to know what the problem is.
Scott Lamb over 11 years

I don't think overclocking failures are as obvious as that to spot in the logs; I'm not a processor expert, but it's not like the whole processor correctly handles the clock cycle or indicates to the OS somehow that it missed it. Let me know if you have trouble getting logs, but IMHO by far the easiest way to know if it's an overclocking problem is to see if it happens when not overclocking.
Naftuli Kay over 11 years

Okay, I'll do that after backing up my settings. I might first just see if I can reproduce the crash in Windows.
Naftuli Kay over 11 years

While I'm thankful to never ever encounter a BSOD in Linux, it would seem strange to me that while Windows would log and display a problem, Linux wouldn't be able to.
Scott Lamb over 11 years

One of those little quirks. :-/ There's no fundamental reason Ubuntu or RedHat couldn't set up a nice kdump-based system for crash logging and display out of the box, but no one's done it as far as I know.
Naftuli Kay over 11 years

Log results from that are pretty inconclusive: pastebin.com/VdYAHgiH
Scott Lamb over 11 years

Actually, I take that back. On Ubuntu, there is a linux-crashdump package you can install fairly easily to automatically put crashes in /var/crash. What distribution are you using?
Naftuli Kay over 11 years

I've updated the question, as I was able to crash the machine while running linux-crashdump and obtain a crash dump file which hopefully has enough information to determine the cause.
Scott Lamb over 11 years

Sweet. Updated my answer as well.
Naftuli Kay over 11 years

Thanks, I'll look into that. I've heard that this issue pops up sometimes on UEFI motherboards when booting into BIOS legacy mode, which is the case on this system. This could explain why I haven't seen the issue on Windows, as it boots EFI. I'm also running i7z as a daemon in the background and it's probably doing some devious stuff to get live processor frequencies, C-states, and other stuff. Needless to say, I've disabled that and I'll see if it crashes again.
Naftuli Kay over 11 years

Got the log! Awesome help. I've updated the original post with the output of that log. I'm finally seeing the error now, any ideas on what might be causing it?
Scott Lamb over 11 years

All it means to me is that your processor isn't working. Probably the overclocking, maybe the other thing you mentioned (it's not something I've heard about), maybe a defective unit.
user3002906 over 11 years

I would second the "Overclock is culprit" thought. MCEs mostly occur due to hardware issues. But a segmentation fault in any module code can cause the same too. Two years back, my new i7 2600k was giving me the same MCE issue, even when I was not doing anything on the computer. When I dug a little deeper, I found that the BIOS version I was using with my Intel motherboard did not properly support the then-new processor. I updated the BIOS and the problem was gone. So I would suggest checking that route too.
Naftuli Kay over 11 years

Now that I know what's failing, is there a way for me to cause the crash with a given command?
Scott Lamb over 11 years

Not sure. I'm a software guy; this is the limit of my expertise.
Naftuli Kay about 11 years

Yes, it had everything to do with overclocking. Evidently, Windows seems to be a bit more fault-tolerant with certain processor faults, if the CPU can keep going. I might start booting with mce=3 to prevent crashing, but in the past, I've simply increased the voltage each time it's crashed (which hasn't been so often). Something to note is that I'm using an offset voltage, which is generally speaking more unstable.
dma_k about 9 years

In my experience, kernel crashes rarely make it into kernel.log, as log information needs to go a pretty long way via syslog, the filesystem driver, the disk cache and the disk driver. The simplest and most elegant way is to use the netconsole kernel module.
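
A minimal sketch of what the netconsole setup can look like. The module takes one parameter of the form src-port@src-ip/device,tgt-port@tgt-ip/tgt-mac; the helper name netconsole_param, the addresses, the interface name and the port are all placeholders to adapt to your network.

```shell
#!/bin/sh
# Hypothetical helper: build the netconsole= module parameter in the form
#   <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# Arguments: $1 source IP, $2 interface, $3 target IP, $4 target MAC,
#            $5 UDP port used on both ends (defaults to 6666).
netconsole_param() {
    port=${5:-6666}
    echo "netconsole=${port}@$1/$2,${port}@$3/$4"
}

# Usage on the crashing machine (all addresses are placeholders):
#   sudo modprobe netconsole "$(netconsole_param 192.168.1.10 eth0 192.168.1.20 00:11:22:33:44:55)"
# On the receiving machine, listen for UDP on the same port, for example:
#   nc -u -l 6666        # some netcat flavours need: nc -u -l -p 6666
```

Because the messages leave over the network straight from the kernel, they skip syslog and the disk path entirely, which is what makes this approach attractive for panics.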