Determining cause of Linux kernel panic
Solution 1
I have two suggestions to start.
The first you're not going to like. No matter how stable you think your overclocked system is, it would be my first suspect, and any developer you report the problem to will say the same thing. The workload you used to test stability isn't necessarily using the same instructions or stressing the memory subsystem as much as whatever is running when the crash happens. Stop overclocking. If you want people to believe the problem isn't overclocking, then make it happen when not overclocking so you can file a clean bug report. This will make a huge difference in how much effort other people will invest in solving the problem. Having bug-free software is a point of pride, but reports from people with particularly questionable hardware setups are frustrating time sinks that probably don't involve a real bug at all.
The second is to get the oops data, which as you've noticed doesn't go to any of the places you've mentioned. If the crash only happens while running X11, I think local console is pretty much out (it's a pain anyway), so you need to do this over a serial console, over the network, or by saving to local disk (which is trickier than it may sound because you don't want an untrustworthy kernel to corrupt your filesystem). Here are some ways to do this:
- use netdump to save to a server over the network. I haven't done this in years, so I'm not sure this software is still around and working with modern kernels, but it's easy enough that it's worth a shot.
- boot using a serial console; you'll need a serial port free on both machines (whether an old-school one or a USB serial adapter) and a null modem cable; you'd configure the other machine to save the output.
- kdump seems to be what the cool kids use nowadays, and seems quite flexible, although it wouldn't be my preference because it looks complex to set up. In short, it involves booting a second kernel that can inspect the crashed kernel's memory contents and do anything it likes with them, but you essentially have to build the whole process yourself and I don't see a lot of canned options out there. Update: there are some nice distro packages, actually; on Ubuntu, linux-crashdump can be installed fairly easily and automatically puts crash dumps in /var/crash.
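Whichever route you pick, much of the setup ends up on the kernel command line (`console=` for a serial console, `crashkernel=` for kdump's memory reservation). As a minimal sketch, the helper below pulls one parameter out of a command-line string so you can confirm the running kernel actually booted with it. The function name `cmdline_param` is made up for illustration; only the parameter names are standard kernel ones.

```shell
#!/bin/sh
# cmdline_param NAME CMDLINE
# Print the value of every NAME=... entry in a kernel command-line string.
# Exits non-zero if the parameter is absent. (Hypothetical helper.)
# The body runs in a subshell so "set -f" does not leak into the caller.
cmdline_param() (
    set -f                      # no glob expansion while splitting words
    name=$1
    status=1
    for word in $2; do          # unquoted on purpose: split on whitespace
        case $word in
            "$name"=*)
                printf '%s\n' "${word#*=}"
                status=0
                ;;
        esac
    done
    exit "$status"
)

# Typical use on the running system:
#   cmdline_param crashkernel "$(cat /proc/cmdline)" || echo "no crashkernel= reservation"
#   cmdline_param console "$(cat /proc/cmdline)"
```

If `crashkernel=` is missing, no memory was reserved for a capture kernel at boot, so a kdump-based dump cannot be taken on that boot.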
Once you get the debug info, there's a tool called ksymoops that you can use to turn the addresses into symbol names and start getting an idea of how your kernel crashed (on 2.6 and later kernels the oops is normally already symbolized by the kernel itself, so you may not need it). And if the symbolized dump doesn't mean anything to you, at least it is something helpful to report here or perhaps on your Linux distribution's mailing list or bug tracker.
From `crash` on your crashdump, you can try typing `log` and `bt` to get a bit more information (things logged during the panic and a stack backtrace). Your `Fatal Machine check` seems to be coming from the kernel's machine check handling code, though. From skimming that code, your processor has reported a Machine Check Exception - a hardware problem. Again, my first bet would be overclocking. It seems like there might be a more specific message in the `log` output which could tell you more.
Also from that code, it looks like if you boot with the `mce=3` kernel parameter, it will stop crashing...but I wouldn't really recommend this except as a diagnostic step. If the Linux kernel thinks this error is worth crashing over, it's probably right.
Solution 2
a) Make sure kernel messages are being logged to a file by the rsyslog daemon. Open its configuration:
vi /etc/rsyslog.conf
And add the following line:
kern.* /var/log/kernel.log
Then restart the rsyslog service:
/etc/init.d/rsyslog restart
b) Take a note of the loaded modules by saving the list to a file (the redirect target must be a file, not a directory):
`lsmod > /your/home/dir/lsmod.txt`
c) As the panic is not reproducible, wait for it to happen
d) Once the panic has occurred, boot the system using a live or emergency CD
e) Mount the filesystems of the affected system (usually / will suffice if /var and /home are not separate file systems). If the affected system uses LVM, run `pvs`, `vgs`, and `lvs` first to find the logical volume, and activate the volume group with `vgchange -ay` if the LV is not yet available. Then mount the root filesystem:
mount -t ext4 /dev/sdXN /mnt
f) Go to the `/mnt/var/log/` directory and check the `kernel.log` file. This should give you enough information to figure out whether the panic is tied to a particular module or something else.
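To avoid reading the whole file by eye, a sketch like the following pulls out the lines that usually accompany a crash. The function name `scan_kernel_log` and the pattern list are my own assumptions, a starting point rather than an exhaustive set; the path assumes the affected root filesystem is mounted on /mnt as in step e).

```shell
#!/bin/sh
# scan_kernel_log FILE
# Print, with line numbers, the lines that typically surround a kernel crash.
# (Hypothetical helper; extend the pattern list to taste.)
scan_kernel_log() {
    grep -n -E 'Kernel panic|Oops|BUG:|Call Trace|Hardware Error|Machine [Cc]heck' "$1"
}

# Example:
#   scan_kernel_log /mnt/var/log/kernel.log
```

An empty result does not prove there was no panic; the message may simply never have reached the disk before the machine went down.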
Solution 3
Is your processor overclocked? I had this same issue today when I was playing with the multiplier in the over-clocking menu in my BIOS; various multipliers around 20x would cause this to happen. I reduced it down to 18.5x (3.7GHz) and the problem went away; I think it was a motherboard/power issue.
Solution 4
This is most definitely a processor issue. Notice the lines that say: TSC 539b174dead ADDR 3fe98d264ebd MISC 1 [ 1561.519950] [Hardware Error]: PROCESSOR 0:206a7 TIME 1357862746 SOCKET 0 APIC 1 microcode 28. In the PROCESSOR field, 0:206a7 is the CPU vendor and CPUID signature rather than a core number (the reporting core is the "CPU 4" on the Machine Check Exception line), and SOCKET 0 is the package that reported the error (though I assume you only have one). Either the CPU is bad or, as you noted, the overclock is causing the faults. I know you said you ran it through Prime95, but since I don't know how old the system is I am grasping at a few straws: how does your thermal paste look, and have you checked that the LGA socket (under the CPU) looks all right? I am thinking of bent pins or some paste in the socket. Again, I'm just trying to find the root cause here.
If that fails to fix the issue, there is a little trick: mcelog can use your SMBIOS (DMI) data to narrow down where the error hit. Take another line from the log (TSC 539b174de9d ADDR 3fe98d264ebd MISC 1) and, once your machine is up, run this on the command line: echo "TSC 539b174de9d ADDR 3fe98d264ebd MISC 1" | sudo mcelog --ascii --dmi. The output will tell you it is a hardware error and may even name the DIMM that was involved, which can point to a faulty DIMM or bus path. If the reported DIMM jumps around with every crash, however, that points to the CPU.
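For a first look before running mcelog, the bank status word from the "Machine Check Exception: ... Bank 3: be00000000800400" line can be decoded by hand. The sketch below checks the architectural flag bits of an IA32_MCi_STATUS value as I understand them from Intel's Software Developer's Manual (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57) and prints the low 16-bit MCA error code. The function name `decode_mci_status` is hypothetical, and this is no substitute for mcelog's model-specific decoding.

```shell
#!/bin/sh
# decode_mci_status HEX
# Decode the architectural flag bits and the MCA error code of an
# IA32_MCi_STATUS value given as hex digits without a 0x prefix.
# Only the top byte and the low 16 bits are converted, so no 64-bit
# shell arithmetic is needed. Variables are global (plain POSIX sh).
decode_mci_status() {
    # Left-pad to 16 hex digits so character positions match bit positions.
    hex=$(printf '%16s' "$1" | tr ' ' '0')
    top=$(( 0x$(printf '%s' "$hex" | cut -c1-2) ))      # bits 63..56
    code=$(( 0x$(printf '%s' "$hex" | cut -c13-16) ))   # bits 15..0
    flags=
    # Each pair is <bit within the top byte>:<flag name>.
    for pair in 7:VAL 6:OVER 5:UC 4:EN 3:MISCV 2:ADDRV 1:PCC; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( (top >> bit) & 1 )) -eq 1 ]; then
            flags="$flags $name"
        fi
    done
    printf 'flags:%s\n' "$flags"
    printf 'mca_error_code=0x%04x\n' "$code"
}

decode_mci_status be00000000800400
# flags: VAL UC EN MISCV ADDRV PCC
# mca_error_code=0x0400
```

For the value in this log, UC (uncorrected) and PCC are both set; PCC is the bit behind the "Processor context corrupt" message, which is why the kernel panics instead of carrying on. As far as I can tell from Intel's error code tables, 0x0400 is an internal timer error, but treat that reading as an assumption and let mcelog confirm it.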
Naftuli Kay
Updated on September 18, 2022
Comments
-
Naftuli Kay almost 2 years
I'm running an Ubuntu 12.04 derivative (amd64) and I've been having really strange issues recently. Out of the blue, seemingly, X will freeze completely for a while (1-3 minutes?) and then the system will reboot. This system is overclocked, but very stable as verified in Windows, which leads me to believe I'm having a kernel panic or an issue with one of my modules. Even in Linux, I can run LINPACK and won't see a crash despite putting ridiculous load on the CPU. Crashes seem to happen at random times, even when the machine is sitting idle.
How can I debug what's crashing the system?
On a hunch that it might be the proprietary NVIDIA driver, I reverted all the way down to the stable version of the driver, version 304 and I still experience the crash.
Can anyone walk me through a good debugging procedure for after a crash? I'd be more than happy to boot into a thumb drive and post all of my post-crash configuration files, I'm just not sure what they would be. How can I find out what's crashing my system?
Here are a bunch of logs, the usual culprits.
.xsession-errors: http://pastebin.com/EEDtVkVm
/var/log/Xorg.0.log: http://pastebin.com/ftsG5VAn
/var/log/kern.log: http://pastebin.com/Hsy7jcHZ
/var/log/syslog: http://pastebin.com/9Fkp3FMz
I can't even seem to find a record of the crash at all.
Triggering the crash is not so simple; it seems to happen when the GPU is trying to draw multiple things at once. If I put on a YouTube video in full screen and let it repeat for a while, or scroll through a ton of GIFs while a Skype notification pops up, sometimes it'll crash. Totally scratching my head on this one.
The CPU is overclocked to 4.8GHz, but it's completely stable and has survived huge LINPACK runs and 9 hours of Prime95 yesterday without a single crash.
Update
I've installed `kdump`, `crash`, and `linux-crashdump`, as well as the kernel debug symbols for my kernel version 3.2.0-35. When I run `apport-unpack` on the crashed kernel file and then `crash` on the `VmCore` crash dump, here's what I see:
KERNEL: /usr/lib/debug/boot/vmlinux-3.2.0-35-generic
DUMPFILE: Downloads/crash/VmCore
CPUS: 8
DATE: Thu Jan 10 16:05:55 2013
UPTIME: 00:26:04
LOAD AVERAGE: 2.20, 0.84, 0.49
TASKS: 614
NODENAME: mightymoose
RELEASE: 3.2.0-35-generic
VERSION: #55-Ubuntu SMP Wed Dec 5 17:42:16 UTC 2012
MACHINE: x86_64 (3499 Mhz)
MEMORY: 8 GB
PANIC: "[ 1561.519960] Kernel panic - not syncing: Fatal Machine check"
PID: 0
COMMAND: "swapper/5"
TASK: ffff880211251700 (1 of 8) [THREAD_INFO: ffff880211260000]
CPU: 5
STATE: TASK_RUNNING (PANIC)
When I run `log` from the `crash` utility, I see this at the bottom of the log:
[ 1561.519943] [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000800400
[ 1561.519946] [Hardware Error]: RIP !INEXACT! 33:<00007fe99ae93e54>
[ 1561.519948] [Hardware Error]: TSC 539b174dead ADDR 3fe98d264ebd MISC 1
[ 1561.519950] [Hardware Error]: PROCESSOR 0:206a7 TIME 1357862746 SOCKET 0 APIC 1 microcode 28
[ 1561.519951] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1561.519953] [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 3: be00000000800400
[ 1561.519955] [Hardware Error]: TSC 539b174de9d ADDR 3fe98d264ebd MISC 1
[ 1561.519957] [Hardware Error]: PROCESSOR 0:206a7 TIME 1357862746 SOCKET 0 APIC 0 microcode 28
[ 1561.519958] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1561.519959] [Hardware Error]: Machine check: Processor context corrupt
[ 1561.519960] Kernel panic - not syncing: Fatal Machine check
[ 1561.519962] Pid: 0, comm: swapper/5 Tainted: P M C O 3.2.0-35-generic #55-Ubuntu
[ 1561.519963] Call Trace:
[ 1561.519964] <#MC> [<ffffffff81644340>] panic+0x91/0x1a4
[ 1561.519971] [<ffffffff8102abeb>] mce_panic.part.14+0x18b/0x1c0
[ 1561.519973] [<ffffffff8102ac80>] mce_panic+0x60/0xb0
[ 1561.519975] [<ffffffff8102aec4>] mce_reign+0x1f4/0x200
[ 1561.519977] [<ffffffff8102b175>] mce_end+0xf5/0x100
[ 1561.519979] [<ffffffff8102b92c>] do_machine_check+0x3fc/0x600
[ 1561.519982] [<ffffffff8136d48f>] ? intel_idle+0xbf/0x150
[ 1561.519984] [<ffffffff8165d78c>] machine_check+0x1c/0x30
[ 1561.519986] [<ffffffff8136d48f>] ? intel_idle+0xbf/0x150
[ 1561.519987] <<EOE>> [<ffffffff81509697>] ? menu_select+0xe7/0x2c0
[ 1561.519991] [<ffffffff815082d1>] cpuidle_idle_call+0xc1/0x280
[ 1561.519994] [<ffffffff8101322a>] cpu_idle+0xca/0x120
[ 1561.519996] [<ffffffff8163aa9a>] start_secondary+0xd9/0xdb
`bt` outputs the backtrace:
PID: 0 TASK: ffff880211251700 CPU: 5 COMMAND: "swapper/5"
#0 [ffff88021ed4aba0] machine_kexec at ffffffff8103947a
#1 [ffff88021ed4ac10] crash_kexec at ffffffff810b52c8
#2 [ffff88021ed4ace0] panic at ffffffff81644347
#3 [ffff88021ed4ad60] mce_panic.part.14 at ffffffff8102abeb
#4 [ffff88021ed4adb0] mce_panic at ffffffff8102ac80
#5 [ffff88021ed4ade0] mce_reign at ffffffff8102aec4
#6 [ffff88021ed4ae40] mce_end at ffffffff8102b175
#7 [ffff88021ed4ae70] do_machine_check at ffffffff8102b92c
#8 [ffff88021ed4af50] machine_check at ffffffff8165d78c
[exception RIP: intel_idle+191]
RIP: ffffffff8136d48f RSP: ffff880211261e38 RFLAGS: 00000046
RAX: 0000000000000020 RBX: 0000000000000008 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff880211261fd8 RDI: ffffffff81c12f00
RBP: ffff880211261e98 R8: 00000000fffffffc R9: 0000000000000f9f
R10: 0000000000001e95 R11: 0000000000000000 R12: 0000000000000003
R13: ffff88021ed5ac70 R14: 0000000000000020 R15: 12d818fb42cfe42b
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <MCE exception stack> ---
#9 [ffff880211261e38] intel_idle at ffffffff8136d48f
#10 [ffff880211261ea0] cpuidle_idle_call at ffffffff815082d1
#11 [ffff880211261f00] cpu_idle at ffffffff8101322a
Any ideas?
-
Naftuli Kay over 11 years
If the overclock is the problem, I'll be able to see a clock cycle get missed in crash logs, so at the end of the day, I'll know what the problem is. That's my goal: to figure out what's going wrong. If it's my overclock, then fine, I'd just like to know what the problem is.
-
Scott Lamb over 11 years
I don't think overclocking failures are as obvious as that to spot in the logs; I'm not a processor expert, but it's not like the whole processor correctly handles the clock cycle or indicates to the OS somehow that it missed it. Let me know if you have trouble getting logs, but IMHO by far the easiest way to know if it's an overclocking problem is to see if it happens when not overclocking.
-
Naftuli Kay over 11 years
Okay, I'll do that after backing up my settings. I might first just see if I can reproduce the crash in Windows.
-
Naftuli Kay over 11 years
While I'm thankful to never ever encounter a BSOD in Linux, it would seem strange to me that while Windows would log and display a problem, Linux wouldn't be able to.
-
Scott Lamb over 11 years
One of those little quirks. :-/ There's no fundamental reason Ubuntu or RedHat couldn't set up a nice kdump-based system for crash logging and display out of the box, but no one's done it as far as I know.
-
Naftuli Kay over 11 years
Log results from that are pretty inconclusive: pastebin.com/VdYAHgiH
-
Scott Lamb over 11 years
Actually, I take that back. On Ubuntu, there is a `linux-crashdump` package you can install fairly easily to automatically put crashes in `/var/crash`. What distribution are you using?
-
Naftuli Kay over 11 years
I've updated the question, as I was able to crash the machine while running `linux-crashdump` and obtain a crash dump file which hopefully has enough information to determine the cause.
-
Scott Lamb over 11 years
Sweet. Updated my answer as well.
-
Naftuli Kay over 11 years
Thanks, I'll look into that. I've heard that this issue pops up sometimes on UEFI motherboards when booting into BIOS legacy mode, which is the case on this system. This could explain why I haven't seen the issue on Windows, as it boots EFI. I'm also running `i7z` as a daemon in the background and it's probably doing some devious stuff to get live processor frequencies, C-states, and other stuff. Needless to say, I've disabled that and I'll see if it crashes again.
-
Naftuli Kay over 11 years
Got the log! Awesome help. I've updated the original post with the output of that log. I'm finally seeing the error now, any ideas on what might be causing it?
-
Scott Lamb over 11 years
All it means to me is that your processor isn't working. Probably the overclocking, maybe the other thing you mentioned (it's not something I've heard about), maybe a defective unit.
-
user3002906 over 11 years
I would second the "overclock is the culprit" thought. MCEs mostly occur due to hardware issues. But a segmentation fault in any module code can cause the same thing too. Two years back, my new i7 2600k was giving me the same MCE issue, even when I was not doing anything on the computer. When I dug a little deeper, I found that the BIOS version I was using with my Intel motherboard did not properly support the then-new processor. I updated the BIOS and the problem was gone. So I would suggest you check that route too.
-
Naftuli Kay over 11 years
Now that I know what's failing, is there a way for me to cause the crash with a given command?
-
Scott Lamb over 11 years
Not sure. I'm a software guy; this is the limit of my expertise.
-
Naftuli Kay about 11 years
Yes, it had everything to do with overclocking. Evidently, Windows seems to be a bit more fault-tolerant with certain processor faults, if the CPU can keep going. I might start booting with `mce=3` to prevent crashing, but in the past, I've simply increased the voltage each time it's crashed (which hasn't been so often). Something to note is that I'm using an offset voltage, which is generally speaking more unstable.
-
dma_k about 9 years
In my experience, kernel crashes rarely make it into `kernel.log`, as the log information has to travel a pretty long way via syslog, the filesystem driver, the disk cache and the disk driver. The simplest and most elegant way is to use the `netconsole` kernel module.
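A minimal sketch of a netconsole setup, assuming the module's documented parameter format `netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]`. Every address, port and interface name below is a placeholder to replace with your own, and `netconsole_arg` is a hypothetical helper that only assembles the string.

```shell
#!/bin/sh
# netconsole_arg SRC_PORT SRC_IP DEV TGT_PORT TGT_IP [TGT_MAC]
# Build the netconsole= module parameter from its parts.
# (Hypothetical helper; it does not load or configure anything itself.)
netconsole_arg() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "${6:-}"
}

# On the crashing machine, as root (placeholder values throughout):
#   modprobe netconsole "$(netconsole_arg 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55)"
#   dmesg -n 8      # raise the console log level so all messages are sent
#
# On the receiving machine, listen for the UDP packets, for example with
# netcat (the exact flags differ between netcat variants):
#   nc -u -l 6666
```

Because the messages leave the machine as UDP packets straight from the kernel, they don't depend on syslog or the disk being usable at the moment of the panic, which is the point made above.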