Troubleshooting kernel panic -- next steps after analyzing `crash` dump
Peter got here first with a comment, which I won't repeat, so do have a look at that Wikipedia article.
Here's what mcelog says for me here:
> mcelog --ascii
This reads from stdin. I then cut n' paste:
TSC 1d5211c92ee8 ADDR 419801540 MISC 86
PROCESSOR 0:306c3 TIME 1390210166 SOCKET 0 APIC 2 microcode 9
From your log. This spit out:
Hardware event. This is not a software error.
CPU 0 BANK 0 TSC 1d5211c92ee8
TIME 1390210166 Mon Jan 20 04:29:26 2014
MCG status:
MCi status:
Machine check not valid
Corrected error
MCA: No Error
STATUS 0 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 60
SOCKET 0 APIC 2 microcode 9
However, mcelog requires permission on /dev/mem, so if it needed information from there, which it can't get on my system, this may not be the whole truth.
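Note that the two lines pasted above leave out the "Machine Check Exception: 5 Bank 1: bf80000000000124" line from the log, which may be why the decode reports STATUS 0 and "Machine check not valid". As a rough cross-check, here is a minimal Python sketch that decodes only the architectural flag bits of that status word. It assumes the standard IA32_MCi_STATUS layout from the Intel SDM (bits 63 down to 57, MCA error code in bits 15:0); the function name `decode_mci_status` is made up for illustration, and the sketch does not attempt to interpret the error code itself.

```python
# Sketch only: decode the architectural flag bits of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM's architectural layout.
FLAGS = {
    63: ("VAL", "status register holds a valid error"),
    62: ("OVER", "an earlier error was overwritten"),
    61: ("UC", "error was not corrected by hardware"),
    60: ("EN", "error reporting was enabled"),
    59: ("MISCV", "MISC register is valid"),
    58: ("ADDRV", "ADDR register is valid"),
    57: ("PCC", "processor context corrupt"),
}


def decode_mci_status(status):
    """Return the set flag names plus the raw error-code fields."""
    set_flags = [
        name
        for bit, (name, _desc) in sorted(FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": set_flags,
        "mca_error_code": status & 0xFFFF,            # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # Status value taken from the "Bank 1" lines in the log.
    result = decode_mci_status(0xBF80000000000124)
    for bit, (name, desc) in sorted(FLAGS.items(), reverse=True):
        state = "set" if name in result["flags"] else "clear"
        print(f"bit {bit} {name:6s} {state:5s} {desc}")
    print(f"MCA error code: {result['mca_error_code']:#06x}")
```

For the value in the log, the PCC bit comes out set, which agrees with the kernel's own "Machine check: Processor context corrupt" line. For the full interpretation, feed mcelog the complete record including the status line.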
An easy-to-diagnose-and-fix MCE would be bad RAM. To check properly for that, you can use memtest. Memtest is a bare-metal program (i.e., it runs by itself without an operating system) that you must boot. I believe most live CDs have it on them; at least, the Fedora ones I have kicking around do. It's one of the options in the GRUB menu (?) when the disk first loads. Some BIOSes have it as an option too.
Comments
-
Bryce Thomas almost 2 years
I'm running an Ubuntu 12.04 machine with an Intel i7 CPU. Every now and then, it freezes up completely -- the machine and display stay on for a minute or so, the audio that was playing (if any happened to be playing) starts looping and then it reboots. While this typically happens under moderate to high load, it also happens occasionally when nothing is running. The crash occurs regardless of whether the Intel XMP 1333 MHz boosted to 1600 MHz memory setting is turned on in the BIOS. I haven't touched any of the other overclocking settings either. Here are the obvious things I've done so far:
Observed CPU temperature
Following the guide here, I set up temperature instrumentation on the machine and then ran `watch sensors` so I could get a continuous reading of the CPU's temperature. The machine freezes despite all cores operating at reasonable temperatures (~60-65 degrees).
Observed RAM usage
The machine has 16 GB of RAM. It freezes despite only ~ 3 GB of that being in use.
Observed CPU Utilization
As said prior, the freeze up occurs more often under load, but has also occurred at "idle".
None of these obvious things pointed to the problem, so I followed the guide here to get up and running with the `crash` command to help me diagnose the problem.
Here is the output of running `crash` on the crash dump:
crash /usr/lib/debug/boot/vmlinux-3.2.0-58-generic ~/temp/crash2/VmCore
crash 6.1.6
Copyright (C) 2002-2013 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
KERNEL: /usr/lib/debug/boot/vmlinux-3.2.0-58-generic
DUMPFILE: /home/bryce/temp/crash2/VmCore
CPUS: 8
DATE: Mon Jan 20 19:29:26 2014
UPTIME: 02:33:18
LOAD AVERAGE: 1.40, 0.93, 0.50
TASKS: 579
NODENAME: <node_name>
RELEASE: 3.2.0-58-generic
VERSION: #88-Ubuntu SMP Tue Dec 3 17:37:58 UTC 2013
MACHINE: x86_64 (3499 Mhz)
MEMORY: 16 GB
PANIC: "[ 9180.518213] Kernel panic - not syncing: Fatal Machine check"
PID: 0
COMMAND: "swapper/1"
TASK: ffff8804045e9700 (1 of 8) [THREAD_INFO: ffff8804045e4000]
CPU: 1
STATE: TASK_RUNNING (PANIC)
And the result of running `bt` at the `crash` prompt:
crash> bt
PID: 0 TASK: ffff8804045e9700 CPU: 1 COMMAND: "swapper/1"
#0 [ffff88041ec4aba0] machine_kexec at ffffffff8103943a
#1 [ffff88041ec4ac10] crash_kexec at ffffffff810b58d8
#2 [ffff88041ec4ace0] panic at ffffffff8164928c
#3 [ffff88041ec4ad60] mce_panic at ffffffff8102ab0b
#4 [ffff88041ec4adb0] mce_panic at ffffffff8102aba0
#5 [ffff88041ec4ade0] mce_reign at ffffffff8102ade4
#6 [ffff88041ec4ae40] mce_end at ffffffff8102b095
#7 [ffff88041ec4ae70] do_machine_check at ffffffff8102b84c
#8 [ffff88041ec4af50] machine_check at ffffffff8166254c
[exception RIP: mwait_idle_with_hints+93]
RIP: ffffffff8103109d RSP: ffff8804045e5e38 RFLAGS: 00000046
RAX: 0000000000000033 RBX: ffff88040009ba60 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000033
RBP: ffff8804045e5e38 R8: ffff8804045e5fd8 R9: 0000000000000f85
R10: 000000000000198f R11: 0000000000000000 R12: 0000000000000002
R13: ffff88040009b800 R14: ffff88040009b820 R15: 134b04c7ed0ea230
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <MCE exception stack> ---
#9 [ffff8804045e5e38] mwait_idle_with_hints at ffffffff8103109d
#10 [ffff8804045e5e40] acpi_processor_ffh_cstate_enter at ffffffff810310e2
#11 [ffff8804045e5e50] acpi_idle_do_entry at ffffffff81399182
#12 [ffff8804045e5e60] acpi_idle_enter_simple at ffffffff813992d8
#13 [ffff8804045e5ea0] cpuidle_idle_call at ffffffff8150bd61
#14 [ffff8804045e5f00] cpu_idle at ffffffff8101322a
And the result of running `log` at the `crash` prompt (I've only shown the tail of the file, with the first displayed line at time 4232.799 as earlier context; error details start on the following line):
crash> log
[ 4232.799853] ath: Could not kill baseband RX
[ 9180.518188] [Hardware Error]: CPU 5: Machine Check Exception: 5 Bank 1: bf80000000000124
[ 9180.518191] [Hardware Error]: RIP !INEXACT! 10:<ffffffff8111fc4f> {__rmqueue+0x1f/0x4b0}
[ 9180.518196] [Hardware Error]: TSC 1d5211c92f0e ADDR 419801540 MISC 86
[ 9180.518199] [Hardware Error]: PROCESSOR 0:306c3 TIME 1390210166 SOCKET 0 APIC 3 microcode 9
[ 9180.518200] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 9180.518202] [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 1: bf80000000000124
[ 9180.518204] [Hardware Error]: RIP !INEXACT! 10:<ffffffff8103109d> {mwait_idle_with_hints+0x5d/0x70}
[ 9180.518207] [Hardware Error]: TSC 1d5211c92ee8 ADDR 419801540 MISC 86
[ 9180.518209] [Hardware Error]: PROCESSOR 0:306c3 TIME 1390210166 SOCKET 0 APIC 2 microcode 9
[ 9180.518210] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 9180.518211] [Hardware Error]: Machine check: Processor context corrupt
[ 9180.518213] Kernel panic - not syncing: Fatal Machine check
[ 9180.518215] Pid: 0, comm: swapper/1 Tainted: P M O 3.2.0-58-generic #88-Ubuntu
[ 9180.518216] Call Trace:
[ 9180.518217] <#MC> [<ffffffff81649285>] panic+0x91/0x1a4
[ 9180.518224] [<ffffffff8102ab0b>] mce_panic.part.14+0x18b/0x1c0
[ 9180.518226] [<ffffffff8102aba0>] mce_panic+0x60/0xb0
[ 9180.518228] [<ffffffff8102ade4>] mce_reign+0x1f4/0x200
[ 9180.518230] [<ffffffff8102b095>] mce_end+0xf5/0x100
[ 9180.518232] [<ffffffff8102b84c>] do_machine_check+0x3fc/0x600
[ 9180.518234] [<ffffffff8103109d>] ? mwait_idle_with_hints+0x5d/0x70
[ 9180.518237] [<ffffffff8166254c>] machine_check+0x1c/0x30
[ 9180.518239] [<ffffffff8103109d>] ? mwait_idle_with_hints+0x5d/0x70
[ 9180.518240] <<EOE>> [<ffffffff810310e2>] acpi_processor_ffh_cstate_enter+0x32/0x40
[ 9180.518244] [<ffffffff81399182>] acpi_idle_do_entry+0x10/0x2b
[ 9180.518246] [<ffffffff813992d8>] acpi_idle_enter_simple+0xaa/0x115
[ 9180.518249] [<ffffffff8150bd61>] cpuidle_idle_call+0xc1/0x290
[ 9180.518252] [<ffffffff8101322a>] cpu_idle+0xca/0x120
[ 9180.518255] [<ffffffff8163fa12>] start_secondary+0xd9/0xdb
So given this information, what is the next step in diagnosing this problem?
-
peterph over 10 years
Next step would seem to be following the instruction in the log: Run the above through 'mcelog --ascii'. By the way, have you run memtest thoroughly (i.e. for a couple of hours)? Have you read en.wikipedia.org/wiki/Machine-check_exception ?
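To make that step less error-prone than cutting and pasting, here is a small Python sketch that pulls the `[Hardware Error]` lines out of a saved kernel log and strips the timestamp and tag, so the whole record (including the "Machine Check Exception ... Bank ..." status line) can be handed to mcelog. The helper name `mce_lines` and the file name in the usage note are hypothetical, and the sketch assumes one log entry per line in the format shown in the question.

```python
import re

# Matches the leading "[ 9180.518207] [Hardware Error]: " part of a log line.
PREFIX = re.compile(r"^\[\s*\d+\.\d+\]\s+\[Hardware Error\]:\s*")


def mce_lines(log_text):
    """Return the bare machine-check record lines from kernel log text."""
    records = []
    for raw in log_text.splitlines():
        line = raw.strip()
        match = PREFIX.match(line)
        if not match:
            continue  # not a hardware error line
        rest = line[match.end():]
        if rest.startswith("Run the above through"):
            continue  # kernel hint, not part of the record
        records.append(rest)
    return records


if __name__ == "__main__":
    import sys

    # Read a saved log from stdin and print only the record lines.
    print("\n".join(mce_lines(sys.stdin.read())))
```

Assuming the script is saved as extract_mce.py and the log as crash.log (both names are placeholders), the output could be piped straight on, e.g. `python3 extract_mce.py < crash.log | mcelog --ascii`, since mcelog reads the record from stdin.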