Troubleshooting kernel panic -- next steps after analyzing `crash` dump

Peter got here first with a comment, which I won't repeat, so do have a look at that Wikipedia article.

Here's what mcelog says on my machine:

> mcelog --ascii

This reads from stdin. I then pasted in:

TSC 1d5211c92ee8 ADDR 419801540 MISC 86
PROCESSOR 0:306c3 TIME 1390210166 SOCKET 0 APIC 2 microcode 9

(from your log). This printed:

Hardware event. This is not a software error.
CPU 0 BANK 0 TSC 1d5211c92ee8 
TIME 1390210166 Mon Jan 20 04:29:26 2014
MCG status:
MCi status:
Machine check not valid
Corrected error
MCA: No Error
STATUS 0 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 60
SOCKET 0 APIC 2 microcode 9

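However, mcelog requires permission on /dev/mem, so if it needed information from there that it can't get on my system, this may not be the whole truth. Note also that the two lines I pasted do not include the `CPU 1: Machine Check Exception: 5 Bank 1: bf80000000000124` line from your log, which is presumably why CPU, BANK, STATUS and MCGSTATUS all come out as 0 above.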
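For reference, the values that the kernel log in the question reports for this event (bank 1 status bf80000000000124, with 5 as the MCG status) can be decoded by hand. The following is only a rough sketch: the bit positions and the cache-hierarchy code layout are as I read them in the machine-check chapter of the Intel SDM, the function and table names are made up for illustration, and the S and AR bits are only architecturally meaningful on processors that support software error recovery.

```python
# Flag bits of IA32_MCi_STATUS (positions as I read them in the Intel SDM).
STATUS_FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MISC register is valid
    (58, "ADDRV"),  # ADDR register is valid
    (57, "PCC"),    # processor context corrupt
    (56, "S"),      # signaled (only with software error recovery support)
    (55, "AR"),     # action required (same caveat as S)
]

# Flag bits of IA32_MCG_STATUS.
MCG_FLAGS = [(0, "RIPV"), (1, "EIPV"), (2, "MCIP")]


def set_flags(value, table):
    """Return the names of the flags in `table` whose bit is set in `value`."""
    return [name for bit, name in table if (value >> bit) & 1]


def decode_cache_code(code):
    """Decode a cache-hierarchy MCA code (000F 0001 RRRR TTLL), else None."""
    if (code & 0xEF00) != 0x0100:
        return None  # some other error class; not handled in this sketch
    requests = ["ERR", "RD", "WR", "DRD", "DWR", "IRD",
                "PREFETCH", "EVICT", "SNOOP"]
    rrrr = (code >> 4) & 0xF
    return {
        "request": requests[rrrr] if rrrr < len(requests) else "reserved",
        "type": ["instruction", "data", "generic", "reserved"][(code >> 2) & 0x3],
        "level": ["L0", "L1", "L2", "LG"][code & 0x3],
    }


if __name__ == "__main__":
    status = 0xBF80000000000124   # "Bank 1: bf80000000000124" from the log
    mcgstatus = 0x5               # "Machine Check Exception: 5" from the log
    print("MCi flags:", set_flags(status, STATUS_FLAGS))
    print("MCG flags:", set_flags(mcgstatus, MCG_FLAGS))
    print("MCA code :", hex(status & 0xFFFF), decode_cache_code(status & 0xFFFF))
```

With these inputs the PCC bit comes out set, which agrees with the "Machine check: Processor context corrupt" line in the log. If my reading of the code layout is right, the low 16 bits fall in the cache-hierarchy class rather than the memory-controller class, but a real `mcelog --ascii` run with the full record is the better authority.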

One cause of MCEs that is easy to diagnose and fix is bad RAM. To check properly for that, you can use memtest. Memtest is a bare-metal program (i.e., it runs by itself without an operating system) that you must boot. I believe most live CDs include it -- at least, the Fedora ones I have kicking around do. It's one of the options in the boot menu (GRUB?) when the disk first loads. Some BIOSes have it as an option too.
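A side note on the TIME field in the record: it is a Unix timestamp, and as far as I know both mcelog and crash render it in the local timezone of the machine doing the decoding. That would explain why the decode above says 04:29:26 while the crash summary in the question says 19:29:26 for the same event. Here is a small sketch of the conversion; the UTC offsets of -5 and +10 hours are my inference from those two printed times and are not stated anywhere in the thread, and `mce_time` is just an illustrative name.

```python
from datetime import datetime, timedelta, timezone


def mce_time(epoch_seconds, utc_offset_hours=0):
    """Format a Unix timestamp the way the logs show it, at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz).strftime("%a %b %d %H:%M:%S %Y")


if __name__ == "__main__":
    t = 1390210166                 # "TIME 1390210166" from the log
    print(mce_time(t))             # UTC
    print(mce_time(t, -5))         # assumed offset of the machine that ran mcelog
    print(mce_time(t, 10))         # assumed offset of the machine that crashed
```

So the two dates do not contradict each other; they are the same instant seen from different timezones.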

Author: Bryce Thomas

https://www.linkedin.com/in/brycethomas/

Updated on September 18, 2022

Comments

  • Bryce Thomas almost 2 years

    I'm running an Ubuntu 12.04 machine with an Intel i7 CPU. Every now and then, it freezes up completely -- the machine and display stay on for a minute or so, the audio that was playing (if any happened to be playing) starts looping, and then it reboots. While this typically happens under moderate to high load, it also happens occasionally when nothing is running. The crash occurs regardless of whether the Intel XMP memory setting (which boosts the memory from 1333 MHz to 1600 MHz) is turned on in the BIOS. I haven't touched any of the other overclocking settings either. Here are the obvious things I've done so far:

    Observed CPU temperature

    Following the guide here, I set up temperature instrumentation on the machine and then ran watch sensors so I could get a continuous reading of the CPU's temperature. The machine freezes despite all cores operating at reasonable temperatures (~60-65 degrees).

    Observed RAM usage

    The machine has 16 GB of RAM. It freezes despite only ~ 3 GB of that being in use.

    Observed CPU Utilization

    As mentioned above, the freeze-up occurs more often under load, but it has also occurred at "idle".

    None of these obvious things pointed to the problem, so I followed the guide here to get up and running with the crash command to help me diagnose the problem.

    Here is the output of running crash on the crash dump:

    crash /usr/lib/debug/boot/vmlinux-3.2.0-58-generic ~/temp/crash2/VmCore
    
    crash 6.1.6
    Copyright (C) 2002-2013  Red Hat, Inc.
    Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
    Copyright (C) 1999-2006  Hewlett-Packard Co
    Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
    Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
    Copyright (C) 2005, 2011  NEC Corporation
    Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
    Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
    This program is free software, covered by the GNU General Public License,
    and you are welcome to change it and/or distribute copies of it under
    certain conditions.  Enter "help copying" to see the conditions.
    This program has absolutely no warranty.  Enter "help warranty" for details.
    
    GNU gdb (GDB) 7.3.1
    Copyright (C) 2011 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-unknown-linux-gnu"...
    
          KERNEL: /usr/lib/debug/boot/vmlinux-3.2.0-58-generic
        DUMPFILE: /home/bryce/temp/crash2/VmCore
            CPUS: 8
            DATE: Mon Jan 20 19:29:26 2014
          UPTIME: 02:33:18
    LOAD AVERAGE: 1.40, 0.93, 0.50
           TASKS: 579
        NODENAME: <node_name>
         RELEASE: 3.2.0-58-generic
         VERSION: #88-Ubuntu SMP Tue Dec 3 17:37:58 UTC 2013
         MACHINE: x86_64  (3499 Mhz)
          MEMORY: 16 GB
           PANIC: "[ 9180.518213] Kernel panic - not syncing: Fatal Machine check"
             PID: 0
         COMMAND: "swapper/1"
            TASK: ffff8804045e9700  (1 of 8)  [THREAD_INFO: ffff8804045e4000]
             CPU: 1
           STATE: TASK_RUNNING (PANIC)
    

    And the result of running bt at the crash prompt:

    crash> bt
    PID: 0      TASK: ffff8804045e9700  CPU: 1   COMMAND: "swapper/1"
     #0 [ffff88041ec4aba0] machine_kexec at ffffffff8103943a
     #1 [ffff88041ec4ac10] crash_kexec at ffffffff810b58d8
     #2 [ffff88041ec4ace0] panic at ffffffff8164928c
     #3 [ffff88041ec4ad60] mce_panic at ffffffff8102ab0b
     #4 [ffff88041ec4adb0] mce_panic at ffffffff8102aba0
     #5 [ffff88041ec4ade0] mce_reign at ffffffff8102ade4
     #6 [ffff88041ec4ae40] mce_end at ffffffff8102b095
     #7 [ffff88041ec4ae70] do_machine_check at ffffffff8102b84c
     #8 [ffff88041ec4af50] machine_check at ffffffff8166254c
        [exception RIP: mwait_idle_with_hints+93]
        RIP: ffffffff8103109d  RSP: ffff8804045e5e38  RFLAGS: 00000046
        RAX: 0000000000000033  RBX: ffff88040009ba60  RCX: 0000000000000001
        RDX: 0000000000000000  RSI: 0000000000000001  RDI: 0000000000000033
        RBP: ffff8804045e5e38   R8: ffff8804045e5fd8   R9: 0000000000000f85
        R10: 000000000000198f  R11: 0000000000000000  R12: 0000000000000002
        R13: ffff88040009b800  R14: ffff88040009b820  R15: 134b04c7ed0ea230
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    --- <MCE exception stack> ---
     #9 [ffff8804045e5e38] mwait_idle_with_hints at ffffffff8103109d
    #10 [ffff8804045e5e40] acpi_processor_ffh_cstate_enter at ffffffff810310e2
    #11 [ffff8804045e5e50] acpi_idle_do_entry at ffffffff81399182
    #12 [ffff8804045e5e60] acpi_idle_enter_simple at ffffffff813992d8
    #13 [ffff8804045e5ea0] cpuidle_idle_call at ffffffff8150bd61
    #14 [ffff8804045e5f00] cpu_idle at ffffffff8101322a
    

    And the result of running log at the crash prompt (I've only shown the tail of the file, with the first displayed line at time 4232.799 as earlier context - error details start on the following line):

    crash> log
    [ 4232.799853] ath: Could not kill baseband RX
    [ 9180.518188] [Hardware Error]: CPU 5: Machine Check Exception: 5 Bank 1: bf80000000000124
    [ 9180.518191] [Hardware Error]: RIP !INEXACT! 10:<ffffffff8111fc4f> {__rmqueue+0x1f/0x4b0}
    [ 9180.518196] [Hardware Error]: TSC 1d5211c92f0e ADDR 419801540 MISC 86 
    [ 9180.518199] [Hardware Error]: PROCESSOR 0:306c3 TIME 1390210166 SOCKET 0 APIC 3 microcode 9
    [ 9180.518200] [Hardware Error]: Run the above through 'mcelog --ascii'
    [ 9180.518202] [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 1: bf80000000000124
    [ 9180.518204] [Hardware Error]: RIP !INEXACT! 10:<ffffffff8103109d> {mwait_idle_with_hints+0x5d/0x70}
    [ 9180.518207] [Hardware Error]: TSC 1d5211c92ee8 ADDR 419801540 MISC 86 
    [ 9180.518209] [Hardware Error]: PROCESSOR 0:306c3 TIME 1390210166 SOCKET 0 APIC 2 microcode 9
    [ 9180.518210] [Hardware Error]: Run the above through 'mcelog --ascii'
    [ 9180.518211] [Hardware Error]: Machine check: Processor context corrupt
    [ 9180.518213] Kernel panic - not syncing: Fatal Machine check
    [ 9180.518215] Pid: 0, comm: swapper/1 Tainted: P   M       O 3.2.0-58-generic #88-Ubuntu
    [ 9180.518216] Call Trace:
    [ 9180.518217]  <#MC>  [<ffffffff81649285>] panic+0x91/0x1a4
    [ 9180.518224]  [<ffffffff8102ab0b>] mce_panic.part.14+0x18b/0x1c0
    [ 9180.518226]  [<ffffffff8102aba0>] mce_panic+0x60/0xb0
    [ 9180.518228]  [<ffffffff8102ade4>] mce_reign+0x1f4/0x200
    [ 9180.518230]  [<ffffffff8102b095>] mce_end+0xf5/0x100
    [ 9180.518232]  [<ffffffff8102b84c>] do_machine_check+0x3fc/0x600
    [ 9180.518234]  [<ffffffff8103109d>] ? mwait_idle_with_hints+0x5d/0x70
    [ 9180.518237]  [<ffffffff8166254c>] machine_check+0x1c/0x30
    [ 9180.518239]  [<ffffffff8103109d>] ? mwait_idle_with_hints+0x5d/0x70
    [ 9180.518240]  <<EOE>>  [<ffffffff810310e2>] acpi_processor_ffh_cstate_enter+0x32/0x40
    [ 9180.518244]  [<ffffffff81399182>] acpi_idle_do_entry+0x10/0x2b
    [ 9180.518246]  [<ffffffff813992d8>] acpi_idle_enter_simple+0xaa/0x115
    [ 9180.518249]  [<ffffffff8150bd61>] cpuidle_idle_call+0xc1/0x290
    [ 9180.518252]  [<ffffffff8101322a>] cpu_idle+0xca/0x120
    [ 9180.518255]  [<ffffffff8163fa12>] start_secondary+0xd9/0xdb
    

    So given this information, what is the next step in diagnosing this problem?

    • peterph over 10 years
      Next step would seem to be following the instruction in the log: Run the above through 'mcelog --ascii'. By the way, have you run memtest thoroughly (i.e. for a couple of hours)? Have you read en.wikipedia.org/wiki/Machine-check_exception ?
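
When following that instruction, it helps to feed mcelog the whole record for one CPU, including the "CPU n: Machine Check Exception" line that carries the bank and status. The sketch below pulls one CPU's record out of the `log` output and strips the timestamp and `[Hardware Error]:` prefixes. The function name and the prefix pattern are my own assumptions based on the log format shown in the question; mcelog may well accept the raw lines without this cleanup.

```python
import re

# Matches the "[ 9180.518188] [Hardware Error]: " prefix seen in the question's log.
PREFIX = re.compile(r"^\s*\[\s*\d+\.\d+\]\s*\[Hardware Error\]:\s*")


def mce_record(log_text, cpu):
    """Return the lines of the MCE record for `cpu`, with log prefixes removed."""
    record = []
    collecting = False
    for line in log_text.splitlines():
        if not PREFIX.match(line):
            continue  # not a hardware-error line
        body = PREFIX.sub("", line).rstrip()
        if body.startswith("CPU "):
            # A new record starts; collect it only if it is for the wanted CPU.
            collecting = body.startswith(f"CPU {cpu}:")
        if collecting:
            record.append(body)
            if body.startswith("PROCESSOR"):
                collecting = False  # the PROCESSOR line ends the record
    return record


SAMPLE = """\
[ 9180.518188] [Hardware Error]: CPU 5: Machine Check Exception: 5 Bank 1: bf80000000000124
[ 9180.518191] [Hardware Error]: RIP !INEXACT! 10:<ffffffff8111fc4f> {__rmqueue+0x1f/0x4b0}
[ 9180.518196] [Hardware Error]: TSC 1d5211c92f0e ADDR 419801540 MISC 86 
[ 9180.518199] [Hardware Error]: PROCESSOR 0:306c3 TIME 1390210166 SOCKET 0 APIC 3 microcode 9
[ 9180.518200] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 9180.518202] [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 1: bf80000000000124
[ 9180.518204] [Hardware Error]: RIP !INEXACT! 10:<ffffffff8103109d> {mwait_idle_with_hints+0x5d/0x70}
[ 9180.518207] [Hardware Error]: TSC 1d5211c92ee8 ADDR 419801540 MISC 86 
[ 9180.518209] [Hardware Error]: PROCESSOR 0:306c3 TIME 1390210166 SOCKET 0 APIC 2 microcode 9
[ 9180.518210] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 9180.518213] Kernel panic - not syncing: Fatal Machine check
"""

if __name__ == "__main__":
    print("\n".join(mce_record(SAMPLE, 1)))
```

The printed lines can then be pasted into (or piped to) `mcelog --ascii`.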