Debug faulty hardware


Good on you for the instrumentation you chose; that's exactly how to run down a problem like this.

The crash utility needs the Linux kernel debug symbols, which are about 600MB per kernel; that is why they're not installed by default. Here's how to install the symbols and invoke crash with them.

https://wiki.ubuntu.com/Kernel/CrashdumpRecipe

It's a little late for me at the moment to do an in-depth analysis of your machine check, but my initial impression is that either the cache on the CPU or main memory is compromised.

I would demand a full warranty replacement.

If that's not possible, swap out the RAM, which is an inexpensive test; if the problem persists, you can be reasonably confident that the CPU is the source. At that point I would seriously weigh the cost of replacing the CPU against the cost of a new computer.



Author: Esokrates

Updated on September 18, 2022

Comments

  • Esokrates
    Esokrates almost 2 years

    My sister has a laptop that kept crashing on her back when it ran Windows (blue screens), even though the hardware is relatively new and up to date. Back then she sent the Windows dump files to Dell, who sent an engineer to replace the motherboard. Still, after installing many different Ubuntu versions with many different kernels, the panics would not go away.

    So I decided to find the exact cause of the problem. I installed and configured the linux-crashdump package (kdump-tools) to automatically start a crash kernel using kexec, which generates a dump file of the memory and also stores the dmesg output. I also installed crash, the kernel-image-generic-dbgsym package, and mcelog in order to gather as much information as possible.

    The laptop then crashed, and the crash kernel successfully generated a dump file and stored the dmesg output. I also checked /var/log/mcelog, but the file was completely empty even though the daemon was running before the crash, which is strange. Still, we have the dmesg output, which states:

    [ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135
    [ 3933.364177] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8171d9c2> {_raw_spin_lock+0x12/0x50}
    [ 3933.364182] mce: [Hardware Error]: TSC a0255fbd7f7 ADDR 42dd14480 MISC d62285 
    [ 3933.364185] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1398357146 SOCKET 0 APIC 1 microcode 15
    [ 3933.364186] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
    [ 3933.364188] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 3: be00000000200135
    [ 3933.364190] mce: [Hardware Error]: RIP !INEXACT! 33:<0000045a7992c1b5> 
    [ 3933.364191] mce: [Hardware Error]: TSC a0255fbd7f0 ADDR 42dd14480 MISC d62285 
    [ 3933.364194] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1398357146 SOCKET 0 APIC 0 microcode 15
    [ 3933.364195] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
    [ 3933.364196] mce: [Hardware Error]: Machine check: Processor context corrupt
    [ 3933.364197] Kernel panic - not syncing: Fatal Machine check
    

    So my first question is about "Run the above through 'mcelog --ascii'": what exactly should I run there, and how? I tried, for example:

    [ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135 | sudo mcelog --ascii
    

    which simply returned nothing. So what am I supposed to do here?

    I also ran

    crash  /usr/lib/debug/boot/vmlinux<kernelversion> /path/to/crashdump/file
    

    which started the program as expected. I typed bt to generate a backtrace, which gave me:

    PID: 0      TASK: ffff8804177617f0  CPU: 6   COMMAND: "swapper/6"
     #0 [ffff88042dd89ca0] machine_kexec at ffffffff8104a732
     #1 [ffff88042dd89cf0] crash_kexec at ffffffff810e6ab3
     #2 [ffff88042dd89db8] panic at ffffffff8170ec6c
     #3 [ffff88042dd89e30] mce_panic at ffffffff8103687a
     #4 [ffff88042dd89e70] do_machine_check at ffffffff81038684
     #5 [ffff88042dd89f50] machine_check at ffffffff8171e25f
        [exception RIP: intel_idle+216]
        RIP: ffffffff813dfd78  RSP: ffff88041775de28  RFLAGS: 00000046
        RAX: 0000000000000001  RBX: 0000000000000002  RCX: 0000000000000001
        RDX: 0000000000000000  RSI: ffffffff81c93220  RDI: 0000000000000006
        RBP: ffff88041775de50   R8: ffff88042dd912d0   R9: 000000000000001c
        R10: 0000000000000320  R11: 0000000000000249  R12: 0000000000000002
        R13: 0000000000000001  R14: 0000000000000001  R15: ffffffff81c932e8
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    --- <MCE exception stack> ---
     #6 [ffff88041775de28] intel_idle at ffffffff813dfd78
     #7 [ffff88041775de58] cpuidle_enter_state at ffffffff815c9570
     #8 [ffff88041775de90] cpuidle_idle_call at ffffffff815c96a9
     #9 [ffff88041775ded0] arch_cpu_idle at ffffffff8101ceae
    #10 [ffff88041775dee0] cpu_startup_entry at ffffffff810beb85
    #11 [ffff88041775df30] start_secondary at ffffffff81040fc8
    

    To sum up, I would like to know how I can invoke mcelog on the dmesg output, and what other steps you would take to get as much information about the problem as possible and find the faulty component, so that I can contact the hardware vendor with an educated guess about what's wrong.

    I know that a memory test (Memtest86+) can help me determine with high probability that the RAM is not the cause.

    EDIT: I have found out how to pass the output to mcelog correctly: put the output lines that precede "Run the above through 'mcelog --ascii'" in a file and invoke mcelog with

    sudo mcelog --ascii < file 
    

    The "Run the above through 'mcelog --ascii'" message is printed twice in the dmesg output, so I invoked mcelog twice, each time with the lines starting at "CPU" and ending at the line before the message (I left out the dmesg prefix, i.e. "[ 3933.364173] mce: [Hardware Error]:").
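The manual steps described in the EDIT (drop the dmesg prefix, cut one block per "Run the above through 'mcelog --ascii'" marker) could be scripted roughly as follows. This is only a sketch: the function name `split_mce_records` is made up for illustration, and the prefix pattern assumes the dmesg line format shown in this question.

```python
import re

# Matches the "[ 3933.364173] mce: [Hardware Error]: " prefix seen in the dmesg
# output above (assumption: other kernels may format this prefix differently).
PREFIX = re.compile(r"^\s*\[\s*\d+\.\d+\]\s*mce: \[Hardware Error\]:\s*")
MARKER = "Run the above through 'mcelog --ascii'"


def split_mce_records(dmesg_text):
    """Return one text block per machine check, with the dmesg prefix removed.

    Each block holds the lines that precede one MARKER line, which is the
    input the EDIT above fed to `mcelog --ascii`.
    """
    records, current = [], []
    for line in dmesg_text.splitlines():
        if not PREFIX.match(line):
            continue  # not an MCE line (e.g. the final "Kernel panic" line)
        body = PREFIX.sub("", line).rstrip()
        if body.startswith(MARKER):
            if current:
                records.append("\n".join(current) + "\n")
            current = []
        else:
            current.append(body)
    # Lines after the last marker (the summary line) are intentionally dropped.
    return records
```

Each returned block can then be written to its own file and passed to mcelog exactly as in the EDIT, one invocation per block.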

    So mcelog tells me:

    Hardware event. This is not a software error.
    CPU 4 BANK 3 TSC a0255fbd7f7 
    RIP !INEXACT! 10:ffffffff8171d9c2
    MISC d62285 ADDR 42dd14480 
    TIME 1398357146 Thu Apr 24 18:32:26 2014
    MCG status:RIPV MCIP 
    MCi status:
    Uncorrected error
    Error enabled
    MCi_MISC register valid
    MCi_ADDR register valid
    Processor context corrupt
    MCA: Data CACHE Level-1 Data-Read Error
    STATUS be00000000200135 MCGSTATUS 5
    CPUID Vendor Intel Family 6 Model 58
    RIP: _raw_spin_lock+0x12/0x50}                                                        
    SOCKET 0 APIC 1 microcode 15 
    

    and

    Hardware event. This is not a software error.                                                                         
    CPU 0 BANK 3 TSC a0255fbd7f0 
    RIP !INEXACT! 33:45a7992c1b5
    MISC d62285 ADDR 42dd14480 
    TIME 1398357146 Thu Apr 24 18:32:26 2014
    MCG status:RIPV MCIP 
    MCi status:
    Uncorrected error
    Error enabled
    MCi_MISC register valid
    MCi_ADDR register valid
    Processor context corrupt
    MCA: Data CACHE Level-1 Data-Read Error
    STATUS be00000000200135 MCGSTATUS 5
    CPUID Vendor Intel Family 6 Model 58
    SOCKET 0 APIC 0 microcode 15
    

    So, assuming that the motherboard is okay (as it has been replaced) and the RAM is okay, only the CPU is left as the troublemaker, right? Is anyone familiar with all the output given?
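As a cross-check on the mcelog text, the raw STATUS and MCGSTATUS values can be decoded by hand. The sketch below assumes the architectural machine-check register layout from the Intel Software Developer's Manual (flag bits 57 to 63 of MCi_STATUS, MCA error code in bits 0 to 15, and the "cache hierarchy" compound code 000F 0001 RRRR TTLL). The function names are made up for illustration, and only the fields needed for this dump are handled.

```python
# Flag bits of IA32_MCi_STATUS (assumed per the Intel SDM architectural layout).
STATUS_FLAGS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
                "MISCV": 59, "ADDRV": 58, "PCC": 57}
# Flag bits of IA32_MCG_STATUS.
MCG_FLAGS = {"RIPV": 0, "EIPV": 1, "MCIP": 2}

REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "I", 1: "D", 2: "G"}          # instruction, data, generic
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}    # LG = generic level


def set_bits(value, table):
    """Names of all flags in `table` whose bit is set in `value`."""
    return [name for name, bit in table.items() if (value >> bit) & 1]


def decode_mci_status(status):
    """Split MCi_STATUS into (flags, MCA error code, model-specific code)."""
    return (set_bits(status, STATUS_FLAGS),
            status & 0xFFFF,
            (status >> 16) & 0xFFFF)


def decode_cache_error(mca_code):
    """Decode a cache hierarchy code (000F 0001 RRRR TTLL), else None."""
    if (mca_code & 0xEF00) != 0x0100:   # ignore the F (filtering) bit 12
        return None
    return (REQUEST.get((mca_code >> 4) & 0xF, "?"),
            TRANSACTION.get((mca_code >> 2) & 0x3, "?"),
            LEVEL[mca_code & 0x3])


flags, code, model_specific = decode_mci_status(0xBE00000000200135)
print(flags)                      # ['VAL', 'UC', 'EN', 'MISCV', 'ADDRV', 'PCC']
print(decode_cache_error(code))   # ('DRD', 'D', 'L1')
print(set_bits(0x5, MCG_FLAGS))   # ['RIPV', 'MCIP']
```

Under these assumptions the result agrees with mcelog: an uncorrected (UC), enabled error with valid MISC and ADDR registers, processor context corrupt (PCC), and a data-read (DRD) error in the level-1 data cache.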

    • Panther
      Panther about 10 years
    • Esokrates
      Esokrates about 10 years
      bodhi.zazen: As I stated in the text: I installed mcelog and the daemon is running, but there was no output stored in /var/log/mcelog. That article does not state anything that could help me here, but thanks for your comment.
    • Panther
      Panther about 10 years
      I've not any experience with your problem, but as it seems to be a hardware problem, I am not sure there is a fix beyond replacing your hardware. You could file a bug against the linux kernel and see if the kernel developers can help.
    • psusi
      psusi about 10 years
      I'd say you have a bad cpu.
    • Esokrates
      Esokrates about 10 years
      psusi: is there a special reason you suspect the cpu?
  • Esokrates
    Esokrates about 10 years
    Hi, thanks for your answer. Is there anything additional I can provide so you can have a quick look at it? The Linux debug symbols were installed; I invoked the crash utility with the path to the symbols and ran bt. The output is attached in the initial question. Memtest86+ passed without any errors, so it does not seem to be the memory. Is swapping the memory modules still necessary? What is really strange about this issue is that the laptop has not panicked since I asked the question, yet it panicked 4 times on the day I asked it (it cannot be overheating), so reproducing it is not easy.
  • Esokrates
    Esokrates about 10 years
    Okay, a crash happened again today (under Windows, when my sister had to use a Microsoft program). What do you mean by in-depth analysis, and what steps would it involve? The issue is strange, because the motherboard has been replaced, Memtest86+ shows no errors, and the Intel® Processor Diagnostic Tool passed successfully (under Windows). If it were main memory, Memtest86+ would detect that, wouldn't it? How do I check the CPU cache? One more thing, out of interest: could the mce=3 boot option be a possible workaround for some time? What are the risks?
  • ppetraki
    ppetraki about 10 years
    Yeah... tests lie. Well, not deliberately, but each test has a strong opinion of its "testing model". So unit testing memory may not reproduce the issue because of other contributing factors that only running a full OS can provide. This is why the "Linux kernel build" test is so good: it demands all sorts of memory and I/O in patterns and timing a unit test can't reproduce.
  • ppetraki
    ppetraki about 10 years
    I was a little terse when I said "CPU cache". What I was implying is that the silicon yield on the chip produced errors in that area, which compromised its memory. So, component-wise, for troubleshooting the memory and CPU should be swapped for known good components.
  • ppetraki
    ppetraki about 10 years
    The consequence of ignoring uncorrectable machine checks is silent data corruption. I'm still of the opinion that the entire machine should be replaced under warranty. You already paid for it once; you shouldn't have to pay more to troubleshoot your own kit. Also, the costs of those two components can easily amount to 30-40% of the total purchase price.