Debug faulty hardware

kernel debugging debug

5,356

Good on you for the instrumentation chosen, that's exactly how to run down a problem like this.

The crash utility needs the Linux debug symbols, which are about 600MB per kernel; that is why they're not installed by default. Here's how to install them and invoke crash using the symbols:

https://wiki.ubuntu.com/Kernel/CrashdumpRecipe

It's a little late for me at the moment to do an in-depth analysis of your machine check, but my initial impression is that either the cache on the CPU or main memory is compromised.

I would demand a full warranty replacement.

If that's not possible, swap out the RAM, which is an inexpensive test. If the problem persists, you can be reasonably confident that the CPU is the source, at which point I would seriously weigh the cost of replacing the CPU against the cost of a new computer.


Esokrates

Updated on September 18, 2022

Comments

Esokrates almost 2 years

My sister has a laptop that always crashed on her back when it ran Windows (blue screen), even though the hardware is relatively new and up to date. Back then she sent the Windows dump files to Dell, who sent an engineer to replace the motherboard. Still, after setting up Ubuntu in many different versions with many different kernels, the panics would not go away.

So I decided to take action to find the exact cause of the problem. I installed and configured the linux-crashdump package (kdump-tools) to automatically start a crash kernel via kexec, which generates a dump file of the memory and also stores the dmesg output. I also installed crash, the kernel-image-generic-dbgsym package and mcelog, in order to gather as much information as possible.

So the laptop crashed, and the crash kernel successfully generated a dump file and stored the dmesg output. I also checked /var/log/mcelog, but the file was completely empty even though the daemon was running before the crash, which is strange. Still, we have the dmesg output, which states:

[ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135
[ 3933.364177] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8171d9c2> {_raw_spin_lock+0x12/0x50}
[ 3933.364182] mce: [Hardware Error]: TSC a0255fbd7f7 ADDR 42dd14480 MISC d62285 
[ 3933.364185] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1398357146 SOCKET 0 APIC 1 microcode 15
[ 3933.364186] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 3933.364188] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 3: be00000000200135
[ 3933.364190] mce: [Hardware Error]: RIP !INEXACT! 33:<0000045a7992c1b5> 
[ 3933.364191] mce: [Hardware Error]: TSC a0255fbd7f0 ADDR 42dd14480 MISC d62285 
[ 3933.364194] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1398357146 SOCKET 0 APIC 0 microcode 15
[ 3933.364195] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 3933.364196] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 3933.364197] Kernel panic - not syncing: Fatal Machine check
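
Two fields in these lines can be cross-checked by hand. The following is a minimal sketch, assuming that the `5` after "Machine Check Exception:" is the MCG status value and that the number after the colon in `PROCESSOR 0:306a9` is the CPUID signature. The bit positions follow Intel's documented register layouts; the function names are made up for illustration.

```python
def decode_mcg_status(value):
    """Return the names of the flags set in the low three MCG status bits."""
    names = {0: "RIPV", 1: "EIPV", 2: "MCIP"}
    return [name for bit, name in names.items() if (value >> bit) & 1]


def decode_cpuid_signature(eax):
    """Split a CPUID signature into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # The extended fields only apply to certain base family values.
    family = base_family + ext_family if base_family == 0xF else base_family
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


print(decode_mcg_status(5))             # flags behind "Machine Check Exception: 5"
print(decode_cpuid_signature(0x306A9))  # (family, model, stepping)
```

For the values in this log the result is RIPV plus MCIP, and family 6, model 58, stepping 9.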

So my first question would be, regarding "Run the above through 'mcelog --ascii'" ... what exactly should I run there and how? I tried for example:

[ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135 | sudo mcelog --ascii

which simply returned nothing. So what am I supposed to do here?

I also ran

crash  /usr/lib/debug/boot/vmlinux<kernelversion> /path/to/crashdump/file

which started the program as expected. I typed bt to generate a backtrace, which gave me:

PID: 0      TASK: ffff8804177617f0  CPU: 6   COMMAND: "swapper/6"
 #0 [ffff88042dd89ca0] machine_kexec at ffffffff8104a732
 #1 [ffff88042dd89cf0] crash_kexec at ffffffff810e6ab3
 #2 [ffff88042dd89db8] panic at ffffffff8170ec6c
 #3 [ffff88042dd89e30] mce_panic at ffffffff8103687a
 #4 [ffff88042dd89e70] do_machine_check at ffffffff81038684
 #5 [ffff88042dd89f50] machine_check at ffffffff8171e25f
    [exception RIP: intel_idle+216]
    RIP: ffffffff813dfd78  RSP: ffff88041775de28  RFLAGS: 00000046
    RAX: 0000000000000001  RBX: 0000000000000002  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffffffff81c93220  RDI: 0000000000000006
    RBP: ffff88041775de50   R8: ffff88042dd912d0   R9: 000000000000001c
    R10: 0000000000000320  R11: 0000000000000249  R12: 0000000000000002
    R13: 0000000000000001  R14: 0000000000000001  R15: ffffffff81c932e8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <MCE exception stack> ---
 #6 [ffff88041775de28] intel_idle at ffffffff813dfd78
 #7 [ffff88041775de58] cpuidle_enter_state at ffffffff815c9570
 #8 [ffff88041775de90] cpuidle_idle_call at ffffffff815c96a9
 #9 [ffff88041775ded0] arch_cpu_idle at ffffffff8101ceae
#10 [ffff88041775dee0] cpu_startup_entry at ffffffff810beb85
#11 [ffff88041775df30] start_secondary at ffffffff81040fc8

To sum up, I would like to know how I can invoke mcelog on the dmesg output, and what other steps you would take to get as much information about the problem as possible and find the faulty component, so that I can contact the hardware vendor with an educated guess about what's wrong.

I know that Memtest86+ can help me determine with high probability that the RAM is not the cause.

EDIT: I have found out, how to pass the output to mcelog correctly: Put the output lines before "Run the above through 'mcelog --ascii'" in a file and invoke mcelog with

sudo mcelog --ascii < file

One can see that the "Run the above through 'mcelog --ascii'" message is printed twice in the dmesg output, so I invoked mcelog twice, each time with the lines beginning at "CPU" and ending on the line before the message (I left out the dmesg prefix, i.e. "[ 3933.364173] mce: [Hardware Error]:").
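
The manual preparation described here can be sketched in a few lines of Python. This is only an illustration under assumptions: the prefix pattern is modelled on the dmesg lines quoted in this post, and `split_mce_records` and `SAMPLE` are hypothetical names, not part of mcelog.

```python
import re

# dmesg prefix as seen in this post, e.g. "[ 3933.364173] mce: [Hardware Error]: "
PREFIX = re.compile(r"^\[\s*\d+\.\d+\]\s+mce: \[Hardware Error\]:\s*")


def split_mce_records(dmesg_text):
    """Return one list of prefix-stripped lines per machine check record."""
    records, current = [], []
    for line in dmesg_text.splitlines():
        if not PREFIX.match(line):
            continue  # not an mce line, e.g. the kernel panic message
        body = PREFIX.sub("", line).rstrip()
        if body.startswith("Run the above through"):
            # This message closes the record collected so far.
            if current:
                records.append(current)
            current = []
        elif body.startswith("CPU ") or current:
            current.append(body)
    return records


SAMPLE = """\
[ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135
[ 3933.364182] mce: [Hardware Error]: TSC a0255fbd7f7 ADDR 42dd14480 MISC d62285
[ 3933.364186] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 3933.364188] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 3: be00000000200135
[ 3933.364195] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 3933.364196] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 3933.364197] Kernel panic - not syncing: Fatal Machine check
"""

for number, record in enumerate(split_mce_records(SAMPLE), start=1):
    print(f"--- record {number} ---")
    print("\n".join(record))
```

Each printed record can then be saved to its own file and passed to `sudo mcelog --ascii < file` as described above.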

So mcelog tells me:

Hardware event. This is not a software error.
CPU 4 BANK 3 TSC a0255fbd7f7 
RIP !INEXACT! 10:ffffffff8171d9c2
MISC d62285 ADDR 42dd14480 
TIME 1398357146 Thu Apr 24 18:32:26 2014
MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Data CACHE Level-1 Data-Read Error
STATUS be00000000200135 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 58
RIP: _raw_spin_lock+0x12/0x50}                                                        
SOCKET 0 APIC 1 microcode 15

and

Hardware event. This is not a software error.                                                                         
CPU 0 BANK 3 TSC a0255fbd7f0 
RIP !INEXACT! 33:45a7992c1b5
MISC d62285 ADDR 42dd14480 
TIME 1398357146 Thu Apr 24 18:32:26 2014
MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Data CACHE Level-1 Data-Read Error
STATUS be00000000200135 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 58
SOCKET 0 APIC 0 microcode 15

So, assuming that the motherboard is okay (as it has been replaced) and that the RAM is okay, only the CPU is left as the troublemaker, right? Is anyone familiar with all the output given?
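
For readers wondering where mcelog gets its wording from, the STATUS value can be decoded by hand. The sketch below assumes Intel's documented layout of the MCi status register (flag bits 63 down to 57, and the compound cache-hierarchy code 000F 0001 RRRR TTLL in the low 16 bits). It covers only that one error class; the table and function names are made up for illustration.

```python
FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}
REQUEST = {0: "generic error", 1: "generic read", 2: "generic write",
           3: "data read", 4: "data write", 5: "instruction fetch",
           6: "prefetch", 7: "eviction", 8: "snoop"}
CACHE_TYPE = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_mci_status(status):
    """Decode the flag bits and, if present, a cache hierarchy error code."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    cache_error = None
    # Cache hierarchy errors have the form 000F 0001 RRRR TTLL.
    if (mca_code & 0xEF00) == 0x0100:
        request = REQUEST.get((mca_code >> 4) & 0xF, "unknown request")
        cache_type = CACHE_TYPE.get((mca_code >> 2) & 0x3, "unknown type")
        level = LEVEL[mca_code & 0x3]
        cache_error = f"{cache_type} cache, {level}, {request}"
    return {
        "flags": flags,
        "mca_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "cache_error": cache_error,
    }


result = decode_mci_status(0xBE00000000200135)
for flag in result["flags"]:
    print(flag)
print(result["cache_error"])
```

Decoded this way, the value from this crash has UC, EN, MISCV, ADDRV and PCC set, and its low 16 bits (0x0135) describe a data read from the level 1 data cache, which agrees with the mcelog text above.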

Panther about 10 years

See cyberciti.biz/tips/…
Esokrates about 10 years

bodhi.zazen: As I stated in the text: I installed mcelog and the daemon is running, but there was no output stored in /var/log/mcelog. That article does not state anything that could help me here, but thanks for your comment.
Panther about 10 years

I've not any experience with your problem, but as it seems to be a hardware problem, I am not sure there is a fix beyond replacing your hardware. You could file a bug against the linux kernel and see if the kernel developers can help.
psusi about 10 years

I'd say you have a bad cpu.
Esokrates about 10 years

psusi: is there a special reason you suspect the cpu?

Esokrates about 10 years

Hi, thanks for your answer. Is there anything additional I can provide so that you can have a quick look at it? The Linux debug symbols were installed; I invoked the crash utility with the path to the symbols and ran bt. The output is attached to the initial question. Memtest86+ passed without any errors, so it does not seem to be the memory. Is swapping the memory modules still necessary? What is really strange about this issue is that the laptop panicked 4 times on the day I asked the question and has not panicked since (it cannot be overheating), so reproducing it is not easy.
Esokrates about 10 years

Okay, a crash happened again today (under Windows, when my sister had to use a Microsoft program). What do you mean by in-depth analysis, and what steps would it involve? The issue is strange, because the motherboard has been replaced, Memtest86+ shows no errors and the Intel® Processor Diagnostic Tool passed successfully (under Windows). If it were main memory, Memtest86+ would detect that, wouldn't it? How do I check the CPU cache? One thing out of interest: could the mce=3 boot option be a possible workaround for some time? What are the risks?
ppetraki about 10 years

Yeah... tests lie. Not deliberately, but each one has a strong opinion about its "testing model". So unit testing memory may not reproduce the issue, because of other contributing factors that only running a full OS can provide. This is why the "Linux kernel build" test is so good: it demands all sorts of memory and I/O in patterns and timings that a unit test can't reproduce.
ppetraki about 10 years

I was a little terse when I said "cpu cache". What I was implying is that the silicon yield on the chip produced errors in that area, which compromised its memory. So component-wise, for troubleshooting, the memory and CPU should be swapped for known-good components.
ppetraki about 10 years

The consequence of ignoring uncorrectable machine checks is silent data corruption. I'm still of the opinion that the entire machine should be replaced under warranty. You already paid for it once; you shouldn't have to pay more to troubleshoot your own kit. Also, the cost of those two components can easily amount to 30-40% of the total purchase price.