Debug faulty hardware
Good on you for the instrumentation you chose; that's exactly how to run down a problem like this.
Analyzing the crash dump needs the Linux debug symbols, which are about 600MB per kernel, which is why they're not installed by default. Here's how to install the symbols and invoke crash with them:
https://wiki.ubuntu.com/Kernel/CrashdumpRecipe
It's a little late for me at the moment to do an in-depth analysis of your machine check, but my initial impression is that either the cache on the CPU or main memory is compromised.
I would demand a full warranty replacement.
If that's not possible, swap out the RAM, which is an inexpensive test; if the problem persists, you can be reasonably confident that the CPU is the source. At that point I would seriously weigh the cost of replacing the CPU against the cost of a new computer.
Esokrates
Updated on September 18, 2022
Comments
-
Esokrates almost 2 years
My sister has a laptop that has been crashing on her ever since she ran Windows on it (blue screens), even though the hardware is relatively new and up to date. Back then she sent the Windows dump files to Dell, who sent an engineer to replace the motherboard. Still, after installing many different Ubuntu versions with many different kernels, the panics would not go away.
So I decided to find the exact cause of the problem. I installed and configured the linux-crashdump package (kdump-tools) to automatically start a crash kernel using kexec, which generates a dump file of the memory and also stores the dmesg output. I also installed crash, kernel-image-generic-dbgsym and mcelog so that I could gather as much information as possible.
So the laptop crashed, and the crash kernel successfully generated a dump file and stored the dmesg output. I also checked /var/log/mcelog, but the file was completely empty even though the daemon was running before the crash, which is strange. Still, we have the dmesg output, which states:
[ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135
[ 3933.364177] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8171d9c2> {_raw_spin_lock+0x12/0x50}
[ 3933.364182] mce: [Hardware Error]: TSC a0255fbd7f7 ADDR 42dd14480 MISC d62285
[ 3933.364185] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1398357146 SOCKET 0 APIC 1 microcode 15
[ 3933.364186] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 3933.364188] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 3: be00000000200135
[ 3933.364190] mce: [Hardware Error]: RIP !INEXACT! 33:<0000045a7992c1b5>
[ 3933.364191] mce: [Hardware Error]: TSC a0255fbd7f0 ADDR 42dd14480 MISC d62285
[ 3933.364194] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1398357146 SOCKET 0 APIC 0 microcode 15
[ 3933.364195] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 3933.364196] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 3933.364197] Kernel panic - not syncing: Fatal Machine check
So my first question would be, regarding "Run the above through 'mcelog --ascii'" ... what exactly should I run there and how? I tried for example:
[ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135 | sudo mcelog --ascii
which simply returned nothing. So what am I supposed to do here?
I also ran
crash /usr/lib/debug/boot/vmlinux<kernelversion> /path/to/crashdump/file
which started the program, as expected and I typed in
bt
to generate a backtrace, which gave me:
PID: 0 TASK: ffff8804177617f0 CPU: 6 COMMAND: "swapper/6"
#0 [ffff88042dd89ca0] machine_kexec at ffffffff8104a732
#1 [ffff88042dd89cf0] crash_kexec at ffffffff810e6ab3
#2 [ffff88042dd89db8] panic at ffffffff8170ec6c
#3 [ffff88042dd89e30] mce_panic at ffffffff8103687a
#4 [ffff88042dd89e70] do_machine_check at ffffffff81038684
#5 [ffff88042dd89f50] machine_check at ffffffff8171e25f
[exception RIP: intel_idle+216]
RIP: ffffffff813dfd78 RSP: ffff88041775de28 RFLAGS: 00000046
RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffff81c93220 RDI: 0000000000000006
RBP: ffff88041775de50 R8: ffff88042dd912d0 R9: 000000000000001c
R10: 0000000000000320 R11: 0000000000000249 R12: 0000000000000002
R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff81c932e8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <MCE exception stack> ---
#6 [ffff88041775de28] intel_idle at ffffffff813dfd78
#7 [ffff88041775de58] cpuidle_enter_state at ffffffff815c9570
#8 [ffff88041775de90] cpuidle_idle_call at ffffffff815c96a9
#9 [ffff88041775ded0] arch_cpu_idle at ffffffff8101ceae
#10 [ffff88041775dee0] cpu_startup_entry at ffffffff810beb85
#11 [ffff88041775df30] start_secondary at ffffffff81040fc8
To sum up, I would like to know how I can invoke mcelog on the dmesg output, and what other steps you would take to get as much information about the problem as possible and find the faulty component, so that I can contact the hardware vendor with an educated guess about what's wrong. I know that Memtest86+ can help me establish with high probability that the RAM is not the cause.
EDIT: I have found out how to pass the output to mcelog correctly: put the output lines that precede "Run the above through 'mcelog --ascii'" in a file and invoke mcelog with
sudo mcelog --ascii < file
The "Run the above through 'mcelog --ascii'" message is printed twice in the dmesg output, so I invoked mcelog twice, each time with the lines starting at "CPU" and ending on the line before the message (I left out the dmesg prefix, such as "[ 3933.364173] mce: [Hardware Error]:").
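The manual preparation described above could be scripted. The following is only a minimal sketch: the function name split_mce_records and the prefix pattern are my own assumptions based on the dmesg lines shown in this question, not part of mcelog, and other kernels may format the prefix differently.

```python
import re

# Assumed dmesg prefix, e.g. "[ 3933.364173] mce: [Hardware Error]: ".
# This pattern is derived only from the lines quoted in this question.
PREFIX = re.compile(r"^\[\s*\d+\.\d+\]\s+mce: \[Hardware Error\]: ")


def split_mce_records(dmesg_text):
    """Split dmesg text into machine check records with the prefix removed.

    Each record is the list of lines that precede one
    "Run the above through 'mcelog --ascii'" marker. The marker itself,
    non-mce lines, and mce lines after the last marker are dropped.
    """
    records, current = [], []
    for line in dmesg_text.splitlines():
        match = PREFIX.match(line)
        if match is None:
            continue  # not an mce line, e.g. the final "Kernel panic" line
        payload = line[match.end():]
        if payload.startswith("Run the above through"):
            if current:
                records.append(current)
            current = []
        else:
            current.append(payload)
    return records
```

Each returned record can then be written to its own file, one line per entry, and fed to mcelog exactly as above with sudo mcelog --ascii < file.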
So mcelog tells me:
Hardware event. This is not a software error.
CPU 4 BANK 3 TSC a0255fbd7f7
RIP !INEXACT! 10:ffffffff8171d9c2
MISC d62285 ADDR 42dd14480
TIME 1398357146 Thu Apr 24 18:32:26 2014
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Data CACHE Level-1 Data-Read Error
STATUS be00000000200135 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 58
RIP: _raw_spin_lock+0x12/0x50}
SOCKET 0 APIC 1 microcode 15
and
Hardware event. This is not a software error.
CPU 0 BANK 3 TSC a0255fbd7f0
RIP !INEXACT! 33:45a7992c1b5
MISC d62285 ADDR 42dd14480
TIME 1398357146 Thu Apr 24 18:32:26 2014
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Data CACHE Level-1 Data-Read Error
STATUS be00000000200135 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 58
SOCKET 0 APIC 0 microcode 15
So, assuming that the motherboard is okay (as it has been replaced) and that the RAM is okay, only the CPU is left as the troublemaker, right? Is anyone familiar with all the output given?
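For readers who want to see where the mcelog lines come from, here is a rough sketch that decodes the STATUS and MCGSTATUS values by hand. The bit positions and the cache error code layout (000F 0001 RRRR TTLL) follow the machine check architecture chapter of the Intel SDM as I understand it; the table and function names are illustrative only, and mcelog remains the authoritative decoder.

```python
# Flag bits of IA32_MCi_STATUS (bit position -> meaning), assumed from the Intel SDM.
STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}
# Flag bits of IA32_MCG_STATUS.
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}

# Sub-fields of a cache hierarchy error code: 000F 0001 RRRR TTLL.
REQUEST = {0: "generic error", 1: "generic read", 2: "generic write",
           3: "data read", 4: "data write", 5: "instruction fetch",
           6: "prefetch", 7: "eviction", 8: "snoop"}
TYPE = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_flags(value, table):
    """Return the names of all bits from table that are set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_cache_error(status):
    """Return (request, type, level) if the low 16 bits are a cache hierarchy error, else None."""
    code = status & 0xFFFF
    if code & 0xEF00 != 0x0100:  # bit 12 (F) is ignored by the mask
        return None
    request = REQUEST.get((code >> 4) & 0xF, "reserved")
    kind = TYPE.get((code >> 2) & 0x3, "reserved")
    level = LEVEL[code & 0x3]
    return request, kind, level
```

Applied to the values from this crash, decode_flags(0xbe00000000200135, STATUS_FLAGS) reports the uncorrected, enabled, MISC/ADDR valid and context corrupt flags, decode_flags(5, MCG_FLAGS) gives MCIP and RIPV, and decode_cache_error(0xbe00000000200135) gives a level 1 data cache data read, which agrees with what mcelog printed.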
-
Panther about 10 years
-
Esokrates about 10 years
bodhi.zazen: As I stated in the text, I installed mcelog and the daemon is running, but there was no output stored in /var/log/mcelog. That article does not state anything that could help me here, but thanks for your comment.
-
Panther about 10 years
I don't have any experience with your problem, but as it seems to be a hardware problem, I am not sure there is a fix beyond replacing your hardware. You could file a bug against the Linux kernel and see if the kernel developers can help.
-
psusi about 10 years
I'd say you have a bad CPU.
-
Esokrates about 10 years
psusi: is there a special reason you suspect the CPU?
-
-
Esokrates about 10 years
Hi, thanks for your answer. Is there anything additional I can provide so that you can have a quick look at it? The Linux debug symbols were installed; I invoked the crash utility with the path to the symbols and ran bt. The output is attached in the initial question. Memtest86+ passed without any errors, so it does not seem to be the memory. Is swapping the memory modules still necessary? What is really strange about this issue is that the laptop panicked 4 times on the day I asked the question and has not panicked since (it cannot be overheating), so reproducing it is not easy.
-
Esokrates about 10 years
Okay, the crash happened again today (under Windows, when my sister had to use a Microsoft program). What do you mean by in-depth analysis, and what steps would it involve? The issue is strange because the motherboard has been replaced, Memtest86+ shows no errors and the Intel® Processor Diagnostic Tool passed successfully (under Windows). If it was main memory, Memtest86+ would detect that, wouldn't it? How do I check the CPU cache? One thing out of interest: could the mce=3 boot option be a possible workaround for some time? What are the risks?
-
ppetraki about 10 years
Yeah... tests lie. Not deliberately, but each test has a strong opinion of its "testing model". So unit testing memory may not reproduce the issue because of other contributing factors that only running a full OS can provide. This is why the "linux kernel build" test is so good: it demands all sorts of memory and I/O in patterns and timing a unit test can't reproduce.
-
ppetraki about 10 years
I was a little terse when I said "cpu cache". What I was implying is that the silicon yield on the chip produced errors in that area, which compromised its memory. So, component-wise, for troubleshooting the memory and CPU should both be swapped for known good components.
-
ppetraki about 10 years
The consequence of ignoring uncorrectable machine checks is silent data corruption. I'm still of the opinion that the entire machine should be replaced under warranty. You already paid for it once; you shouldn't have to pay more to troubleshoot your own kit. Also, the cost of those two components can easily amount to 30-40% of the total purchase price.