Random restarts caused by a machine check exception

kernel-panic mcelog

12,995

Solution 1

This issue has to do with a hardware failure. Specifically, it looks like the memory in bank 4 (DIMM 4, I would assume) is faulty. The MCE facility (Machine Check Events) is not widely known, but I've answered several questions on the site related to it:

Does kernel: EDAC MC0: UE page 0x0 point to bad memory, a driver, or something else?
OS errors : kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 [duplicate]

Additionally, you can write your own rules for MCE, as covered in this U&L Q&A titled: Writing triggers for mcelog.

Also, if you go through the MCE FAQ, item #6, titled: How do I "run through mcelog --ascii"?, shows you how to make use of the mcelog --ascii command. Basically you're supposed to save the panic message in a text file and then run it through the mcelog command like so:

$ mcelog --ascii < file

How can I fix this?

Option #1

You essentially have 3 options. I won't go into describing the first, which is to replace the RAM DIMM in slot 4.

Option #2

The second option would be to further diagnose the issue and confirm that it's actually a faulty DIMM. You can use memtest86+ to do this. Along with performing this test, I would also take a minute to re-seat the DIMMs to make sure they're making good contact within their slots on your motherboard, if you feel comfortable doing such a thing. It's actually quite trivial to do this.

Option #3

The third option would be to attempt to blacklist the location, assuming the fault is isolated to a couple of specific addresses within the DIMM. Believe it or not, you can actually blacklist specific memory addresses. I've explained how to do this on this site as well, in the Q&A titled: How to blacklist a correct bad RAM sector according to MemTest86+ error indication?.

Solution 2

Update all software. If you have any non-official software installed (video drivers, ...), get rid of it for now. Then try again. Nvidia drivers in particular are famous for causing instability, and Windows drivers used through ndiswrapper work mostly by mistake.

Random crashes (if the output isn't the same each time) are usually the result of overheating somewhere (bad fans, dried heat paste, airflow obstructed by dust bunnies/clogged airways). I have also seen such crashes when RAM or other components weren't firmly seated.

It could be due to bad RAM, so run memtest (it might be an option in your boot menu). Yes, this takes a very long time. Other hardware problems are more remote possibilities.

Solution 3

The MCE error (b200000000100402) is an "MCA: Internal unclassified error: 402". So it doesn't necessarily have to do with memory, or at least that cannot be concluded from this error alone. It is hardware related, as you can see in the decoded error below:

The kernel log indicates that hardware errors were detected.
System log may have more information.
The last 20 mcelog lines of system log are:
==========================================
Mar 28 01:59:27 900x3c mcelog: Hardware event. This is not a software error.
Mar 28 01:59:27 900x3c mcelog: MCE 0
Mar 28 01:59:27 900x3c mcelog: CPU 0 BANK 4
Mar 28 01:59:27 900x3c mcelog: TIME 1395968361 Fri Mar 28 01:59:21 2014
Mar 28 01:59:27 900x3c mcelog: MCG status:
Mar 28 01:59:27 900x3c mcelog: MCi status:
Mar 28 01:59:27 900x3c mcelog: Uncorrected error
Mar 28 01:59:27 900x3c mcelog: Error enabled
Mar 28 01:59:27 900x3c mcelog: Processor context corrupt
Mar 28 01:59:27 900x3c mcelog: MCA: Internal unclassified error: 402
Mar 28 01:59:27 900x3c mcelog: STATUS b200000000100402 MCGSTATUS 0
Mar 28 01:59:27 900x3c mcelog: MCGCAP c07 APICID 0 SOCKETID 0
Mar 28 01:59:27 900x3c mcelog: CPUID Vendor Intel Family 6 Model 58

Furthermore, the same error is reported in Kernel Bug 839511. There it was solved by replacing the motherboard and CPU.
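To make the STATUS value above a bit less opaque, here is a minimal Python sketch that splits b200000000100402 into its fields. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 down to 57, model-specific code in bits 31:16, MCA error code in bits 15:0). The function and dictionary names are my own, not part of mcelog, and the sketch only recognizes the "internal unclassified" code pattern.

```python
# Flag bits of IA32_MCi_STATUS (assumed architectural layout, Intel SDM).
STATUS_FLAGS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MCi_MISC holds extra information
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split a raw MCi_STATUS value into flag bits and error codes."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    mca_code = status & 0xFFFF            # bits 15:0, MCA error code
    model_code = (status >> 16) & 0xFFFF  # bits 31:16, model-specific code
    # "Internal unclassified" codes have the form 0000 01xx xxxx xxxx.
    internal_unclassified = (mca_code & 0xFC00) == 0x0400
    return {
        "flags": flags,
        "mca_code": mca_code,
        "model_code": model_code,
        "internal_unclassified": internal_unclassified,
    }


decoded = decode_mci_status(0xB200000000100402)
for name, is_set in decoded["flags"].items():
    print(f"{name:6} {'set' if is_set else 'clear'}")
print(f"MCA error code:      {decoded['mca_code']:#06x}")
print(f"Model-specific code: {decoded['model_code']:#06x}")
```

For this value, UC, EN and PCC come out set, which matches the "Uncorrected error", "Error enabled" and "Processor context corrupt" lines in the log. ADDRV comes out clear, so the record carries no memory address, which is consistent with not being able to name a memory location from this log.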

fhucho

Updated on September 18, 2022

Comments

fhucho almost 2 years

My laptop restarts randomly about twice a day. It shows the following error log before the restart.

[Image: error log shown before the restart; not preserved]

Unfortunately, I have no idea how to decode the Machine Check Exception (MCE). mcelog --ascii outputs nothing. Is there a chance that this is a software problem?

The laptop is Samsung NP900X3C with the Intel Core i5-3317U processor. I use Arch Linux with the 3.13.5 kernel.
- Admin over 9 years
  
  Unfortunately no :/ I get a restart about twice a day, totally random.
- Admin over 9 years
  
  I think it might be the kernel version maybe? It started happening only some time after I bought the notebook, so perhaps some kernel update caused it. When did it start happening to you?
- Admin over 9 years
  
  I'm sorry again, but trying to answer fhucho by commenting further, I get a "you must have 50 reputation for comments". This site is not very friendly for newcomers, which may actually be the point, I guess against spammers... But it doesn't help in this case. Anyway, my address is frigaut at gmail.com. fhucho, please email me directly, it's gonna be difficult to exchange information here.
- Admin over 9 years
  
  @FrancoisRigaut - unfortunately sites have to take a defensive position against spammers and such, and so can be a little bit uninviting until you've accumulated 50 rep. It's just how it has to be, and sorry for any inconvenience.
- Admin over 9 years
  
  @FrancoisRigaut suggested over email that updating the problem might help. I tried it and the restarts seem to be less frequent and the error messages are different.
fhucho over 10 years

Thanks. After further investigation, it's also possible that this is caused by memory regions disabled by UEFI. Do you have an idea what the b200000000100402 part means?
fhucho over 10 years

Can I determine the memory part from the log? Memtest didn't find any problem.
slm over 10 years

@fhucho - we might want to make this a separate Q. Did you run mcelog to get its output? See here for another example as well: advancedclustering.com/faq/…
slm over 10 years

@fhucho - however, based on this info: mcelog.org/bios-support.html, I'm assuming that you cannot go much further than knowing which bank was bad using the mcelog info. Your only way to determine the bad location is to use memtest86+ as described in my other A's.
fhucho over 10 years

This is the output of mcelog: gist.github.com/fhucho/a6de33934fa16c7628d2
fhucho over 10 years

memtest86+ didn't show any errors after 1 pass, I'll try more passes later.
jose.padilla about 10 years

@flucho When did this error start to trigger? Had it happened before (with another kernel version, or something else)? For me, it didn't happen as frequently with Ubuntu 13.10 (kernel version 3.11.x), but it does happen with Fedora 20 (kernel version 3.13.7). Furthermore, I cannot reproduce this error when running on battery. Can you?
jose.padilla about 10 years

I updated the BIOS from P06AAC to P07AAC. I cannot reproduce this error with Windows 7. I tried high CPU loads and nothing happens. I figured it had to do with heat, but it doesn't. I will try Fedora 20 again.
fhucho almost 9 years

I think I've tried this and it didn't work, but I'm not 100% sure. Let me know if it works for you.