MCE error: MCA: Internal parity error

linux hardware intel stability

8,804

Solution 1

While mcelog does some decoding of the MCA status register, more might be helpful.

Step 1

Download the combined Intel® 64 and IA-32 Architectures Software Developer Manuals from http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html (it's massive at 3439 pages). The section numbers below refer to the September 2014 version.

Step 2

Take the STATUS word from /var/log/mcelog and pipe it through xxd a few times to get a bit field. For mine, this is:

$ echo "9000004000010005" | xxd -r -p | xxd -b
0000000: 10010000 00000000 00000000 01000000 00000000 00000001  ...@..
0000006: 00000000 00000101                                      ..

Step 3

Do some text manipulation and then number the bits:

66665555 55555544 44444444 33333333 33222222 22221111 11111100 00000000
32109876 54321098 76543210 98765432 10987654 32109876 54321098 76543210
-----------------------------------------------------------------------
10010000 00000000 00000000 01000000 00000000 00000001 00000000 00000101
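Steps 2 and 3 can also be done in one go with a short script. This is only a minimal sketch in Python, and the function name `bit_table` is made up for illustration. It prints a tens row, a units row and the bit values for any 64-bit STATUS word, with bit 63 on the left.

```python
def bit_table(status_hex: str) -> str:
    """Return three lines: tens digit, units digit and value of bits 63..0."""
    value = int(status_hex, 16)
    bits = format(value, "064b")  # leftmost character is bit 63
    positions = range(63, -1, -1)
    tens = "".join(str(p // 10) for p in positions)
    units = "".join(str(p % 10) for p in positions)

    def group(s: str) -> str:
        # Split the 64 characters into space-separated groups of 8.
        return " ".join(s[i:i + 8] for i in range(0, 64, 8))

    return "\n".join([group(tens), group(units), group(bits)])


# STATUS value taken from the mcelog record in the question.
print(bit_table("9000004000010005"))
```

The last printed line should match the bit row produced by the `xxd` pipeline above.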

Step 4

Pull the MCi status register bit definitions from Section 15.3.2.2 of the manual:

[Figure: MCi status register bits]

In my case, bits 15:0 (the MCA error code field) hold 0x0005, "MCA Error Code 5", which mcelog has already interpreted for me as "Internal parity error" (see section 15.9.1). What I'm hoping for is more information: is the CPU, RAM or motherboard the likely cause of the parity error?

The 1 in bit 63 (VAL) just means "this register value is valid". The 1 in bit 60 (EN) just means "error reporting is enabled". Bits 52:38 hold the corrected error count; the value 1 there means one error has been corrected.

The 1 in bit 16 looks promising since it's sitting in the "Model Specific Error Code" field but, alas, according to section 16, bit [15] being equal to 0 means all I get is a 'simple' (not compound) error, so I'm done.
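The manual field-by-field reading in Step 4 can be sketched in Python as below. The helper names `field` and `decode_status` are hypothetical, and the bit positions are the architectural IA32_MCi_STATUS fields as I understand them from the manual; check them against the figure in Section 15.3.2.2 before relying on this.

```python
def field(value: int, high: int, low: int) -> int:
    """Extract bits high..low (inclusive) from value."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def decode_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into its commonly documented fields.

    Assumption: bit positions follow the architectural layout; bits 56:53
    and 37:32 are left out here because their meaning varies by processor.
    """
    return {
        "VAL": field(status, 63, 63),     # register contents are valid
        "OVER": field(status, 62, 62),    # error overflow
        "UC": field(status, 61, 61),      # uncorrected error
        "EN": field(status, 60, 60),      # error reporting enabled
        "MISCV": field(status, 59, 59),   # MCi_MISC register valid
        "ADDRV": field(status, 58, 58),   # MCi_ADDR register valid
        "PCC": field(status, 57, 57),     # processor context corrupt
        "corrected_count": field(status, 52, 38),
        "MSCOD": field(status, 31, 16),   # model-specific error code
        "MCACOD": field(status, 15, 0),   # MCA error code
    }


for name, value in decode_status(0x9000004000010005).items():
    print(f"{name:16s} {value:#x}")
```

For the STATUS word above this gives VAL=1, EN=1, a corrected count of 1, MSCOD=0x1 and MCACOD=0x5, which agrees with the reading in the text. It does not answer the cache-versus-system-memory question either.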

Bottom line: Can't tell if the parity error is from cache memory or system memory. Can't tell what "internal" means. Internal to what? So I swapped memory, same problem, then swapped CPU with another machine (got lucky, compatible sockets) and the problem stopped... on both machines. Not exactly the pinpoint diagnostic help I was hoping for from this advanced hardware, and I don't understand why the "bad" CPU is happy in another machine, but problem solved.

Solution 2

Possibly related to Intel erratum HSW131 (or similar), which covers spurious and harmless MCA 05 (Internal parity error) reports.

Solution: Ignore.

Solution 3

I'm running a Linux box with an Intel i5-3550 (Ivy Bridge) and I did get this issue for a while (same exact status value), although it only affected cores 2 and 3 (and mostly only 2), so I disabled them for a few weeks, assuming the hardware was most likely dying.

I had noticed that average running temperatures were higher than usual, but the issue persisted even after I cleaned the machine up a bit. It manifested itself not only through MCE error messages but also as unpredictable segmentation faults and crashes in running processes.

Well, it turns out that for some inscrutable reason UEFI decided to clock the CPU up to ~4.1 GHz when in Turbo mode - when the specs say it only goes up to 3.7 GHz. Manually reconfiguring those limits appears to have solved the problem.

TL;DR: to anyone reading this, check for overclocking as well.
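One rough way to check for this from Linux, sketched in Python. It assumes the kernel's cpufreq sysfs interface is available and exposes a per-core `scaling_cur_freq` file in kHz; the 3.7 GHz limit is the rated turbo clock mentioned above, and the function names are made up for illustration. A single reading can easily miss short turbo bursts, so sample repeatedly under load, and treat the UEFI settings as the authoritative source.

```python
from pathlib import Path

RATED_MAX_KHZ = 3_700_000  # 3.7 GHz, the rated turbo clock quoted above


def read_current_freqs(root: str = "/sys/devices/system/cpu") -> dict:
    """Return {core number: current frequency in kHz} from cpufreq sysfs.

    Assumption: cpuN/cpufreq/scaling_cur_freq exists; cores without it
    (or unreadable ones) are skipped silently.
    """
    freqs = {}
    for path in Path(root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        try:
            core = int(path.parent.parent.name[3:])  # "cpu2" -> 2
            freqs[core] = int(path.read_text().strip())
        except (OSError, ValueError):
            continue
    return freqs


def cores_over_limit(freqs_khz: dict, limit_khz: int = RATED_MAX_KHZ) -> list:
    """List the cores whose reported frequency exceeds the rated maximum."""
    return sorted(core for core, f in freqs_khz.items() if f > limit_khz)


print("cores above rated clock:", cores_over_limit(read_current_freqs()))
```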

Solution 4

Are you running 32-bit processes?

You can find some details by searching for "internal parity error" in:

http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

HSD55. Internal Parity Errors May Incorrectly Report Overflow in The IA32_MCi_STATUS MSR
    Problem:
    Due to this erratum, uncorrectable internal parity error reports with an
    IA32_MCi_STATUS.MCACOD (bits [15:0]) value of 0005H and an
    IA32_MCi_STATUS.MSCOD (bits [31:16]) value of 0004H may incorrectly set the
    IA32_MCi_STATUS.OVER flag (bit 62) indicating an overflow even when only a single
    error has been observed.
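To test a STATUS word against the signature this erratum describes, something like the following Python sketch may help. The function name is hypothetical, and reading "uncorrectable" as the UC flag (bit 61) is my assumption; the MCACOD, MSCOD and OVER positions come from the erratum text itself.

```python
def over_flag_suspect(status: int) -> bool:
    """True if the OVER flag is set on a report matching the HSD55 signature.

    Assumption: "uncorrectable" in the erratum corresponds to UC (bit 61).
    """
    mcacod = status & 0xFFFF          # bits 15:0
    mscod = (status >> 16) & 0xFFFF   # bits 31:16
    uc = (status >> 61) & 1           # bit 61
    over = (status >> 62) & 1         # bit 62
    return uc == 1 and mcacod == 0x0005 and mscod == 0x0004 and over == 1


# The STATUS word from the question is a corrected error with MSCOD 0x1.
print(over_flag_suspect(0x9000004000010005))
```

For the STATUS word from the question this returns False, since the error is a corrected one and MSCOD is 0x0001 rather than 0x0004.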

I have the same issue on a Haswell i7-4790 (fourth generation) running 32-bit Linux KVM machines on CentOS 7 (x64).

http://ark.intel.com/products/80806/Intel-Core-i7-4790-Processor-8M-Cache-up-to-4_00-GHz


Greg Bell

Updated on September 18, 2022

Comments

Greg Bell almost 2 years
I have an unstable machine running Ubuntu 14.04 LTS which passes 9 hours of memtest86.

I get these:
```
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 0 
TIME 1414735539 Fri Oct 31 17:05:39 2014
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 9000004000010005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 58
```
This is when the machine keeps going. I don't yet have one for when the machine freezes.
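To collect these records for later decoding, the CPU, bank and STATUS values can be pulled out of the mcelog text with a couple of regular expressions. This is a sketch in Python that assumes the record layout shown above (other mcelog or kernel versions may format it differently); `parse_record` and `SAMPLE` are names made up for the example.

```python
import re

# Shortened copy of the record above, used as sample input.
SAMPLE = """\
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 0
TIME 1414735539 Fri Oct 31 17:05:39 2014
MCA: Internal parity error
STATUS 9000004000010005 MCGSTATUS 0
MCGCAP c09 APICID 2 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58
"""


def parse_record(text: str) -> dict:
    """Extract CPU number, bank number and STATUS word from one record."""
    cpu_bank = re.search(r"^CPU (\d+) BANK (\d+)", text, re.MULTILINE)
    # Anchoring at line start avoids matching the "STATUS" inside "MCGSTATUS".
    status = re.search(r"^STATUS ([0-9a-fA-F]+)", text, re.MULTILINE)
    if cpu_bank is None or status is None:
        raise ValueError("not a recognisable mcelog record")
    return {
        "cpu": int(cpu_bank.group(1)),
        "bank": int(cpu_bank.group(2)),
        "status": int(status.group(1), 16),
    }


record = parse_record(SAMPLE)
print(record["cpu"], record["bank"], hex(record["status"]))
```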

What's "MCE 0"? And "MCA"? And am I looking at a CPU error or a RAM error?

I've got one stick of 8 GB of RAM.

In what order should I replace the hardware (RAM, CPU, motherboard, power supply)? The machine used to be stable. Should I up the CPU voltage a bit?

I've read the mcelog FAQ. Google results are sparse, and most show other formats of similar messages (i.e. from old versions of the kernel/MCE, maybe).
- wurtel over 9 years
  
  A single-bit error can happen now and then; that's why servers have parity memory. If it happens a lot, then there's a problem. I'd begin by replacing the RAM; perhaps just reseating the DIMM might help. Increasing the RAM voltage (just a little bit) may also help. I once had a motherboard where the voltage controller was decaying: every week I needed to increase the RAM voltage to get it to boot, and in the end it couldn't go up any higher and I replaced the motherboard.
- Greg Bell over 9 years
  
  Yeah, the problem here is that I've just started watching mcelog's output because of the freezes. I'm trying to catch what error causes the actual freeze, and this one wasn't it. But am I looking at a cache memory parity error or one from system memory?
- Gilles 'SO- stop being evil' over 9 years
  
  MCA, MCE
- slm over 9 years
  
  Take a look at this Q&A as well: unix.stackexchange.com/questions/117449/…
- Greg Bell over 9 years
  
  The section covering MCA in Intel's Software Developer Manual for the Intel 64 and IA-32 chips is HUGE. Section 15 covers MCA. intel.com/content/dam/www/public/us/en/documents/manuals/… I've done low-level development, so I can decode 64-bit words and all, but is this what's required to figure out what hardware I should replace?
Greg Bell over 9 years

Thanks for your answer. Yes, running on 32-bits, but the system was very unstable. Plus the MSCOD was 0x1, not 0x4.
myset over 8 years

I solved all the MCE errors on my KVM box when I changed the processor of the x86 machines from KVM32 bits to Haswell or another one from the Intel family.