RAM tests inconsistently - what is the most likely culprit? (i.e. what should I spend money on replacing)


Solution 1

It doesn't sound like any component is defective; rather, you are using an incompatible combination.

Having multiple sockets on the same memory bus populated increases the capacitance on each data line and slows down the rise time, which can cause transitions to arrive late and be misdetected. Electrical engineers call the number of loads a single driver has to drive its "fan-out", and this is a fan-out problem.
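To make the capacitance argument concrete, here is a rough sketch that treats one data line as a lumped RC node, for which the 10%-90% rise time is ln(9)·R·C (about 2.2·RC). A real memory bus behaves as a transmission line, so this only illustrates the trend. The driver impedance and capacitance values are assumptions picked for illustration, not measurements from any board.

```python
import math

def rise_time_10_90(r_ohms: float, c_farads: float) -> float:
    """10%-90% rise time of a first-order RC node: ln(9) * R * C."""
    return math.log(9) * r_ohms * c_farads

# Hypothetical values, chosen only to show the trend.
DRIVER_OHMS = 40.0  # assumed effective driver impedance
TRACE_PF = 2.0      # assumed capacitance of the trace itself
LOAD_PF = 1.5       # assumed input capacitance per attached load

def line_rise_time_ps(loads: int) -> float:
    """Rise time in picoseconds with `loads` devices hanging on the line."""
    c_total = (TRACE_PF + loads * LOAD_PF) * 1e-12
    return rise_time_10_90(DRIVER_OHMS, c_total) * 1e12

for loads in (1, 2, 4):
    print(f"{loads} load(s): {line_rise_time_ps(loads):.0f} ps")
```

Every extra load adds the same amount of capacitance, so in this model the rise time grows linearly with the number of loads on the line.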

This is further complicated by the fan-out internal to a memory module. The number of DRAM devices on the module and how they are grouped into "ranks" (sets of chips that are selected together) will affect how many modules you can successfully connect in parallel.

Server motherboards supporting a lot of memory sockets actually require buffered memory, which uses a cascading network of buffers to limit the fan-out (and therefore capacitance) seen by each one. There's delay caused by the buffers themselves, but it only increases logarithmically with the number of loads, whereas for unbuffered memory capacitance increases linearly.
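The logarithmic-versus-linear claim can be sketched with a simple count. Assume an idealized buffer tree in which no driver sees more than `max_fanout` loads; the function name and the fan-out limit of 4 are hypothetical and only illustrate the scaling.

```python
def tree_depth(loads: int, max_fanout: int) -> int:
    """Driving stages needed so that no stage drives more than max_fanout loads."""
    depth = 0
    reach = 1  # number of loads reachable with `depth` stages
    while reach < loads:
        reach *= max_fanout
        depth += 1
    return depth

for loads in (4, 16, 64):
    # Unbuffered: the single driver sees every load directly.
    print(f"{loads} loads: buffered depth {tree_depth(loads, 4)}, "
          f"unbuffered loads on one driver {loads}")
```

Going from 4 to 64 loads multiplies the unbuffered load by 16 but adds only two buffer stages of delay in this model.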

Wikipedia discusses this: https://en.wikipedia.org/wiki/Memory_rank

Some motherboard manuals actually call this sort of thing out. For others you can deduce the information from the RAM compatibility lists. As an example, the ASUS Z170-A motherboard shows that dual rank (called DS = double sided in the manual) can only be used in two slots at once on that board, as opposed to the ability to use four single rank DIMMs at once.
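The rank counting behind that kind of limit can be sketched as follows. This assumes a dual-channel board with DIMMs spread evenly across the channels; whether the 8GB modules in the question are dual rank is not stated there, so the last case is an assumption.

```python
def ranks_per_channel(dimm_ranks: list[int], channels: int = 2) -> list[int]:
    """Sum the ranks on each channel, filling channels round-robin."""
    totals = [0] * channels
    for i, ranks in enumerate(dimm_ranks):
        totals[i % channels] += ranks
    return totals

print(ranks_per_channel([1, 1, 1, 1]))  # four single-rank DIMMs
print(ranks_per_channel([2, 2]))        # two dual-rank DIMMs
print(ranks_per_channel([2, 2, 2, 2]))  # four dual-rank DIMMs (assumed case)
```

Four single-rank DIMMs and two dual-rank DIMMs put the same number of ranks on each channel, while four dual-rank DIMMs double it.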


Solution 2

That sounds like an issue in the processor's integrated memory controller.

In modern systems, motherboards don't really play a role in memory management beyond just providing a path between the memory modules and the processor. Memory is directly connected to the processor to minimize latency; the "northbridge" that connects the memory to the processor in older systems is now part of the processor itself. (The firmware or PCH may control how the processor runs the RAM, but it doesn't make sense for it to cause bit errors of the sort you describe as it's ultimately the responsibility of the processor.) Hence, the very first thing I'd suspect in a situation like this is a faulty IMC.

In fact, I'd be very surprised if the motherboard or system firmware were to blame for the problems you're experiencing.

Solution 3

I see some bad reviews for the BIOS on that motherboard. I would start by checking for a BIOS update. Never skimp on the motherboard.

Solution 4

It's possible that the RAM could be faulty as well, even though it may not appear to be. I had a recent issue with my home server involving a fatal mishap with some iced tea...

I went through the entire process of replacing each part individually (2 CPUs, mobo, power supply, and 2 banks of 16 GB (2x8GB) RAM), and everything tested fine when I used just a single bank of RAM with a single CPU (except for 1 CPU, which was toast).

It didn't matter which configuration I used, it always worked when I had a single CPU and bank of RAM (whether it was 16GB or 32GB of RAM), but when I put in the 2nd CPU and split the RAM so it was 16GB per bank, the server failed to boot.

It wasn't until I replaced one bank of RAM completely that it finally booted and ran properly, and has been ever since.

tl;dr: As @moab stated in his comment, you can never tell for certain until you test every component in a compatible system.

Author: fdmillion

Updated on September 18, 2022

Comments

  • fdmillion
    fdmillion almost 2 years
    • Motherboard: GA-B85M-DS3H-A
    • CPU: Core i5 4430
    • RAM: PNY XLR8 DDR3 32GB (4x8GB) 1600MHz (MD32768K4D3-1600-X9)
    • PSU: EVGA 500 W1 80+

    The Problem

    With all 32GB of RAM installed, the system fails MemTest86+ 6.2 consistently. The failure always occurs during the first pass, and the error count quickly rises into the millions. Attempting to run Windows results in random reboots and Stop errors (as would be expected with RAM errors).

    What I've Tried

    • Test a single 8GB PNY module in socket DIMM1. Successfully completes 4 passes of MemTest.
    • Test a single 8GB PNY module in socket DIMM2. Successfully completes 4 passes of MemTest.
    • Test a single 8GB PNY module in socket DIMM3. Successfully completes 4 passes of MemTest.
    • Test a single 8GB PNY module in socket DIMM4. Successfully completes 4 passes of MemTest.
    • Test all four 8GB PNY DIMMs separately, individually, in socket DIMM1. All modules successfully complete 4 passes of MemTest.
    • Test two 8GB PNY modules in sockets DIMM1 and DIMM2. Successfully completes 4 passes of MemTest.
    • Test two 8GB PNY modules in sockets DIMM3 and DIMM4. Successfully completes 4 passes of MemTest.
    • Test the motherboard with four 2GB known-good DIMMs in all sockets. Successfully completes 4 passes of MemTest.
    • Swap the ordering of the PNY DIMMs in the sockets. No change - MemTest errors still occur.
    • Raise the motherboard RAM voltage from 1.5V to 1.65V. No change - MemTest errors still occur.
    • Play with various combinations of the RAM manual settings in the setup utility - enabling/disabling XMP profile, setting "increased stability" preset, etc. No change, MemTest errors still occur.

    I think I can safely rule out bad RAM and bad RAM sockets. The only time the MemTest tests fail is if all four 8GB modules are installed simultaneously.

    I've measured voltages coming off the PSU and everything there appears stable even with all four sticks installed.

    As I write this, I have tried a last resort option of manually reducing the RAM speed to 1066MHz in the BIOS. So far, MemTest has completed one pass and is on its second with no errors. (All the above tests were performed at the native 1600MHz RAM speed.) This may allow me to use the system, albeit with slightly slower RAM speeds, but this does not seem to be a permanent fix.
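For a sense of how much timing margin the slowdown buys: the "1600MHz" and "1066MHz" figures are transfer rates (mega-transfers per second), so the time available for each bit on the bus is the reciprocal of that rate. A small sketch, using the nominal figures from above:

```python
def unit_interval_ps(mega_transfers_per_s: float) -> float:
    """Time per data transfer in picoseconds for a given rate in MT/s."""
    return 1e6 / mega_transfers_per_s

fast = unit_interval_ps(1600)
slow = unit_interval_ps(1066)
print(f"1600 MT/s: {fast:.0f} ps per bit")
print(f"1066 MT/s: {slow:.0f} ps per bit")
print(f"extra time per bit: {slow - fast:.0f} ps")
```

Each bit gets roughly half again as long to settle at the lower speed, which fits a marginal signal improving without being fully fixed.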

    Whenever MemTest errors occur, they always occur in the same exact bit positions within the 64-bit data word:

    Bit Error Mask: 00000000FF000000
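Reading the mask as a 64-bit data word, it can be decoded into the failing bit positions and byte lanes. The sketch below assumes bit 0 is the least significant data bit and that lanes are numbered from the least significant byte; how those map onto physical DQ lines on this particular board is not something the mask alone tells you.

```python
def decode_error_mask(mask_hex: str) -> dict:
    """Return the set bit positions and the byte lanes they fall in."""
    mask = int(mask_hex, 16)
    bad_bits = [b for b in range(64) if (mask >> b) & 1]
    bad_lanes = sorted({b // 8 for b in bad_bits})
    return {"bits": bad_bits, "byte_lanes": bad_lanes}

result = decode_error_mask("00000000FF000000")
print(result["bits"])        # bits 24 through 31
print(result["byte_lanes"])  # a single lane: 3
```

All failing bits sit in one byte lane (one group of eight data lines) out of eight, which points at that group of signals rather than at randomly scattered cell failures.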
    

    Additionally, errors NEVER occur below the 4GB barrier. In other words, all errors occur in the address space between 4GB and 32GB.

    I'm deducing this to be some sort of strange interaction or timing problem with the CPU and the RAM and the motherboard, since the errors are very consistent, only occur in one specific configuration, appear to be mitigated by slowing down the RAM, and only occur above the 4GB barrier. My question is: Is it more likely that my CPU or my motherboard is the culprit?

    I have been intending to upgrade this machine to a Core i7-4790K. If the CPU is the likely culprit (I know that the memory controller is on the CPU in these newer models), that works out well because I am planning to upgrade it anyway. But I'm wondering if there's a chance that the motherboard itself might also be part of the problem. I would not want to spend the money on the i7 CPU only to experience the exact same problem and find out I also have to replace the motherboard...

    Advice?


    EDIT: The slower RAM speed still produced errors, but only once the test reached the third pass. I restarted the test with only one CPU active just to test for an interaction on the CPU itself.

    • Moab
      Moab about 8 years
      Only way to confirm if it is memory, mobo or cpu is to test ram in another compatible system.
    • Joshua
      Joshua about 8 years
      If the problem doesn't move when you move RAM chips, motherboard is tosser.
    • Psycogeek
      Psycogeek about 8 years
      When you're running this memory in dual, or when you have all four 8 GB modules in, you could possibly take it off SPD (auto) and tweak the timings a bit to get it to work. Say it is 10,11,10,24: tune it to 11,12,11,32 and test like that instead. (Yes, this is guessing.) If that works 100% non-stop, then it is less likely to be a heat issue or a motherboard problem. People with 4x8 GB modules have had the problems you describe before. If there is voltage regulation support and the CPU has no bent pins, this can be a way to get modules that are not on the compatibility list to work. So test that and get back to us.
    • Psycogeek
      Psycogeek about 8 years
      "The slower RAM speed still produced errors, but only once the test reached the third pass" During any of this, are you taking extra steps to test cooling of the RAM? Even a temporary added fan or an external fan moving air across the RAM and its voltage regulation circuitry could show whether heat is one of the issues.
    • amziraro
      amziraro about 8 years
      @Psycogeek +1 for suggesting a timing modification. Some RAM modules don't play nice with others as far as timings go (even the same brand or module type). I have had a similar problem to OP and solved by setting timings manually.
    • Reyssor
      Reyssor about 8 years
      Reminds me of an issue Supermicro had some time ago with their X10 motherboards when using a combination of 4 x 8 GB of a specific Kingston RAM after they switched to another supplier for some component. I don't think there has ever been a final public word from the companies about the exact source of the problem, only rumors.
    • user
      user about 8 years
      When you move the DIMMs around but keep all four installed and see errors, do the error locations change only as you change the physical DIMM locations, or are the error locations essentially random, or do the error locations remain the same regardless of the order in which the physical DIMMs are installed?
    • Monstieur
      Monstieur about 8 years
      Have you increased the VCCIO voltage for Haswell? With 4 memory sticks running at the maximum supported speed, a voltage bump to the memory controller may be required.
    • Taegost
      Taegost about 8 years
      One thing that isn't mentioned anywhere: Has this configuration ever worked successfully?
  • fdmillion
    fdmillion about 8 years
    BIOS is current. Admittedly the RAM is not on the "qualified" list, but it has the same timings as plenty of other modules listed there.
  • Michael Hampton
    Michael Hampton about 8 years
    What about a bent pin?
  • Atoadaso
    Atoadaso about 8 years
    I would look into replacing the motherboard then. It doesn't have to be top of the line, just start with a price range you can afford and look for the ones with the most reviews (read them too). Those with the biggest user base are a lot more likely to have long-term support for BIOS and chipset updates.
  • Ben Voigt
    Ben Voigt about 8 years
    @Michael: A bent pin would result in failures testing individual modules also.
  • brhans
    brhans about 8 years
    Assuming this is the cause of the problem, would it help to turn off SPD and tweak the timing settings a little slower to compensate for the slower rise/fall times?
  • bwDraco
    bwDraco about 8 years
    I'm not sure whether this is actually correct. Consumer Haswell processors generally support four memory ranks per channel, which is enough to allow four double-sided modules in two memory channels. Why would this be the issue? This also doesn't seem to explain the fact that the problems only happen above the 4 GB barrier. Furthermore, the motherboard's manual states that the underlying B85 chipset supports 32 GB of memory and does not mention any limitation regarding the number of memory ranks.
  • Ben Voigt
    Ben Voigt about 8 years
    @bwDraco: Even though the memory controller is on the CPU, the motherboard also matters. The PCB layout can affect it: suboptimal length matching will decrease the phase margin on the signals (this is also why errors correlate to certain bytes or bit positions). That the motherboard manual doesn't talk about ranks doesn't mean that all combinations are supported; it just means it's a crap manual that doesn't go into detail.
  • Ben Voigt
    Ben Voigt about 8 years
    @brhans: It's not the timing parameters that matter, but the memory clock frequency, because the problem is in the transfer between the CPU and DIMMs, not internal to the DRAM. SPD usually has a number of profiles corresponding to different clock frequencies; choosing a different one of these would be better than going fully manual.
  • alex.forencich
    alex.forencich about 8 years
    Definitely seems like a motherboard signal integrity issue. The larger modules could have higher capacitance per pin than the smaller modules, especially if the modules themselves are dual rank. This could cause exactly this issue when you fully populate the ranks. It is possible for a module to have more than one rank. So four ranks per channel could easily be two dual-rank high density modules. This could be exacerbated by the electrical characteristics and routing of the traces on the motherboard. My suggestion: try another motherboard.
  • milesrf
    milesrf about 8 years
    Have you checked if that motherboard is even able to handle 32 GB of memory at once properly? Also, you could find the memory manager chip on the motherboard and look up how much memory it is expected to be able to handle properly.