What does "single-bit ECC errors were detected on the RAID controller" mean?

raid memory dell-poweredge dell-perc ecc

18,996

Solution 1

This error corresponds to the cache module on the controller. At this point, you probably need to replace the RAM or the PERC controller itself. This should be standard warranty work.

Solution 2

The RAID controller message “single-bit error detected” is just informational. It is neither a hardware failure nor a real warning that you need to contact the manufacturer for a fix. Most commercially available memory (RAM) suffers random errors (military hardware excluded). For computing environments where this is unacceptable, there is a solution: ECC. I believe it is the cheapest and simplest way to detect single-bit errors and correct them. A critical error, then, is one that affects more than one bit. Handling that may require other techniques such as “ChipKill” (so the board can disable a chip that should no longer be trusted). When a single-bit error is detected, it usually just increments an internal hardware counter/register, to keep some statistics. These are not errors that justify hardware replacement. This is what ECC is built for.

The number of single-bit errors may vary. I have been interested in this subject for 16 years, and I've found that the number grows exponentially. In my experience it correlates with only one other parameter: the amount of time the system has been running (power-on hours). The two thresholds worth mentioning are 18 months (where the exponential curve ramps up) and 36 months (where two-bit errors start to appear). I have looked at other parameters and found no correlation whatsoever: brand, model, “cheap/expensive products”, heat, read/write operations. The key is just time (“power-on hours”). This might also indicate a “planned obsolescence” strategy applied to computing hardware. So a capitalist system might require computing hardware to be renewed every 3 years, or up to every 6 years (with extra maintenance budget).

You also mention other errors which I believe are not directly related to the ECC issue (your question).


jsp

Updated on September 18, 2022

Comments

jsp almost 2 years
I have a Dell T7600 with a Perc H710P RAID controller and 4 attached 3TB drives. Over the past few months the RAID controller has been intermittently reporting errors on boot: "no boot device found", "adapter at baseport is not responding", disks frequently reported as missing or failed.

I have since replaced the RAID controller, the 4 hard drives, and finally the system's motherboard.

After replacing the motherboard and rebooting a few times, I got the error
```
Single bit ECC errors were detected on the RAID controller.
Please contact technical support to resolve this issue.
```
After rebooting about 20 more times, I haven't seen the ECC error again. The system seems otherwise OK, except that the disk fans will sometimes start blowing at full blast when the system is sitting completely idle, and they do not stop until I reboot.

Are the ECC errors in memory on the RAID controller? Or, does the RAID controller map in system memory, and the ECC errors are really in system memory? Or, are the ECC errors in the 1GB cache that resides in the RAID controller?
Michael Hampton over 10 years

I didn't think the cache was replaceable on this particular controller as it appears to be soldered directly to the board.
jsp over 10 years

This will be the 4th time I've replaced this controller in 2 months. Would you happen to know if the controller is also responsible for controlling the disk fans?
HopelessN00b over 10 years

@jsp I'd definitely be looking for reasons why 4 controllers have failed in such short order. Bad power? Overheating? (Something else...?)
HopelessN00b over 10 years

95 to 100 degrees F isn't bad for being inside the thermal envelope of the card, no. I'd try the Dell OpenManage monitoring tools and look for voltage variances, or... well, anything out of range (or close) on or related to that controller card.