S.M.A.R.T - Predictive Failure Count

raid ibm drive-failure

21,773

Solution 1

There are errors on your disk. S.M.A.R.T. stands for Self-Monitoring, Analysis and Reporting Technology.

The specific errors you mention correlate with mechanical degradation of the drive. You may be able to use this report to obtain a warranty replacement from IBM. The drive WILL eventually fail.

Solution 2

From a Seagate doc:

Predictive failures

S.M.A.R.T. signals predictive failures when the drive is performing unacceptably for a period of time. The firmware keeps a running count of the number of times the error rate for each attribute is unacceptable. To accomplish this, a counter is incremented each time the error rate is unacceptable and decremented (not to exceed zero) whenever the error rate is acceptable. If the counter continually increments such that it reaches the predictive threshold, a predictive failure is signaled. This counter is referred to as the Failure History Counter. There is a separate Failure History Counter for each attribute.
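The counter behaviour quoted above can be sketched in a few lines of Python. This is only an illustration of the increment/decrement rule as described, not actual drive firmware: the function name, the boolean input format, and the threshold value of 3 are all assumptions, and real thresholds are firmware-specific.

```python
def failure_history(samples, threshold):
    """Replay per-interval verdicts through a Failure History Counter.

    samples:   iterable of booleans, True when the error rate for that
               interval was unacceptable (hypothetical input format).
    threshold: hypothetical predictive threshold.
    Returns (final_counter, index of the interval that first reached the
    threshold, or None if it was never reached).
    """
    counter = 0
    tripped_at = None
    for i, unacceptable in enumerate(samples):
        if unacceptable:
            counter += 1
        else:
            counter = max(0, counter - 1)  # the counter never drops below zero
        if tripped_at is None and counter >= threshold:
            tripped_at = i  # a predictive failure would be signaled here
    return counter, tripped_at


# Occasional bad intervals are cancelled out by good ones.
print(failure_history([True, False, True, False, False], threshold=3))
# A sustained run of bad intervals reaches the threshold.
print(failure_history([True, True, False, True, True, True], threshold=3))
```

The point of the design is that isolated errors decay away, while a drive that keeps misbehaving accumulates a count that eventually trips the threshold.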

Here's how to locate the faulty disk:

MegaCli -PdLocate -start -physdrv[E:S] -aA

E : Enclosure
S : Slot
A : Adapter
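As a rough sketch, the placeholders can be filled in from the drive listing in the question (enclosure 252, slot 6). The adapter number 0 is an assumption, as is the binary name: on some installs it is MegaCli64 or lives under a vendor directory. The helper name `pd_locate_args` is made up for this example. It only builds the argument list, so nothing touches the controller until you pass it to `subprocess.run` yourself.

```python
def pd_locate_args(enclosure, slot, adapter, action="start"):
    """Build the MegaCli argument list that starts or stops the locate LED."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    return [
        "MegaCli",                          # assumed binary name
        "-PdLocate",
        f"-{action}",
        f"-physdrv[{enclosure}:{slot}]",    # E:S from the drive listing
        f"-a{adapter}",                     # adapter index (assumed to be 0 below)
    ]


args = pd_locate_args(252, 6, 0)
print(" ".join(args))
# To actually blink the LED: import subprocess; subprocess.run(args, check=True)
```

Calling the helper again with `action="stop"` gives the matching command to turn the LED off once the drive has been swapped.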

Solution 3

The drive is physically failing at this point. The most important thing to worry about right now is having a good backup of your data, and a plan to get that drive replaced ASAP.

Bastien974

Updated on September 18, 2022

Comments

Bastien974 almost 2 years

I'm monitoring the RAID status of my IBM ServeRAID M5015 controller with MegaCLI, and one of the disks reports this:

Enclosure Device ID: 252
Slot Number: 6
Enclosure position: 0
Device Id: 14
Sequence Number: 2
Media Error Count: 32
Other Error Count: 0
Predictive Failure Count: 18
Last Predictive Failure Event Seq Number: 8119
PD Type: SAS
Raw Size: 279.396 GB [0x22ecb25c Sectors]
Non Coerced Size: 278.896 GB [0x22dcb25c Sectors]
Coerced Size: 278.464 GB [0x22cee000 Sectors]
Firmware state: Online, Spun Up
SAS Address(0): 0x5000c50042c319c9
SAS Address(1): 0x0
Connected Port Number: 5(path0)
Inquiry Data: IBM-ESXSST9300653SS     B6336XN04HC10525B633
IBM FRU/CRU: 81Y9671
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive:  Not Certified
Drive Temperature :33 Celsius

What does this mean exactly? I can't find an exact description. Is there a way to get more details? The RAID array is in the Optimal state.

Media Error Count: 32

Predictive Failure Count: 18

Is there a way through the CLI to turn on the front LED so I physically know which disk I need to replace?

Phil over 10 years

A related tip for future use: when inserting drives into any hot-swap system, label the caddy with the drive's serial number (and remember to keep it up to date!). It's never bad to have this information on the front of the system.

Bastien974 over 12 years

Obviously I'm gonna change it, but I still want to know what those numbers mean. Are they considered high? What if they're not increasing any more?

DanBig over 12 years

It's probably a read or write error. The count will most likely rise as the drive continues to be used.
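
One way to check whether the counts are still rising is to save the physical drive listing at two points in time (for example from `MegaCli -PDList -aALL`) and compare the counters. The sketch below assumes the output keeps the `Key: value` layout shown in the question. The function names are made up for this example, and the second snapshot uses an invented value purely to show the comparison.

```python
COUNTERS = ("Media Error Count", "Other Error Count", "Predictive Failure Count")


def parse_pd_counters(text):
    """Return {(enclosure, slot): {counter_name: int}} from drive listing text."""
    drives = {}
    enclosure = None
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            enclosure = value
            current = None
        elif key == "Slot Number":
            current = drives.setdefault((enclosure, value), {})
        elif key in COUNTERS and current is not None:
            current[key] = int(value)
    return drives


def rising_counters(before, after):
    """List (drive, counter, old, new) for every counter that increased."""
    changes = []
    for drive, counts in after.items():
        old = before.get(drive, {})
        for name, new in counts.items():
            if new > old.get(name, 0):
                changes.append((drive, name, old.get(name, 0), new))
    return changes


# First snapshot: values taken from the listing in the question.
snapshot_1 = """Enclosure Device ID: 252
Slot Number: 6
Media Error Count: 32
Other Error Count: 0
Predictive Failure Count: 18
"""
# Second snapshot: the value 21 is invented for illustration only.
snapshot_2 = snapshot_1.replace("Predictive Failure Count: 18",
                                "Predictive Failure Count: 21")

for change in rising_counters(parse_pd_counters(snapshot_1),
                              parse_pd_counters(snapshot_2)):
    print(change)
```

A counter that stays flat between snapshots does not mean the drive has recovered. It only means no new events were logged in that window.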