Drive is failing but LSI MegaRAID controller does not detect it


To see the RAID controller logs, run this command:

/opt/MegaRAID/MegaCli/MegaCli -AdpEventLog -GetLatest 1000 -f events.log -aALL

The events.log file contained entries like these, which indicate a problem with the disk:

Code: 0x0000006e
Class: 0
Locale: 0x02
Event Description: Corrected medium error during recovery on PD 07(e0xfc/s2) at f04cb53
Event Data:
===========
Device ID: 7
Enclosure Index: 252
Slot Number: 2
LBA: 251972435


seqNum: 0x00004f65
Time: Wed Mar  6 05:36:48 2013

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 07(e0xfc/s2) Path 4433221101000000, CDB: 28 00 0f 04 d1 f7 00 01 e0 00, Sense: 3/11/00
Event Data:
===========
Device ID: 7
Enclosure Index: 252
Slot Number: 2
CDB Length: 10
CDB Data:
0028 0000 000f 0004 00d1 00f7 0000 0001 00e0 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data:
00f0 0000 0003 000f 0004 00d2 0046 000a 0000 0000 0000 0000 0011 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x00004f64
Time: Wed Mar  6 05:36:43 2013
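
Because the interesting entries are easy to miss in a 1000-event dump, it can help to tally them per drive. The following is only a minimal sketch: it assumes every drive-related entry carries a `PD nn(e0x../sN)` token in its `Event Description:` line, as in the two samples above, and the function name `summarize_events` is hypothetical, not part of MegaCli.

```python
import re
from collections import Counter

# Matches the "PD 07(e0xfc/s2)" token seen in the sample Event Description lines.
PD_PATTERN = re.compile(r"PD ([0-9a-fA-F]+)\(e(0x[0-9a-fA-F]+)/s(\d+)\)")


def summarize_events(log_text):
    """Count event kinds per physical drive, keyed by (enclosure, slot, kind)."""
    counts = Counter()
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        if not line.startswith("Event Description:"):
            continue
        description = line.split(":", 1)[1].strip()
        match = PD_PATTERN.search(description)
        if match is None:
            continue  # this event does not name a physical drive
        enclosure = int(match.group(2), 16)  # e0xfc -> 252
        slot = int(match.group(3))
        # Use the text before the PD token as the event kind,
        # dropping a trailing " on" or ":" left over from the sentence.
        kind = re.sub(r"(?:\s+on)?[\s:]*$", "", description[: match.start()])
        counts[(enclosure, slot, kind)] += 1
    return counts
```

Calling `summarize_events(open("events.log").read())` on the file written by the command above would return a counter in which a drive with many "Corrected medium error during recovery" or "Unexpected sense" entries stands out.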

Author: nn4l

Updated on September 18, 2022

Comments

  • nn4l
    nn4l over 1 year

    smartmontools reports an increasing number of unreadable sectors on a drive that is used in a RAID1 configuration. I thought that the LSI MegaRAID controller also checks the SMART status of its disk drives, and would therefore recognize the drive as failing and mark it offline. Why does it not do so?

    Output from smartctl -d sat+megaraid,7 -a /dev/sda:

    ...
    197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       69
    ...
    Error 11 occurred at disk power-on lifetime: 9704 hours (404 days + 8 hours)
    When the command that caused the error occurred, the device was active or idle.
    
    After command completion occurred, registers were:
    ER ST SC SN CL CH DH
    -- -- -- -- -- -- --
    40 51 11 6f cd 04 0f  Error: UNC at LBA = 0x0f04cd6f = 251972975
    
    Commands leading to the command that caused the error were:
    CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
    -- -- -- -- -- -- -- --  ----------------  --------------------
    60 69 38 17 cd 04 40 00   2d+11:27:29.750  READ FPDMA QUEUED
    61 10 30 98 12 55 40 00   2d+11:27:29.750  WRITE FPDMA QUEUED
    61 01 28 57 86 da 40 00   2d+11:27:29.750  WRITE FPDMA QUEUED
    60 09 20 f7 d1 04 40 00   2d+11:27:29.750  READ FPDMA QUEUED
    60 80 18 00 d2 04 40 00   2d+11:27:29.750  READ FPDMA QUEUED
    ...
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%      9700         -
    # 2  Short offline       Completed without error       00%      9676         -
    # 3  Extended offline    Completed: read failure       90%      9673         251972659
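
Since the controller does not surface this condition, the pending-sector count can be watched directly from the smartctl output. This is a rough sketch only: it assumes the ten-column attribute layout shown above (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value), and `smart_raw_value` is a hypothetical helper name, not part of smartmontools.

```python
def smart_raw_value(smartctl_output, attribute_id=197):
    """Return the raw value of a SMART attribute, or None if it is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have ten columns, a numeric ID and a hex flag such as 0x0022.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        if int(fields[0]) == attribute_id:
            # The raw value is the tenth column; keep only its leading number.
            return int(fields[9].split()[0])
    return None
```

Fed the output of `smartctl -d sat+megaraid,7 -a /dev/sda`, the function returns the Current_Pending_Sector raw value, which a cron job could compare against the previous run to detect an increase.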
    

    Output from MegaCli -AdpAllInfo -aAll:

    Product Name    : LSI MegaRAID SAS 9260-4i
    ...
    ================
    Virtual Drives    : 2
      Degraded        : 0
      Offline         : 0
    Physical Devices  : 5
      Disks           : 4
      Critical Disks  : 0
      Failed Disks    : 0
    

    Please advise whether the RAID controller behaviour is normal or whether there is a misconfiguration somewhere. The controller should be in its factory state; I have only configured the four physical disks as two RAID1 volumes.

    The bad disk will be replaced anyway.

    Update: I have learned that there is in fact a way to learn about this type of error (see the answer). However, I expected this kind of information to be shown in a more prominent status display, not buried in the log files.

    It seems that the RAID controller did not flag this disk because it could still recover from this error condition.
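
The "Unexpected sense" entry in the controller log can be decoded to see what the drive itself reported. The sketch below assumes the standard SCSI READ(10) layout (opcode 0x28, big-endian LBA in bytes 2 to 5, transfer length in bytes 7 and 8) and reads the `Sense: 3/11/00` triple as sense key / ASC / ASCQ in hex; the function names and the small lookup tables are illustrative and cover only the values seen in this log.

```python
# Only the codes that occur in the log entry are listed here.
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ADDITIONAL_SENSE = {(0x11, 0x00): "UNRECOVERED READ ERROR"}


def decode_read10(cdb_hex):
    """Return (lba, block_count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    block_count = int.from_bytes(cdb[7:9], "big")
    return lba, block_count


def describe_sense(triple):
    """Turn a 'key/asc/ascq' hex triple into a readable string."""
    key, asc, ascq = (int(part, 16) for part in triple.split("/"))
    key_name = SENSE_KEYS.get(key, "sense key 0x%x" % key)
    detail = ADDITIONAL_SENSE.get((asc, ascq), "ASC/ASCQ %02x/%02x" % (asc, ascq))
    return "%s: %s" % (key_name, detail)


lba, block_count = decode_read10("28 00 0f 04 d1 f7 00 01 e0 00")
print(lba, block_count)            # 251974135 480
print(describe_sense("3/11/00"))   # MEDIUM ERROR: UNRECOVERED READ ERROR
```

Read this way, the drive failed a 480-block read starting at LBA 251974135 with an unrecovered read error, and five seconds later the controller logged a corrected medium error on the same drive. That is consistent with the conclusion above, although the log alone does not say how the controller recovered the data.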