LSI MegaRAID : what does "transient error detected while communicating with PD : -:-:1" mean?


Apparently this error was due to the type of disks used. LSI responded to my support ticket with the following:

the SAMSUNG HD103UJ has not been qualified as a compatible hard drive. The error and subsequent time-out event is caused by a communication issue due to the error reporting mechanism used by desktop-level hard drives, which are not intended for RAID functionality.

I was not aware that this was an issue, but after further testing I believe this must indeed be the root cause. I've changed backplanes and SAS cables with no success, and I've run "stress" tests on both the OS virtual disk (using enterprise Dell disks) and the DATA disk (using desktop Samsung disks). I only received these errors when running the "stress" test on the DATA disks.

So, I assume there's no way around this issue other than actually buying enterprise disks, e.g. the "Western Digital® RE Enterprise 2TB", which is supported by LSI. So much for trying to reuse hardware.

UPDATE (March 11, 2013)

The controller runs 2 arrays, a RAID1 using WD enterprise disks and a RAID6 using SAMSUNG desktop disks. This weekend the RAID1 array degraded. The log was flooded with the error message from my original post. The weird thing is that the RAID1 array uses enterprise disks. Could it really be that an issue with one of the SAMSUNG disks in the RAID6 array causes one of the WD disks in the RAID1 array to be evicted? That seems like odd behaviour to me.

UPDATE (May 29, 2015)

It's been a while since I dealt with this issue. I believe the actual cause was the power supply. I had connected all 4 backplanes to the same power connector (using splitters). At peaks in power consumption, disks would "fall out" because enough power could not be delivered. I fixed this by using two power connectors instead, each split across two backplanes.


Author: sbrattla

Updated on September 18, 2022

Comments

  • sbrattla
    sbrattla almost 2 years

    I've got an LSI MegaRAID 9260-16i card running in a server, and it keeps logging the error

    Controller ID: 0 Transient error detected while communicating with PD: -:-:1
    

    I can't find anything about this message anywhere (documentation, google, forums etc.). What does this message mean?

    • the-wabbit
      the-wabbit over 11 years
      It means communication errors for one of your disks have occurred. Could you post the output of MegaCli -PDlist -Aall so we could see your physical drive config and the error counters?
    • sbrattla
      sbrattla over 11 years
      I did replace the disk with a new one, and the error message seems to be (temporarily?) gone. However, now I get quite a lot of "Controller ID: 0 Unexpected sense: PD = -:-:1-Power on, reset, or bus device reset occured [...]". The error level for those messages is "Information", so it can't be that bad... but can I do anything about them?
    • sbrattla
      sbrattla over 11 years
      Ah, the "Transient error" appeared again. I don't have MegaCli available, but the LSI Storage Manager says that the "Media Error Count" is 0. I've tried with about 3 or 4 disks which all should be alright, and I keep getting the error. Could it be the cables or backplane which is acting up?
    • the-wabbit
      the-wabbit over 11 years
      likely, but how would you know which PD is :1? If your system is under warranty, you should contact technical support for a part replacement.
    • sbrattla
      sbrattla over 11 years
      Isn't PD referring to Physical Drive #1? I've assumed that it refers to that slot/drive, since that also happens to be the slot/drive which has gone offline several times and which the RAID has had multiple problems with...?
  • longneck
    longneck over 11 years
    SATA disks and RAID are not a good idea. See: serverfault.com/questions/452246/…
  • sbrattla
    sbrattla over 11 years
    @longneck: The post you linked to does not explicitly say that SATA disks in RAID are a bad idea. It says that consumer-grade SATA disks in RAID are a bad idea (in other words, the same conclusion I came to myself in my answer above). I do see that it was a rather dumb choice to go for consumer-grade disks, but there is a reason why I chose SATA as opposed to SAS disks: price vs. storage capacity. SAS disks are expensive once you reach 1TB and 2TB. I need this size, and at that price SAS disks are not an option.
  • longneck
    longneck over 11 years
    what you want are midline sas drives. sata-like prices with sas controllers.
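
the-wabbit's suggestion above (look at the per-drive error counters in the output of MegaCli -PDlist -Aall) could be automated roughly as follows. This is only a sketch: the field labels ("Enclosure Device ID", "Slot Number", "Media Error Count", "Other Error Count", "Predictive Failure Count", "Firmware state") are assumed to match what your MegaCli version prints, and the function names are made up for illustration.

```python
from typing import Dict, List

# Counter labels assumed to appear in the MegaCli physical drive listing.
COUNTER_FIELDS = ("Media Error Count", "Other Error Count", "Predictive Failure Count")
# Extra descriptive labels worth keeping alongside the counters.
INFO_FIELDS = ("Firmware state", "Inquiry Data")


def parse_pdlist(text: str) -> List[Dict[str, str]]:
    """Split the listing into one dict per physical drive.

    A new drive record is assumed to start at each "Enclosure Device ID" line.
    """
    drives: List[Dict[str, str]] = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "label: value" line
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            current = {"enclosure": value}
            drives.append(current)
        elif current is None:
            continue  # header lines before the first drive
        elif key == "Slot Number":
            current["slot"] = value
        elif key in COUNTER_FIELDS or key in INFO_FIELDS:
            current[key] = value
    return drives


def _to_int(value: str) -> int:
    """Treat anything that is not a plain integer as zero."""
    try:
        return int(value)
    except ValueError:
        return 0


def drives_with_errors(drives: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the drives where at least one error counter is non-zero."""
    return [
        d for d in drives
        if any(_to_int(d.get(field, "0")) > 0 for field in COUNTER_FIELDS)
    ]
```

To use it, capture the command output into a string (for example by redirecting it to a file and reading that file) and pass it to parse_pdlist. A drive that keeps a "Media Error Count" of 0 but shows a growing "Other Error Count" would, as far as I understand these counters, be consistent with a communication problem (cabling, backplane, power) rather than a failing platter, which fits what was observed in this thread.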