How can I tell if a disk is failing on ESXi / what do these errors mean?

vmware-esxi drive-failure debian

12,775

Those errors do indeed indicate one of the SATA drives has encountered enough bad blocks to run out of block-reallocation space. That drive is bad, and needs to be replaced. I'm not sure where in the vSphere client that information is displayed, but the log entries are pretty clear.

If your hardware allows it, a hot-swap should be doable. Otherwise, you're going to have to shut everything down to perform the change-out. If that VM hasn't shut itself down after 30 minutes, it's time to just hard-terminate it. It's risky, but if it is truly hung there isn't much for it.

12,775

Josh

I am Josh Gitlin, CTO and co-founder of Digital Fruition a software as a service eCommerce company. Currently serving as Principal DevOps Engineer at Pinnacle 21, and hacking away at Cinc Server, the free-as-in-beer rebranded distribution of Chef Server.

Updated on September 18, 2022

Comments

Josh almost 2 years
I have a server running VMware ESXi v4.1.0 348481. It has a hardware RAID10 and a SATA backup drive. I have a VM running which has it's primary boot vmdk on the RAID10 datastore, and a 600 GB vmdk on the SATA backup drive's datastore. The VM runs Debian linux with the FreeBSD kernel, and uses ZFS for the backup drive.

EDIT: The drive is not directly attached to the VM. It is used as a VMware Datastore, and the VM has a vmdk on the SATA drive's datastore. The datastore is not full (only 65% full)

I logged in to the server using SSH and found that last night backup was hung, and zfs list or zpool list both hung. So I opened the virtual console in ESXi and was sad to see:
```
(da1:mpt0:0:1:0): READ(10). CDC: 28 0 19 97 3a 50 0 0 2d 0
(da1:mpt0:0:1:0): CAM status: SCSI Status Error
(da1:mpt0:0:1:0): SCSI status: Check Condition
(da1:mpt0:0:1:0): SCSI sense: MEDIUM ERROR info:4862ec asc:11,4 (Unrecovered read error - auto reallocate failed)
(da1:mpt0:0:1:0): READ(10). CDC: 28 0 19 97 3a 50 0 0 2d 0
(da1:mpt0:0:1:0): CAM status: SCSI Status Error
(da1:mpt0:0:1:0): SCSI status: Check Condition
(da1:mpt0:0:1:0): SCSI sense: MEDIUM ERROR info:4862ec asc:11,4 (Unrecovered read error - auto reallocate failed)
```
I tried to reboot the VM and I received a message that the system was going down for reboot, and then that hung. (^C appears but does not kill shutdown). I cannot interrupt or kill -9 the zpool list zfs list or rsync processes -- Nothing happens when I try.
1. Does this ndicate the backup SATA drive is failing? Or could this just be an ESXi error?
2. How in the vSphere client could I tell if the drive is failing? I didn't see any indication, everything under Hardware Health Status looks good, and I saw nothing under the Storage config.
3. How should I proceed from here? Should I just hard reboot the VM?
UPDATE: I just hard rebooted the VM. After it came back online, the backup zpool was online, however:
```
root@timestandstill:/home/jnet# zpool status -v
  pool: backup
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        backup      ONLINE       0     0     0
          da1       ONLINE       0     0     0


errors: Permanent errors have been detected in the following files:

        /backups/someserver/home/someuser/public_html/somedir/calendar/someuser/calendars/somefile.ics
```
I am leaning heavily towards replacing the drive...
Josh almost 13 years

Thanks @sysadmin1138. So, even though those errors show inside the VM, it clearly indicates a true hardware problem? I.E. it's not just some VMware quirk?
Josh almost 13 years

I should have clarified that. It is file-backed storage. The datastore has plenty of space: it's a 1TB drive and the VMDK is 600 GB, nothing else is on the disk. Either way, it's clearly something wrong, so unless someone else gives me a clear explanation of what this could be, I will replace the drive.