How can I tell if a disk is failing on ESXi / what do these errors mean?
Those errors do indeed indicate one of the SATA drives has encountered enough bad blocks to run out of block-reallocation space. That drive is bad, and needs to be replaced. I'm not sure where in the vSphere client that information is displayed, but the log entries are pretty clear.
If your hardware allows it, a hot-swap should be doable. Otherwise, you're going to have to shut everything down to perform the change-out. If that VM hasn't shut itself down after 30 minutes, it's time to just hard-terminate it. It's risky, but if it is truly hung there isn't much for it.
Related videos on Youtube
Josh
I am Josh Gitlin, CTO and co-founder of Digital Fruition a software as a service eCommerce company. Currently serving as Principal DevOps Engineer at Pinnacle 21, and hacking away at Cinc Server, the free-as-in-beer rebranded distribution of Chef Server.
Updated on September 18, 2022Comments
-
Josh almost 2 years
I have a server running VMware ESXi v4.1.0 348481. It has a hardware RAID10 and a SATA backup drive. I have a VM running which has it's primary boot vmdk on the RAID10 datastore, and a 600 GB vmdk on the SATA backup drive's datastore. The VM runs Debian linux with the FreeBSD kernel, and uses ZFS for the backup drive.
EDIT: The drive is not directly attached to the VM. It is used as a VMware Datastore, and the VM has a vmdk on the SATA drive's datastore. The datastore is not full (only 65% full)
I logged in to the server using SSH and found that last night backup was hung, and
zfs list
orzpool list
both hung. So I opened the virtual console in ESXi and was sad to see:(da1:mpt0:0:1:0): READ(10). CDC: 28 0 19 97 3a 50 0 0 2d 0 (da1:mpt0:0:1:0): CAM status: SCSI Status Error (da1:mpt0:0:1:0): SCSI status: Check Condition (da1:mpt0:0:1:0): SCSI sense: MEDIUM ERROR info:4862ec asc:11,4 (Unrecovered read error - auto reallocate failed) (da1:mpt0:0:1:0): READ(10). CDC: 28 0 19 97 3a 50 0 0 2d 0 (da1:mpt0:0:1:0): CAM status: SCSI Status Error (da1:mpt0:0:1:0): SCSI status: Check Condition (da1:mpt0:0:1:0): SCSI sense: MEDIUM ERROR info:4862ec asc:11,4 (Unrecovered read error - auto reallocate failed)
I tried to reboot the VM and I received a message that the system was going down for reboot, and then that hung. (^C appears but does not kill
shutdown
). I cannot interrupt orkill -9
thezpool list
zfs list
orrsync
processes -- Nothing happens when I try.- Does this ndicate the backup SATA drive is failing? Or could this just be an ESXi error?
- How in the vSphere client could I tell if the drive is failing? I didn't see any indication, everything under Hardware Health Status looks good, and I saw nothing under the Storage config.
- How should I proceed from here? Should I just hard reboot the VM?
UPDATE: I just hard rebooted the VM. After it came back online, the backup zpool was online, however:
root@timestandstill:/home/jnet# zpool status -v pool: backup state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: none requested config: NAME STATE READ WRITE CKSUM backup ONLINE 0 0 0 da1 ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: /backups/someserver/home/someuser/public_html/somedir/calendar/someuser/calendars/somefile.ics
I am leaning heavily towards replacing the drive...
-
Josh almost 13 yearsThanks @sysadmin1138. So, even though those errors show inside the VM, it clearly indicates a true hardware problem? I.E. it's not just some VMware quirk?
-
Josh almost 13 yearsI should have clarified that. It is file-backed storage. The datastore has plenty of space: it's a 1TB drive and the VMDK is 600 GB, nothing else is on the disk. Either way, it's clearly something wrong, so unless someone else gives me a clear explanation of what this could be, I will replace the drive.