LVM is reporting I/O errors, but the disk reports no problems. Argh
Okay, I think the answer is that the COW space for the logical volume is full.
Using the command 'lvs' (which I just discovered), I see...
# lvs
/dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
[...other LVs...]
newvm-cdrive mrburns Swi-I- 2.00G original-cdrive 100.00
[...other LVs...]
That capital 'S' at the start of the 'Attr' column means 'invalid snapshot'. (A lower-case 's' would mean a valid snapshot.) And as you can see, Snap% is 100, i.e., it has used all of its COW space.
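If you just want that percentage on its own (say, for a cron check), lvs can select output fields; snap_percent is the field name given in the lvs man page, though I haven't checked how far back it's supported:
# lvs --noheadings -o lv_name,snap_percent vgroup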
Annoyingly, lvdisplay doesn't provide this information, and it doesn't tell you that your snapshot logical volume is invalid. (All it says is that the snapshot status is 'INACTIVE', which I took to mean 'not currently in use'.) And the lvs command is not very widely advertised. And the error message ("Input/output error") isn't very helpful; in fact there were no log messages or error messages suggesting 'snapshot is full'. (Later versions of LVM2 write messages to /var/log/messages when the space is starting to fill up, but the version in Debian Lenny doesn't. Boo.)
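(Relatedly, newer LVM2 releases can have dmeventd monitor a snapshot and grow it automatically before it fills; the relevant lvm.conf settings are, I believe, these two in the activation section, though nothing like this exists in Lenny's version:)
activation {
    snapshot_autoextend_threshold = 70   # start auto-extending once the snapshot is 70% full
    snapshot_autoextend_percent = 20     # each time, grow it by 20% of its current size
}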
And to compound the problem, there's no discussion of this on the internet (or at least, not that I could find)!
I did wonder why a full COW snapshot can't be fixed by just adding more space to the LV (using lvextend). But, actually, the COW space is required not only when you write to the snapshot destination, but also when you write to the snapshot source. So once your COW area has filled up, any write to the source LV must necessarily invalidate the snapshot LV, and it is not easily recoverable.
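One practical upshot (my own rule of thumb, not official LVM guidance): a snapshot whose COW space is as large as the origin can never fill, because each chunk of the origin is copied at most once. For the 10 GB origin here that would be:
# lvcreate -L 10G --snapshot --name newvm-cdrive /dev/vgroup/original-cdrive
at the cost of reserving the full 10 GB up front.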
Comments
-
andrewf almost 2 years
I've started seeing errors reported by LVM on certain Logical Volumes (and by Xen when attempting to create virtual machines on these LVs). But I've run tests on the disk, and can't see any hardware problems.
We're running a XEN/Linux (Debian Lenny) box here, running off a single SATA disk managed with LVM2. It's been up and running for more than a year, with the only major change being a recent apt-get upgrade of the kernel.
# uname -a
Linux hostname 2.6.26-2-xen-amd64 #1 SMP Thu Sep 16 16:32:15 UTC 2010 x86_64 GNU/Linux
The errors appear like this:
# vgck
/dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
And then when I try to start the VM which uses that LV for its C-drive (it's a Windows virtual machine), the VM refuses to start and I see this at the end of the /var/log/xen/qemu-dm-*.log log file:
...
Register xen platform.
Done register platform.
raw_read(6:/dev/vgroup/newvm-cdrive, 0, 0x7fff02bca520, 512) [20971520] read failed -1 : 5 = Input/output error
I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
raw_read(6:/dev/vgroup/newvm-cdrive, 0, 0x12dfff0, 512) [20971520] read failed -1 : 5 = Input/output error
This first happened on 2 VMs whose disks were based on a snapshot of a third, original VM. I nuked the 2 LVs and recreated them (again by snapshotting the same, original VM's LV), and they've been fine since.
However, today I attempted to create a new VM. I snapshotted the same, original VM's LV (lvcreate -L500M --snapshot --name newvm-cdrive /dev/vgroup/original-cdrive) and created the new VM. It initially worked, but after shutting down the VM once, it refuses to start up again, with the errors shown above.
My obvious first guess would be physical problems with the drive, but smartctl does not report anything:
# smartctl -t long /dev/sda
# [later]
# smartctl -l selftest /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%          1        -
# 2  Short offline       Completed without error       00%          0        -
Also, I'm not getting any errors from badblocks.
I've tried running vgck and pvck:
# vgck vgroup -v
    Using volume group(s) on command line
    Finding volume group "vgroup"
  /dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
# pvck /dev/sda2
  Found label on /dev/sda2, sector 1, type=LVM2 001
  Found text metadata area: offset=4096, size=192512
Have found a few references to this error message ("read failed after 0 of 4096 at...") on the Interwebs, but nothing which seems to apply to my situation.
Any ideas?
Update: As requested, below is the output of lvdisplay and ls -l. Running out of COW space is plausible. How do I tell?
# lvdisplay /dev/vgroup/newvm-cdrive
  /dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
  --- Logical volume ---
  LV Name                /dev/vgroup/newvm-cdrive
  VG Name                vgroup
  LV UUID                jiarxt-q2NO-SyIf-5FrW-I9iq-mNEQ-iwS4EH
  LV Write Access        read/write
  LV snapshot status     INACTIVE destination for /dev/vgroup/original-cdrive
  LV Status              available
  # open                 0
  LV Size                10.00 GB
  Current LE             2560
  COW-table size         200.00 MB
  COW-table LE           50
  Snapshot chunk size    4.00 KB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:20

# ls -l /dev/dm-20
brw-rw---- 1 root disk 254, 20 2010-10-11 15:02 /dev/dm-20
And here's fdisk -l.
# fdisk -l /dev/sda

Disk /dev/sda: 160.0 GB, 160000000000 bytes
255 heads, 63 sectors/track, 19452 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000080

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          31      248976   83  Linux
/dev/sda2              32       19452   155999182+  8e  Linux LVM
-
bindbn over 13 years: please add fdisk -l
-
MastaJeet over 13 years: Maybe your snapshots are running out of space. What is the output of lvdisplay? Also, can you give the major/minor numbers for dm-20 (ls -l /dev/dm-20)?
-
andrewf over 13 years: @embobo, I just tried "lvextend +1.8G /dev/vgroup/newvm-cdrive", to expand the snapshot COW space. The extend succeeded, but I'm still seeing the same errors. (Maybe LVM just gives up in confusion once it's run out of COW space and adding more has no effect?)
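(For the record, lvextend takes the size via the -L flag, so the working form of that command would presumably be
# lvextend -L +1.8G /dev/vgroup/newvm-cdrive
which grows the snapshot's COW area by 1.8 GB.)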
-
andrewf over 13 years: Okay, just discovered the "lvs" command, and it says that the logical volume has a Snap% of 100.00... so I'm guessing you're right, @embobo. So I guess the question now becomes: is the LV stuffed? Should I just nuke it and recreate, or can I recover it?
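(Per the answer at the top, an invalid snapshot isn't recoverable, so nuke-and-recreate is the way; with the same names as before, that would look something like
# lvremove /dev/vgroup/newvm-cdrive
# lvcreate -L 10G --snapshot --name newvm-cdrive /dev/vgroup/original-cdrive
with the COW size bumped well up so it doesn't fill again.)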