LVM is reporting I/O errors, but the disk reports no problems. Argh


Okay, I think the answer is that the COW space for the logical volume is full.

Using the command 'lvs' (which I just discovered), I see...

# lvs
/dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
LV             VG      Attr   LSize   Origin          Snap%  Move Log Copy%  Convert
[...other LVs...]
newvm-cdrive   mrburns Swi-I-   2.00G original-cdrive 100.00
[...other LVs...]

That capital 'S' at the start of the 'Attr' column means 'invalid snapshot'. (A lower-case 's' would mean a valid snapshot.) And as you can see, Snap% is 100.00, i.e., the snapshot has used all of its COW space.

Annoyingly, lvdisplay doesn't provide this information, and it doesn't tell you that your snapshot logical volume is invalid. (All it says is that the snapshot status is 'INACTIVE', which I took as meaning 'not currently in use'.) And the lvs command is not very widely advertised. And the error message ("Input/output error") isn't very helpful--in fact there were no log messages or error messages which suggested 'snapshot is full'. (Later versions of LVM2 write messages to /var/log/messages when the space is starting to fill up, but the version in Debian Lenny doesn't. Boo.)
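Since the Lenny version won't warn you, one workaround is a cron-able check that parses `lvs` output and complains before the COW area fills. A minimal sketch, assuming the `--noheadings -o lv_name,attr,snap_percent` options of `lvs` (the sample output fed in below is illustrative, standing in for real `lvs` output):

```shell
#!/bin/sh
# Warn about snapshot LVs whose COW usage is at or above a threshold.
# In production you'd feed in real output from:
#   lvs --noheadings -o lv_name,attr,snap_percent
# Here a captured sample is used so the parsing logic is visible.
THRESHOLD=90

check_snapshots() {
    # Reads "name attr snap%" lines on stdin; a snapshot's attr string
    # starts with 's' (valid) or 'S' (invalid). Prints one warning per
    # snapshot at or above the threshold.
    awk -v limit="$THRESHOLD" '
        $2 ~ /^[sS]/ && $3 + 0 >= limit {
            printf "WARNING: snapshot %s at %s%% of COW space\n", $1, $3
        }'
}

check_snapshots <<'EOF'
  root           -wi-ao
  newvm-cdrive   Swi-I-  100.00
  othervm-cdrive swi-a-  42.00
EOF
# prints: WARNING: snapshot newvm-cdrive at 100.00% of COW space
```

Hooked up to `mail` in a cron job, that would have flagged the snapshot well before it went invalid.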

And to compound the problem, there's no discussion of this on the internet (or at least, not that I could find)!

I did wonder why a full COW snapshot can't be fixed by just adding more space to the LV (using lvextend), but, actually, the COW space is consumed not only when you write to the snapshot destination, but also when you write to the snapshot source. So once your COW area has filled up, any further write to the source LV must necessarily invalidate the snapshot LV, and it's not easily recoverable.
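For reference, later LVM2 releases (well after Lenny's) go further than just logging: with dmeventd monitoring enabled they can grow a snapshot's COW area automatically before it fills. A sketch of the relevant lvm.conf settings (parameter names as in newer upstream LVM2; not available on Lenny):

```
# /etc/lvm/lvm.conf, activation section (newer LVM2 only)
activation {
    # When a snapshot exceeds 70% COW usage...
    snapshot_autoextend_threshold = 70
    # ...grow its COW area by 20% of its current size.
    snapshot_autoextend_percent = 20
}
```

This only helps while the snapshot is still valid; once Snap% hits 100, the snapshot cannot be rescued.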

Author: andrewf

Updated on September 17, 2022

Comments

  • andrewf
    andrewf almost 2 years

    I've started seeing errors reported by LVM on certain Logical Volumes (and by Xen when attempting to create virtual machines on these LVs). But I've run tests on the disk, and can't see any hardware problems.

    We're running a Xen/Linux (Debian Lenny) box here, running off a single SATA disk managed with LVM2. It's been up and running for more than a year, with the only major change being a recent apt-get upgrade of the kernel.

    # uname -a
    Linux hostname 2.6.26-2-xen-amd64 #1 SMP Thu Sep 16 16:32:15 UTC 2010 x86_64 GNU/Linux
    

    The errors appear like this:

    # vgck
    /dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
    

    And then when I try to start the VM which uses that LV for its C-drive (it's a Windows virtual machine), the VM refuses to start and I see this at the end of the /var/log/xen/qemu-dm-*.log log file:

    ...
    Register xen platform.
    Done register platform.
    raw_read(6:/dev/vgroup/newvm-cdrive, 0, 0x7fff02bca520, 512) [20971520] read failed -1 : 5 = Input/output error
    I/O request not ready: 0, ptr: 0, port: 0, data: 0, count: 0, size: 0
    raw_read(6:/dev/vgroup/newvm-cdrive, 0, 0x12dfff0, 512) [20971520] read failed -1 : 5 = Input/output error
    

    This first happened on 2 VMs whose disk was based on a snapshot of a third, original VM. I nuked the 2 LVs and recreated them (again by snapshotting the same, original VM's LV), and they've been fine since.

    However, today I attempted to create a new VM. I snapshotted the same, original VM's LV (lvcreate -L500M --snapshot --name newvm-cdrive /dev/vgroup/original-cdrive), and created the new VM. It initially worked, but after shutting down the VM once, it refuses to start up again, with the errors shown above.

    My obvious first guess would be physical problems with the drive, but smartctl does not report anything:

    # smartctl -t long /dev/sda
    # [later]
    # smartctl -l selftest /dev/sda
    smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
    Home page is http://smartmontools.sourceforge.net/
    
    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error       00%         1         -
    # 2  Short offline       Completed without error       00%         0         -
    

    Also, I'm not getting any errors from badblocks.

    I've tried running vgck and pvck:

    # vgck vgroup -v
        Using volume group(s) on command line
        Finding volume group "vgroup"
      /dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
    
    # pvck /dev/sda2
      Found label on /dev/sda2, sector 1, type=LVM2 001
      Found text metadata area: offset=4096, size=192512
    

    Have found a few references to this error message ("read failed after 0 of 4096 at...") on the Interwebs, but nothing which seems to apply to my situation.

    Any ideas?

    Update: As requested, below is output of lvdisplay and ls -l. Running out of COW space is plausible. How do I tell?

    # lvdisplay /dev/vgroup/newvm-cdrive
      /dev/dm-20: read failed after 0 of 4096 at 0: Input/output error
      --- Logical volume ---
      LV Name                /dev/vgroup/newvm-cdrive
      VG Name                vgroup
      LV UUID                jiarxt-q2NO-SyIf-5FrW-I9iq-mNEQ-iwS4EH
      LV Write Access        read/write
      LV snapshot status     INACTIVE destination for /dev/vgroup/original-cdrive
      LV Status              available
      # open                 0
      LV Size                10.00 GB
      Current LE             2560
      COW-table size         200.00 MB
      COW-table LE           50
      Snapshot chunk size    4.00 KB
      Segments               1
      Allocation             inherit
      Read ahead sectors     auto
      - currently set to     256
      Block device           254:20
    
    # ls -l /dev/dm-20
    brw-rw---- 1 root disk 254, 20 2010-10-11 15:02 /dev/dm-20
    

    And here's fdisk -l.

    # fdisk -l /dev/sda
    
    Disk /dev/sda: 160.0 GB, 160000000000 bytes
    255 heads, 63 sectors/track, 19452 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes
    Disk identifier: 0x00000080
    
       Device Boot      Start         End      Blocks   Id  System
    /dev/sda1   *           1          31      248976   83  Linux
    /dev/sda2              32       19452   155999182+  8e  Linux LVM
    
    • bindbn
      bindbn over 13 years
      add please fdisk -l
    • MastaJeet
      MastaJeet over 13 years
      Maybe your snapshots are running out of space. What is the output of lvdisplay? Also can you give the major/minor numbers for dm-20 (ls -l /dev/dm-20).
    • andrewf
      andrewf over 13 years
      @embobo, I just tried "lvextend -L+1.8G /dev/vgroup/newvm-cdrive", to expand the snapshot COW space. The extend succeeded, but I'm still seeing the same errors. (Maybe LVM just gives up in confusion once it's run out of COW space, and adding more has no effect?)
    • andrewf
      andrewf over 13 years
      Okay, just discovered the "lvs" command, and it says that the logical volume has a Snap% of 100.00... so I'm guessing you're right, @embobo. So I guess the question now becomes: is that the LV stuffed? Should I just nuke it and recreate, or can I recover it?