How do you "fix" a faulty path in device-mapper-multipath

There's a subtle bug in your multipath.conf: vendor and product are matched at the regexp level, and the trailing spaces you've added are causing multipathd to fail to match your configuration against the actual devices on the system. If you examine the output of echo 'show config' | multipathd -k you will find two device sections for your SAN: one that matches all the extra spaces you added, and the default config (should it exist) provided by the internal database.
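
To make that concrete, something along these lines dumps the live configuration and pulls out any device stanzas mentioning the array (the grep context sizes are arbitrary; widen them if your stanzas are longer):

    # dump multipathd's running config and show the stanzas for this vendor
    echo 'show config' | multipathd -k | grep -B2 -A12 XIOTECH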

Adjust your multipath.conf to look like this:

            vendor                  "XIOTECH "
            product                 "ISE1400.*"
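
For context, a sketch of the complete device stanza from your posted config with only the product line changed (the callout lines are carried over unchanged; as noted in the comments below, the built-in defaults may actually serve you better):

    devices {
            device {
                    vendor                  "XIOTECH "
                    product                 "ISE1400.*"
                    path_grouping_policy    multibus
                    getuid_callout          "/sbin/scsi_id -g -u -d /dev/%n"
                    path_checker            tur
                    prio_callout            "none"
                    path_selector           "round-robin 0"
                    failback                immediate
                    no_path_retry           12
                    user_friendly_names     yes
            }
    }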

The SCSI INQUIRY vendor field is a fixed 8 characters; if the name doesn't use all 8, the field is padded with spaces to reach 8 characters. Multipathd interprets the spec to the letter of the law; you could also have used "XIOTECH.*" if you really want to be sure.
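
If you want to see exactly what the array returned in its INQUIRY data, trailing padding and all, the kernel exposes it through sysfs (sdb here is one of the paths from the question below):

    # vendor is 8 characters, model (product) is 16, both space-padded
    cat /sys/block/sdb/device/vendor
    cat /sys/block/sdb/device/model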

Once you make these changes, stop multipathd using your initscripts, run multipath -F to flush the existing maps, and then start multipathd again. Your config file should be honored now. If you still have problems, reboot.
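
On a SysV-initscript system of that era (RHEL/CentOS 5-ish is assumed here), the sequence is roughly:

    service multipathd stop
    multipath -F                 # flush the existing multipath maps
    service multipathd start
    multipath -ll                # confirm the paths come back and the tur checker is in use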

If there's ever any doubt about whether your config file is being honored, examine the running config with the echo incantation above and compare what's loaded in multipathd's database against your config file.
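
The same check can be done interactively; multipathd -k drops you into the daemon's own prompt, where show config and show paths are the useful commands (exit with CTRL-D):

    [root@nas ~]# multipathd -k
    multipathd> show config
    multipathd> show paths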

Comments

  • Lennert (almost 2 years ago)

    I have a multipath config that was working but now shows a "faulty" path:

    [root@nas ~]# multipath -ll
    sdd: checker msg is "readsector0 checker reports path is down"
    mpath1 (36001f93000a63000019f000200000000) dm-2 XIOTECH,ISE1400
    [size=200G][features=0][hwhandler=0][rw]
    \_ round-robin 0 [prio=1][active]
     \_ 1:0:0:1 sdb 8:16  [active][ready]
    \_ round-robin 0 [prio=0][enabled]
     \_ 2:0:0:1 sdd 8:48  [active][faulty]
    

    At the same time I'm seeing these three lines over and over in /var/log/messages

    Feb  5 12:52:57 nas kernel: sd 2:0:0:1: SCSI error: return code = 0x00010000
    Feb  5 12:52:57 nas kernel: end_request: I/O error, dev sdd, sector 0
    Feb  5 12:52:57 nas kernel: Buffer I/O error on device sdd, logical block 0
    

    And this line shows up fairly often too

    Feb  5 12:52:58 nas multipathd: sdd: readsector0 checker reports path is down
    

    One thing I don't understand is why it's using the readsector0 checking method when my /etc/multipath.conf file says to use tur:

    [root@nas ~]# tail -n15 /etc/multipath.conf

    devices {
            device {
                    vendor                  "XIOTECH "
                    product                 "ISE1400         "
                    path_grouping_policy    multibus
                    getuid_callout          "/sbin/scsi_id -g -u -d /dev/%n"
                    path_checker            tur
                    prio_callout              "none"
                    path_selector           "round-robin 0"
                    failback                    immediate
                    no_path_retry           12
                    user_friendly_names yes
            }
    }
    

    Looking at the upstream documentation (http://christophe.varoqui.free.fr/usage.html), this paragraph seems relevant:

    For each path:
    
    \_ host:channel:id:lun devnode major:minor [path_status][dm_status_if_known]
    
    The dm status (dm_status_if_known) is like the path status
    (path_status), but from the kernel's point of view. The dm status has two
    states: "failed", which is analogous to "faulty", and "active" which
    covers all other path states. Occasionally, the path state and the 
    dm state of a device will temporarily not agree. 
    

    It's been well over 24 hours for me, so it's not temporary.

    So with all that as background, my questions are:
    - How can I determine the root cause here?
    - How can I manually perform, from the command line, whatever check it's doing? (the closest I've gotten is sketched after this list)
    - Why is it ignoring my multipath.conf (did I do it wrong)?
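
    For reference, the closest I can get to running those checks by hand, assuming sg3_utils is installed (sdd is the faulty path from the output above):

    # TUR, which is what the configured "tur" path checker issues
    sg_turs /dev/sdd

    # rough equivalent of the "readsector0" checker: read the first sector
    dd if=/dev/sdd of=/dev/null bs=512 count=1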

    Thanks in advance for any ideas; if there's anything else I can provide for info, let me know in a comment and I'll edit it into the post.

    • Admin (over 14 years ago)
      Hmmm, for IBM arrays I don't attach spaces to vendor/model and they are recognized properly. To show why you don't get the tur path checker, please paste the relevant snippets from multipath -d -v3.
    • Admin (over 14 years ago)
      Thank you kubanskamac, it turns out it is picky about the spaces, and that's why my config snippet was being ignored and multipath was reverting to the defaults. The good news is that the defaults actually work better, as that getuid_callout line fails when I clean up the spaces. Chalk this up to me trusting terrible vendor documentation too much.
    • Admin (about 13 years ago)
      So what about the root cause of your problem? Is there a faulty disk or have you lost a path to the disk?