How do you "fix" a faulty path in device-mapper-multipath

There's a subtle bug in your multipath.conf: vendor and product are matched at the regexp level, and the trailing spaces you've added are causing multipathd to fail to match your configuration against the actual devices on the system. If you examine the output of echo 'show config' | multipathd -k you will find two device sections for your SAN: one that matches all the extra spaces you added, and the default config (should it exist) provided by the internal database.
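
To make that concrete, something along these lines dumps the live configuration and pulls out any device stanzas mentioning the array (the grep context sizes are arbitrary; widen them if your stanzas are longer):

    # dump multipathd's running config and show the stanzas for this vendor
    echo 'show config' | multipathd -k | grep -B2 -A12 XIOTECH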

Adjust your multipath.conf to look like this:

            vendor                  "XIOTECH "
            product                 "ISE1400.*"
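
For context, a sketch of the complete device stanza from your posted config with only the product line changed (the callout lines are carried over unchanged; as noted in the comments below, the built-in defaults may actually serve you better):

    devices {
            device {
                    vendor                  "XIOTECH "
                    product                 "ISE1400.*"
                    path_grouping_policy    multibus
                    getuid_callout          "/sbin/scsi_id -g -u -d /dev/%n"
                    path_checker            tur
                    prio_callout            "none"
                    path_selector           "round-robin 0"
                    failback                immediate
                    no_path_retry           12
                    user_friendly_names     yes
            }
    }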

The SCSI INQUIRY vendor field is a fixed 8 characters; if the name doesn't use all 8, the field is padded with spaces to reach 8 characters. Multipathd interprets the spec to the letter of the law; you could also have used "XIOTECH.*" if you really want to be sure.
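
If you want to see exactly what the array returned in its INQUIRY data, trailing padding and all, the kernel exposes it through sysfs (sdb here is one of the paths from the question below):

    # vendor is 8 characters, model (product) is 16, both space-padded
    cat /sys/block/sdb/device/vendor
    cat /sys/block/sdb/device/model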

Once you make these changes, stop multipathd using your initscripts, run multipath -F to flush the existing maps, and then start multipathd again. Your config file should be honored now. If you still have problems, reboot.
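
On a SysV-initscript system of that era (RHEL/CentOS 5-ish is assumed here), the sequence is roughly:

    service multipathd stop
    multipath -F                 # flush the existing multipath maps
    service multipathd start
    multipath -ll                # confirm the paths come back and the tur checker is in use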

If there's ever any doubt about whether your config file is being honored, examine the running config with the echo incantation above and compare what's loaded in multipathd's database against your config file.
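
The same check can be done interactively; multipathd -k drops you into the daemon's own prompt, where show config and show paths are the useful commands (exit with CTRL-D):

    [root@nas ~]# multipathd -k
    multipathd> show config
    multipathd> show paths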

Comments

  • Lennert (almost 2 years ago)

    I have a multipath config that was working but now shows a "faulty" path:

    [root@nas ~]# multipath -ll
    sdd: checker msg is "readsector0 checker reports path is down"
    mpath1 (36001f93000a63000019f000200000000) dm-2 XIOTECH,ISE1400
    [size=200G][features=0][hwhandler=0][rw]
    \_ round-robin 0 [prio=1][active]
     \_ 1:0:0:1 sdb 8:16  [active][ready]
    \_ round-robin 0 [prio=0][enabled]
     \_ 2:0:0:1 sdd 8:48  [active][faulty]
    

    At the same time I'm seeing these three lines over and over in /var/log/messages

    Feb  5 12:52:57 nas kernel: sd 2:0:0:1: SCSI error: return code = 0x00010000
    Feb  5 12:52:57 nas kernel: end_request: I/O error, dev sdd, sector 0
    Feb  5 12:52:57 nas kernel: Buffer I/O error on device sdd, logical block 0
    

    And this line shows up fairly often too

    Feb  5 12:52:58 nas multipathd: sdd: readsector0 checker reports path is down
    

    One thing I don't understand is why it's using the readsector0 checking method when my /etc/multipath.conf file says to use tur:

    [root@nas ~]# tail -n15 /etc/multipath.conf

    devices {
            device {
                    vendor                  "XIOTECH "
                    product                 "ISE1400         "
                    path_grouping_policy    multibus
                    getuid_callout          "/sbin/scsi_id -g -u -d /dev/%n"
                    path_checker            tur
                    prio_callout              "none"
                    path_selector           "round-robin 0"
                    failback                    immediate
                    no_path_retry           12
                    user_friendly_names yes
            }
    }
    

    Looking at the upstream documentation (http://christophe.varoqui.free.fr/usage.html), this paragraph seems relevant:

    For each path:
    
    \_ host:channel:id:lun devnode major:minor [path_status][dm_status_if_known]
    
    The dm status (dm_status_if_known) is like the path status
    (path_status), but from the kernel's point of view. The dm status has two
    states: "failed", which is analogous to "faulty", and "active" which
    covers all other path states. Occasionally, the path state and the 
    dm state of a device will temporarily not agree. 
    

    It's been well over 24 hours for me, so it's not temporary.

    So with all that as background, my questions are:
    - How can I determine the root cause here?
    - How can I manually perform, from the command line, whatever check it's doing? (the closest I've gotten is sketched after this list)
    - Why is it ignoring my multipath.conf (did I do it wrong)?
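
    For reference, the closest I can get to running those checks by hand, assuming sg3_utils is installed (sdd is the faulty path from the output above):

    # TUR, which is what the configured "tur" path checker issues
    sg_turs /dev/sdd

    # rough equivalent of the "readsector0" checker: read the first sector
    dd if=/dev/sdd of=/dev/null bs=512 count=1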

    Thanks in advance for any ideas; if there's anything else I can provide for info, let me know in a comment and I'll edit it into the post.

    • Admin (over 14 years ago)
      Hmmm, for IBM arrays I don't attach spaces to vendor/model and they are recognized properly. To show why you don't get the tur path checker, please paste the relevant snippets from multipath -d -v3.
    • Admin (over 14 years ago)
      Thank you kubanskamac, it turns out it is picky about the spaces, and that's why my config snippet was being ignored and multipath was reverting to the defaults. The good news is that the defaults actually work better, as that getuid_callout line fails when I clean up the spaces. Chalk this up to me trusting terrible vendor documentation too much.
    • Admin (about 13 years ago)
      So what about the root cause of your problem? Is there a faulty disk or have you lost a path to the disk?