How can I simulate a failed disk during testing?

25,687

Solution 1

There are several layers at which a disk error can be simulated. If you are testing a single user-space program, probably the simplest approach is to interpose the appropriate calls (e.g. write()) and have them sometimes return an error. The libfiu fault-injection library can do this using its fiu-run tool.
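As a rough sketch of the libfiu route, the helper below only assembles a fiu-run command line that makes POSIX I/O calls fail at random. The function name fiu_cmd is made up for illustration. The -x and -c options, the enable_random command and the posix/io/* failure-point name are how I understand the fiu-run manual, so check them against the version you have installed.

```shell
# Hypothetical helper: print a fiu-run command line that makes POSIX I/O
# calls in the given program fail with probability PROB (0..1).
# Assumption: libfiu is installed and fiu-run accepts -x (preload the POSIX
# wrappers) and -c (run a control command before starting the program).
# Arguments containing spaces are not re-quoted; this is only a sketch.
fiu_cmd() {
    prob=$1
    shift
    printf 'fiu-run -x -c "enable_random name=posix/io/*,probability=%s"' "$prob"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Print the command for review; pipe the output to sh to actually run it.
fiu_cmd 0.05 dd if=/dev/zero of=/tmp/fiu-test bs=4k count=100
```

With a 5% failure probability, a program that checks the return value of write() should start reporting I/O errors fairly quickly.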

Another approach is to use a kernel driver that passes data through to and from another device but injects faults along the way. You can then mount the device and use it from any application as if it were a faulty disk. The fsdisk driver is an example of this.

There is also a fault injection infrastructure that has been merged into the Linux kernel, although you will probably need to reconfigure your kernel to enable it. It is documented in Documentation/fault-injection/fault-injection.txt (fault-injection.rst in more recent kernel trees). This is useful for testing kernel code.

It is also possible to use SystemTap to inject faults at the kernel level. See "The SCSI fault injection test" and "Kernel Fault injection using SystemTap".

Solution 2

To add to mark4o's answer, you can also use Linux's Device Mapper to generate failing devices.

Device Mapper's delay device can be used to send read and write I/O of the same block to different underlying devices (it can also delay that I/O as its name suggests). Device Mapper's error device can be used to generate permanent errors when a particular block is accessed. By combining the two you can create a device where writes always fail but reads always succeed for a given area.
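Here is a minimal sketch of the delay-plus-error combination, applied to a whole device rather than a sub-range to keep it short. It only prints the dmsetup commands so you can review them first. The function name and the mapping names wr_err and ro_ok are illustrative, and /dev/sdX is a placeholder.

```shell
# Sketch only: print the dmsetup commands for a device whose reads go to a
# real disk while writes go to an always-failing "error" target.
# The mapping names (wr_err, ro_ok) are assumptions made for this example.
dm_write_fail_cmds() {
    dev=$1        # backing block device, e.g. /dev/sdX
    sectors=$2    # size in 512-byte sectors (see: blockdev --getsz "$dev")
    # 1. A device of the same size on which every I/O fails.
    echo "dmsetup create wr_err --table '0 $sectors error'"
    # 2. A delay target with zero delay: reads come from $dev,
    #    writes are sent to wr_err and therefore fail.
    echo "dmsetup create ro_ok --table '0 $sectors delay $dev 0 0 /dev/mapper/wr_err 0 0'"
}

dm_write_fail_cmds /dev/sdX 2097152
```

Run the printed commands as root, then point the application at /dev/mapper/ro_ok. Tear down with dmsetup remove, ro_ok first and wr_err second.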

The above is a more complicated example of what is described in the question "Simulate a faulty block device with read errors?" (see https://stackoverflow.com/a/1871029 for a simple Device Mapper example).
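For the simple case, a table along these lines maps one bad range into an otherwise healthy device. This is a sketch under my own assumptions: the function name is hypothetical, /dev/sdX is a placeholder, and all numbers are in 512-byte sectors.

```shell
# Hypothetical helper: print a Device Mapper table that passes everything
# through to DEV except for LEN sectors starting at START, which always
# return an I/O error.
dm_bad_block_table() {
    dev=$1      # backing block device
    total=$2    # total size in sectors
    start=$3    # first failing sector
    len=$4      # number of failing sectors
    end=$((start + len))
    echo "0 $start linear $dev 0"
    echo "$start $len error"
    echo "$end $((total - end)) linear $dev $end"
}

dm_bad_block_table /dev/sdX 1000 100 8
```

dmsetup reads the table from standard input when no table file is given, so as root the output can be piped straight into something like dmsetup create baddisk.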

There is also a list of Linux disk fault injection mechanisms on the "Special File that causes I/O error" Unix & Linux question.

Solution 3

A simple way to make a SCSI disk disappear with a 2.6 kernel is:

echo 1 > /sys/bus/scsi/devices/H:B:T:L/delete

(H:B:T:L is host, bus, target, LUN). To simulate the read-only case you'll have to use the fault injection methods that mark4o mentioned, though.
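If you script this, a small helper that builds the path from the four numbers avoids typos. The function name is my own; only the sysfs path layout comes from the command above.

```shell
# Hypothetical helper: turn host, bus, target and LUN numbers into the
# sysfs "delete" path, rejecting empty or non-numeric input.
scsi_delete_path() {
    for n in "$1" "$2" "$3" "$4"; do
        case $n in
            ''|*[!0-9]*)
                echo "usage: scsi_delete_path H B T L" >&2
                return 1
                ;;
        esac
    done
    echo "/sys/bus/scsi/devices/$1:$2:$3:$4/delete"
}

scsi_delete_path 5 0 0 0
# As root, this would actually drop the device:
#   echo 1 > "$(scsi_delete_path 5 0 0 0)"
# Rescanning the host should bring it back (H is the host number):
#   echo "- - -" > /sys/class/scsi_host/hostH/scan
```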

Solution 4

The Linux kernel provides a feature called “fault injection”. To make requests to a given device or partition fail:

echo 1 > /sys/block/vdd/vdd2/make-it-fail

To set up some of the options:

mkdir /debug
mount debugfs /debug -t debugfs
cd /debug/fail_make_request
echo 10 > interval # interval
echo 100 > probability # 100% probability
echo -1 > times # how many times: -1 means no limit

https://lxadm.com/Using_fault_injection
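The option writes can be wrapped in a function so the same settings are applied every time. This is only a sketch: the function name is made up, and the default directory assumes debugfs is mounted at /sys/kernel/debug, the usual location, rather than /debug as in the commands above. Passing a different directory lets you try it without root.

```shell
# Sketch: write fail_make_request settings into DIR.
# DIR defaults to the usual debugfs location (an assumption; adjust it if
# debugfs is mounted elsewhere, e.g. /debug/fail_make_request).
set_fail_make_request() {
    interval=$1
    probability=$2    # percent, 0..100
    times=$3          # -1 means no limit
    dir=${4:-/sys/kernel/debug/fail_make_request}
    printf '%s\n' "$interval"    > "$dir/interval"    || return 1
    printf '%s\n' "$probability" > "$dir/probability" || return 1
    printf '%s\n' "$times"       > "$dir/times"       || return 1
}

# Dry run against a scratch directory instead of debugfs.
scratch=$(mktemp -d)
set_fail_make_request 10 100 -1 "$scratch"
cat "$scratch/interval" "$scratch/probability" "$scratch/times"
```

On a real system, call it as root without the fourth argument, after enabling make-it-fail on the device you want to break.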

Solution 5

You can use the scsi_debug kernel module, which creates a RAM-backed SCSI disk and can inject SCSI errors through its opts and every_nth options.

See http://sg.danny.cz/sg/sdebug26.html for details.

Example of a medium error on sector 4656:

[fge@Gris-Laptop ~]$ sudo modprobe scsi_debug opts=2 every_nth=1
[fge@Gris-Laptop ~]$ sudo dd if=/dev/sdb of=/dev/null
dd: error reading ‘/dev/sdb’: Input/output error
4656+0 records in
4656+0 records out
2383872 bytes (2.4 MB) copied, 0.021299 s, 112 MB/s
[fge@Gris-Laptop ~]$ dmesg|tail
[11201.454332] blk_update_request: critical medium error, dev sdb, sector 4656
[11201.456292] sd 5:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11201.456299] sd 5:0:0:0: [sdb] Sense Key : Medium Error [current] 
[11201.456303] sd 5:0:0:0: [sdb] Add. Sense: Unrecovered read error
[11201.456308] sd 5:0:0:0: [sdb] CDB: Read(10) 28 00 00 00 12 30 00 00 08 00
[11201.456312] blk_update_request: critical medium error, dev sdb, sector 4656

You can change the opts and every_nth options at runtime via sysfs:

echo 2 | sudo tee /sys/bus/pseudo/drivers/scsi_debug/opts
echo 1 | sudo tee /sys/bus/pseudo/drivers/scsi_debug/every_nth
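Since opts is a bitmask, several error types can be combined in one value. The helper below is a sketch that ORs together the values mentioned on this page (2 for the medium error used above, plus 4, 8, 0x10 and 0x20 from the comments). The symbolic names are my own, not scsi_debug's.

```shell
# Hypothetical helper: combine scsi_debug "opts" bits by name.
# Bit values are the ones quoted on this page; the names are invented here.
scsi_debug_opts() {
    mask=0
    for name in "$@"; do
        case $name in
            medium_err)    bit=2  ;;   # MEDIUM ERROR, as in the example above
            timeout)       bit=4  ;;   # ignore the nth command (timeout)
            recovered_err) bit=8  ;;   # RECOVERED_ERROR
            transport_err) bit=16 ;;   # 0x10, ABORTED_COMMAND (transport)
            dif_err)       bit=32 ;;   # 0x20, ABORTED_COMMAND (DIF)
            *) echo "unknown option: $name" >&2; return 1 ;;
        esac
        mask=$((mask | bit))
    done
    echo "$mask"
}

scsi_debug_opts medium_err recovered_err
```

The result can then be written to the opts file shown above, for example with scsi_debug_opts timeout | sudo tee /sys/bus/pseudo/drivers/scsi_debug/opts.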
Author: MarkR


Updated on July 18, 2022

Comments

  • MarkR
    MarkR almost 2 years

    In a Linux VM (VMware Workstation or similar), how can I simulate a failure on a previously working disc?

    I have a situation happening in production where a disc fails (probably a controller, cable or firmware problem). Obviously this is not predictable or reproducible, so I want to test my monitoring to ensure that it alerts correctly.

    I'd ideally like to be able to simulate a situation where it fails writes but succeeds reads, as well as a complete failure, i.e. the scsi interface reports errors back to the kernel.

  • nick_g
    nick_g over 3 years
    Do you, by any chance, have a link for showing how to produce errors for the rest of the opts options?
  • Gris Ge
    Gris Ge over 3 years
    @nick_g Sorry, I don't. Which option are you looking for?
  • nick_g
    nick_g over 3 years
    I am trying to produce an error for each of these options of opts:
      4 - ignore "nth" command causing a timeout.
      8 - cause "nth" read or write command to yield a RECOVERED_ERROR.
      0x10 - cause "nth" read-write command to yield an ABORTED_COMMAND (ack/nak timeout) which is a SAS transport error.
      0x20 - cause "nth" read-write command to yield an ABORTED_COMMAND (logical block guard check failed), nominally a DIF (Protection Information) error