Reduce bad block retry / wait times in Ubuntu

linux hard-disk performance io badblocks

5,763

Solution 1

I've not used this tunable before but you probably want to adjust the eh_timeout (error handling timeout) for the drive in question:

[root@localhost device]# cat /sys/block/sda/device/eh_timeout
10
[root@localhost device]#

The above shows sda set to 10 seconds. From Red Hat Knowledgebase:

In certain storage configurations (for example, configurations with many LUNs), the SCSI error handling code can spend a large amount of time issuing commands such as TEST UNIT READY to unresponsive storage devices. A new sysfs parameter, eh_timeout, has been added to the SCSI device object, which allows configuration of the timeout value for TEST UNIT READY and REQUEST SENSE commands used by the SCSI error handling code. This decreases the amount of time spent checking these unresponsive devices. The default value of eh_timeout is 10 seconds, which was the timeout value used prior to adding this functionality.
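For scripting this, here is a minimal Python sketch that reads and sets these values. It assumes the attributes live at /sys/block/<dev>/device/timeout and /sys/block/<dev>/device/eh_timeout, as in the transcript above; the function names and the sysfs_root parameter are made up for illustration. Writing needs root, and the value is not expected to survive a reboot or a re-plug of the drive.

```python
from pathlib import Path


def timeout_path(dev, name="timeout", sysfs_root="/sys"):
    """Return the sysfs path of a per-device SCSI timeout attribute."""
    if name not in ("timeout", "eh_timeout"):
        raise ValueError(f"unexpected attribute: {name}")
    return Path(sysfs_root) / "block" / dev / "device" / name


def read_timeout(dev, name="timeout", sysfs_root="/sys"):
    """Read the current value, in seconds."""
    return int(timeout_path(dev, name, sysfs_root).read_text().strip())


def write_timeout(dev, seconds, name="timeout", sysfs_root="/sys"):
    """Write a new value, in seconds (requires root on a real system)."""
    timeout_path(dev, name, sysfs_root).write_text(f"{seconds}\n")
```

For example, write_timeout("sda", 5) would lower the command timeout for sda to 5 seconds, and write_timeout("sda", 5, name="eh_timeout") would do the same for the error handling timeout on kernels that expose it.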

Solution 2

Monitor /sys/block/<dev>/stat for the devices you're interested in and compare the 10th field (io_ticks) between two samples.

e.g., percent_busy = (io_ticks - prev_ticks) / seconds_deltatime / 10

This is the percentage of elapsed time during which the disk had I/O in progress (io_ticks is reported in milliseconds, hence the division by 10).
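The calculation above can be sketched in Python as follows, assuming the subtraction is meant to happen before the divisions and that io_ticks is the 10th whitespace-separated field of the stat file, in milliseconds. The function names, the 5-second default interval, and the sysfs_root parameter are illustrative choices, not part of any existing tool.

```python
import time
from pathlib import Path

IO_TICKS_FIELD = 9  # zero-based index of the 10th field, io_ticks (milliseconds)


def parse_io_ticks(stat_text):
    """Extract io_ticks from the contents of /sys/block/<dev>/stat."""
    return int(stat_text.split()[IO_TICKS_FIELD])


def busy_percent(prev_ticks, io_ticks, seconds_deltatime):
    """Milliseconds busy per second elapsed, divided by 10, gives a percentage."""
    return (io_ticks - prev_ticks) / seconds_deltatime / 10


def sample_busy_percent(dev, interval=5.0, sysfs_root="/sys"):
    """Take two samples `interval` seconds apart and return the busy percentage."""
    stat = Path(sysfs_root) / "block" / dev / "stat"
    prev = parse_io_ticks(stat.read_text())
    start = time.monotonic()
    time.sleep(interval)
    cur = parse_io_ticks(stat.read_text())
    return busy_percent(prev, cur, time.monotonic() - start)
```

Calling sample_busy_percent("sdi") in a loop from the copy script would give one figure per drive per interval.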

Close to 100% would be worth checking of course, or else get really clever and compare it to the average of all your disks and pick on any disk(s) above the mean.
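The "above the mean" comparison could look roughly like this in Python. The input is assumed to be a mapping from device name to busy percentage, however you collected it; the function name and the optional margin parameter are hypothetical additions to avoid flagging disks that sit only slightly above average.

```python
from statistics import mean


def above_mean(busy_by_dev, margin=0.0):
    """Return the devices whose busy percentage exceeds the mean by more than `margin`."""
    if not busy_by_dev:
        return []
    avg = mean(busy_by_dev.values())
    return sorted(dev for dev, pct in busy_by_dev.items() if pct > avg + margin)
```

With a few healthy drives and one struggling drive in the same batch, the struggling one pulls the mean up only a little and still stands well above it.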

See the block layer statistics documentation.

Otherwise, use something like Munin and graph it. You can get Munin to alert if utilization goes above a threshold, e.g., 90% or whatever your graphing shows is a good alert figure.

For example, see these two Munin graphs, which show that /dev/sdi needs looking at. In this example, if /dev/sdi is part of an array, the whole array would suffer because of it.

[Graph: Disk utilization per device - by day]

[Graph: Disk utilization per device - by week]

If you look at the week graph you'll see that /dev/sdc might be slow as well.

I should add that /dev/sdi above isn't broken; it's just a slow disk (actually a green disk that somebody added to an array of enterprise-grade SATA disks), which slowed the array down. An actual failed disk would stick out like a sore thumb.

In summary, I'd probably go with a script if I had the time, but Munin if I just wanted a quick solution and connecting to the server was easy.


Ryan Sorensen

Updated on September 18, 2022

Comments

Ryan Sorensen over 1 year

How can I reduce the IO wait time and retry times so that the OS doesn't continually try to write to a failing drive?

I have a system that I use to copy demo content, which gets loaned out to customers, onto regular SATA desktop hard drives. We connect many drives at once via SAS and copy content to them using a script.

Because the drives are loaned out, some occasionally come back damaged without my knowing it, so the next time such a drive is reused in a copy operation, it slows down the other drives as the system retries I/O to it. Sometimes it can take hours before I notice the bad drive and remove it. After the drive is removed, the rest of the drives begin writing at normal speed.

I do not care about recovering the bad drives. I just need to weed them out so they don't slow everything else down.

I am also researching badblocks and smartmontools and considering writing a pre-check on the drives before I start writing.

OS: Ubuntu Linux (12.04 LTS)
- Deer Hunter almost 10 years
  
  What's wrong with checking SMART data through udisks/smartctl? A classical XY problem here, methinks.
- Ryan Sorensen almost 10 years
  
  Thanks, I will research smartctl more. In my experience, if the bad sectors happened during the last shipment, the SMART status shows that the drive is still good, and it performs fine until some random point during the copy, and then slows down to a crawl, also affecting other drives until it is removed.
- imz -- Ivan Zakharyaschev almost 10 years
  
  The question hasn't received a direct answer, so we don't know whether it's possible in Linux: How can I reduce the IO wait time and retry times?
- goldilocks almost 10 years
  
  @imz--IvanZakharyaschev unix.stackexchange.com/a/147304/25985 However, the kernel does log these errors, so if all you want to do is catch a failing disk before it becomes more trouble, you could scan the system logs at regular intervals.
- imz -- Ivan Zakharyaschev almost 10 years
  
  @gol What if I want to catch it faster? Without waiting God knows how much time before the IO operation unblocks reporting an error? (Actually, I'm attempting to save the data from a disk with errors, but my problem is similar: running into these "erroneous" sectors causes huge delays. ... Perhaps I could also follow the advice, and invent a way to feed the info from SMART test to ddrescue so that it doesn't even touch the sectors reported by SMART.)
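goldilocks's suggestion of scanning the system logs at regular intervals could be sketched like this in Python. It assumes the kernel reports block errors with text of the form "I/O error, dev sdX, sector N", which is what such messages commonly look like but may vary between kernel versions; the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed message shape: "... I/O error, dev sdi, sector 123456 ..."
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def count_io_errors(log_lines):
    """Count kernel I/O error messages per device name."""
    counts = Counter()
    for line in log_lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

The lines could come from the output of dmesg or from the kernel log file, and any device with a non-zero count would be a candidate for removal from the batch.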
imz -- Ivan Zakharyaschev over 9 years

Thanks! The information about io statistics in Linux is really new and seems to be useful (to me) in such situations.
Ryan Sorensen over 9 years

I am checking this now. Ubuntu does not have an eh_timeout, but has a timeout file which may be the same thing. The default Ubuntu value appears to be 30 sec. Will reduce it to 5 seconds and report back.
Bratchley over 9 years

Out of curiosity, what was your result?
Ryan Sorensen over 9 years

Setting the timeout flag on 12.04 did not appear to do anything. I am planning to upgrade a test system to 14.04 this weekend because it does have eh_timeout (and also timeout).
Nat almost 9 years

@RyanSorensen, so did you get a chance to see whether this parameter ever works?
GuitarPicker over 6 years

I wasn't able to modify eh_timeout but I was able to change timeout to accomplish the task at hand.