Tuning ZFS scrubbing, 141KB/s running for 15 days
Solution 1
ZFS scrub operations operate on some fairly brain-dead principles. Most notably, a scrub only spends time scrubbing when there's nothing else going on. If you poke a pool with even a little data access on a fairly constant basis, the scrub will effectively starve itself and do nearly nothing.
Tunables to explore, with my quick notes on what each does (I last looked into this a while ago, though):
- zfs_scan_idle - if user I/O occurs within this many clock ticks, delay scrub I/O by zfs_scrub_delay clock ticks
- zfs_scrub_delay - how many clock ticks to delay scrub operation if triggered by zfs_scan_idle
- zfs_top_maxinflight - maximum number of scrub I/O per top-level vdev
- zfs_scrub_limit - maximum number of scrub I/O per leaf vdev
- zfs_scan_min_time_ms - minimum ms to spend per txg on scrub operations
- zfs_no_scrub_io - no notes
- zfs_no_scrub_prefetch - no notes, name seems to imply not causing prefetch on scrub ops
All of these are changeable on the fly using mdb: pipe "echo [tunable]/W0t[number]" into mdb -kw to change a value, and pipe "echo [tunable]/D" into mdb -k to view the current setting (which I recommend doing before changing anything).
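As a concrete sketch (the tunable name and value here are just examples; the commands must be run as root on the NexentaStor/illumos box itself):

```shell
#!/bin/sh
# View-then-set pattern for a live ZFS tunable via the kernel debugger.
# /D prints the value in decimal; /W writes it, with 0t marking decimal radix.
# We only print the command lines here (dry run); run them as root on the
# live system to actually apply them.
TUNABLE=zfs_scan_idle   # example tunable from the list above
VALUE=0                 # example value

echo "echo ${TUNABLE}/D | mdb -k"               # view the current setting first
echo "echo ${TUNABLE}/W0t${VALUE} | mdb -kw"    # then write the new value
```

Note the -w flag on mdb for the write: without it, mdb opens the kernel read-only.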
So in theory, and in general practice: drop zfs_scan_idle down to 10 (or to 1, or to 0 if the code supports that, which would need checking) and zfs_scrub_delay down to 1 (or to 0, if supported). If your txg_synctime_ms setting is 5000 or more, maybe also push zfs_scan_min_time_ms up a bit. With those changes, ZFS should become a lot more aggressive about actually doing scrub operations even with some level of user I/O occurring.
In your specific case, the %b and asvc_t reported imply a very, very random read workload going on (spinning disks should do better than that if the scrub were truly sequential), and you've already done the "easy" stuff as explained above. So, first I'd turn on zfs_no_scrub_prefetch to disable prefetch on scrub operations, just to see if that helps. If no joy, then depending on the version of Nexenta you're on, you may be running 30/5, 5/1 or 10/5 (that's shorthand we use for the settings of zfs_txg_timeout & zfs_txg_synctime_ms/1000). Change zfs_txg_timeout to 10 and zfs_txg_synctime_ms to 5000, then try upping zfs_scan_min_time_ms to 3000 or 4000. This tells ZFS it can spend a lot longer per txg on scrubs, compared to the default settings on older NexentaStor installs that use 5/1 as the defaults. But be careful: this may starve normal I/O if the delay settings have also been set to basically 0!
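Assuming mdb access, the sequence above would look something like this (the values are the ones suggested here; tunable names can vary by Nexenta/illumos build, so verify each with /D before writing):

```shell
#!/bin/sh
# The suggested tuning sequence as mdb one-liners, printed rather than
# executed (run them as root on the live box, after checking each tunable
# exists with e.g. 'echo zfs_txg_timeout/D | mdb -k').
cat <<'EOF'
echo zfs_no_scrub_prefetch/W0t1 | mdb -kw
echo zfs_txg_timeout/W0t10 | mdb -kw
echo zfs_txg_synctime_ms/W0t5000 | mdb -kw
echo zfs_scan_min_time_ms/W0t3000 | mdb -kw
EOF
```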
Hope this helps. Good luck!
Solution 2
I suspect hardware...
Why would you let this run for 15 days? That's not normal. Stop the scrub (zpool scrub -s tank) and check the system out.
- Which controllers are you using?
- Is this the first scrub you've ever run on this pool?
- Was there a problem that prompted you to run the scrub in the first place?
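A minimal triage pass for those questions might look like this (printed as a dry run; `tank` and the disk path are taken from the question and purely illustrative):

```shell
#!/bin/sh
# Dry-run sketch of the hardware checks suggested above. Each command is
# printed, not executed; run them by hand on the storage box as root.
for cmd in \
    "zpool scrub -s tank" \
    "iostat -En" \
    "smartctl -A /dev/rdsk/c7t5000C500414FB2CFd0"; do
  echo "check: $cmd"
done
```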
Updated on September 18, 2022

Comments
-
SiXoS almost 2 years
A pretty basic system running mirror+stripe on 7.2k rpm SAS disks, not particularly loaded. No dedup, compression on all datasets. Scrub has been running for 15 days at the speed of a dead snail. Is there some optimization that needs to be done, or can it be due to some faulty hardware?
- Dell R510 with MD1200 enclosure.
- 2x Xeon E5620
- 48GB
- NexentaStor 3.1.3, Community edition
Some information:
  scan: scrub in progress since Mon Apr  1 19:00:05 2013
    171G scanned out of 747G at 141K/s, 1187h40m to go
    0 repaired, 22.84% done
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c7t5000C500414FB2CFd0  ONLINE       0     0     0
            c7t5000C500414FCA57d0  ONLINE       0     0     0
          mirror-1                 ONLINE       0     0     0
            c7t5000C500415C3B1Bd0  ONLINE       0     0     0
            c7t5000C500415C5E4Fd0  ONLINE       0     0     0
          mirror-2                 ONLINE       0     0     0
            c7t5000C500415DC797d0  ONLINE       0     0     0
            c7t5000C500415DC933d0  ONLINE       0     0     0
        logs
          c7t5000A7203006D81Ed0    ONLINE       0     0     0
        cache
          c7t5000A72030068545d0    ONLINE       0     0     0

# iostat -en
  ---- errors ---
  s/w h/w trn tot device
    0 8887   0 8887 c2t0d0
    0    0   0    0 c0t395301D6B0C8069Ad0
    0    0   0    0 c7t5000C500415DC933d0
    0    0   0    0 c7t5000A72030068545d0
    0    0   0    0 c7t5000C500415DC797d0
    0    0   0    0 c7t5000C500414FCA57d0
    0    0   0    0 c7t5000C500415C3B1Bd0
    0    0   0    0 c7t5000C500415C5E4Fd0
    0    0   0    0 c7t5000C500414FB2CFd0
    0    0   0    0 c7t5000A7203006D81Ed0
The spa_last_io changes every time I run this:
# echo "::walk spa | ::print spa_t spa_name spa_last_io spa_scrub_inflight" | mdb -k
spa_name = [ "syspool" ]
spa_last_io = 0x25661402
spa_scrub_inflight = 0
spa_name = [ "tank" ]
spa_last_io = 0x25661f84
spa_scrub_inflight = 0x21
Every 5 seconds, about 20-25 MB/s is written. Between those writes there are basically no reads or writes.
                             capacity     operations    bandwidth      latency
pool                        alloc  free   read  write   read  write   read  write
-------------------------  -----  -----  -----  -----  -----  -----  -----  -----
syspool                     427G   501G      0      0      0      0   0.00   0.00
  c0t395301D6B0C8069Ad0s0   427G   501G      0      0      0      0   0.00   0.00
-------------------------  -----  -----  -----  -----  -----  -----  -----  -----
tank                        903G  1.84T    810  5.21K  1.50M  20.8M   9.42   4.71
  mirror                    301G   627G     22  1.00K  53.0K  3.96M   8.96   3.93
    c7t5000C500414FB2CFd0      -      -     20    244  50.1K  3.97M   6.70   1.14
    c7t5000C500414FCA57d0      -      -     19    242  48.2K  3.97M   7.60   1.12
  mirror                    301G   627G     25   1016  46.8K  4.10M  16.11   5.28
    c7t5000C500415C3B1Bd0       -      -     21    257  41.6K  4.11M   4.63   1.24
    c7t5000C500415C5E4Fd0       -      -     21    255  43.0K  4.11M  16.54   1.15
  mirror                    301G   627G     62    754   119K  3.03M  19.72   3.78
    c7t5000C500415DC797d0       -      -     57    219   114K  3.03M   9.99   1.15
    c7t5000C500415DC933d0       -      -     56    220   119K  3.03M  13.20   1.22
  c7t5000A7203006D81Ed0     260K  46.5G      0      0      0      0   0.00   0.00
cache                          -      -      -      -      -      -      -      -
  c7t5000A72030068545d0    93.1G     8M      0      0      0      0   0.00   0.00
-------------------------  -----  -----  -----  -----  -----  -----  -----  -----
Are the iostats telling me that I am spending more time waiting for disk than I should have to? Specifically the %b column:
# iostat -xe
device   r/s   w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b  s/w h/w trn tot
sd3      5.1  43.9   20.6  643.8   0.0   0.1    2.9   0   5    0   0   0   0
sd4      9.4   1.8  141.1  169.6   0.0   0.0    0.5   0   0    0   0   0   0
sd5      3.1  43.8   15.8  643.8   0.0   0.1    1.4   0   3    0   0   0   0
sd6      5.2  38.1   14.3  494.4   0.0   0.1    3.0   0   7    0   0   0   0
sd7      4.2  40.2   11.1  623.2   0.0   0.1    2.7   0   7    0   0   0   0
sd8      3.6  44.3    9.7  623.2   0.0   0.1    1.5   0   4    0   0   0   0
sd9      2.9  37.4    7.0  494.4   0.0   0.1    1.3   0   2    0   0   0   0
sd10     0.7   0.4    3.4    0.0   0.0   0.0    0.0   0   0    0   0   0   0
Is the latency a tad on the high side?
# zpool iostat 10 10
               capacity     operations    bandwidth      latency
pool        alloc   free   read  write   read  write   read  write
tank         909G  1.83T     86  2.82K   208K  12.7M  22.68  13.63
----------  -----  -----  -----  -----  -----  -----  -----  -----
tank         909G  1.83T     29    857  42.4K  3.50M  17.86   4.47
----------  -----  -----  -----  -----  -----  -----  -----  -----
tank         909G  1.83T     30    947  46.1K  3.54M  15.55   5.67
Applied some tweaking which made little difference. zfs_top_maxinflight set to 127, zfs_scrub_delay to 0, and zfs_scan_idle to 0.
# echo zfs_top_maxinflight/D | mdb -k
zfs_top_maxinflight:
zfs_top_maxinflight:            127

# echo zfs_scrub_delay/D | mdb -k
zfs_scrub_delay:
zfs_scrub_delay:                0

# echo zfs_scan_idle/D | mdb -k
zfs_scan_idle:
zfs_scan_idle:                  0

  scan: scrub in progress since Wed Apr 17 20:47:23 2013
    1.85G scanned out of 918G at 1.14M/s, 229h36m to go
    0 repaired, 0.20% done
Pre mdb tweak, notice the rather high %b column:
$ iostat -nx -M 5
    r/s    w/s   Mr/s   Mw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
    0.0    0.0    0.0    0.0   0.0   0.0     0.0     0.0   0   0  c2t0d0
    0.0    0.0    0.0    0.0   0.0   0.0     0.0     0.0   0   0  c0t395301D6B0C8069Ad0
   35.2   44.2    0.3    0.7   0.0   0.4     0.0     5.3   0  32  c7t5000C500415DC933d0
   19.8    3.2    0.2    0.0   0.0   0.0     0.0     0.1   0   0  c7t5000A72030068545d0
   31.2   46.2    0.2    0.7   0.0   0.3     0.0     4.4   0  27  c7t5000C500415DC797d0
   30.6   46.8    0.2    0.8   0.0   0.4     0.0     4.6   0  28  c7t5000C500414FCA57d0
   37.6   53.0    0.3    0.8   0.0   0.4     0.0     4.7   0  33  c7t5000C500415C3B1Bd0
   37.6   53.6    0.3    0.8   0.0   0.5     0.0     5.6   0  39  c7t5000C500415C5E4Fd0
   33.2   46.8    0.3    0.8   0.0   0.5     0.0     6.1   0  33  c7t5000C500414FB2CFd0
    0.0    0.0    0.0    0.0   0.0   0.0     0.0     0.0   0   0  c7t5000A7203006D81Ed0
Post mdb tweak, notice the %b column: the data disks now spend roughly 80-88% of their time busy:
$ iostat -nx -M 5
    r/s    w/s   Mr/s   Mw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
    0.0    0.0    0.0    0.0   0.0   0.0     0.0     0.0   0   0  c2t0d0
    0.2   27.2    0.0    0.3   0.0   1.0     0.0    35.4   0  18  c0t395301D6B0C8069Ad0
  129.6   20.2    0.9    0.4   0.0   2.9     0.0    19.5   0  85  c7t5000C500415DC933d0
   48.4    4.0    0.4    0.0   0.0   0.0     0.0     0.1   0   1  c7t5000A72030068545d0
  130.4   19.8    0.9    0.4   0.0   3.0     0.0    20.2   0  84  c7t5000C500415DC797d0
  125.8   25.8    0.9    0.5   0.0   2.9     0.0    19.2   0  80  c7t5000C500414FCA57d0
  131.2   24.2    0.9    0.5   0.0   3.1     0.0    20.3   0  83  c7t5000C500415C3B1Bd0
  130.6   25.8    0.9    0.5   0.0   3.5     0.0    22.5   0  88  c7t5000C500415C5E4Fd0
  126.8   28.0    0.9    0.5   0.0   2.8     0.0    18.0   0  79  c7t5000C500414FB2CFd0
    0.2    0.0    0.0    0.0   0.0   0.0     0.0     0.1   0   0  c7t5000A7203006D81Ed0
-
Admin about 11 years: What does running iostat -XnE | grep Errors repeatedly say? Does any error count increment?
-
SiXoS about 11 years: Zero in all columns
-
Philip about 11 years: What does smartctl -A /dev/disk say about each drive? (Might have to install smartctl; not sure if it comes with the base install.)
-
SiXoS about 11 years: Nothing of interest, besides "Non-medium error count: 8071" on one disk. All disks sit in a JBOD (Dell MD1200) on the same (single) SAS lane
-
SiXoS about 11 years: LSI SAS9200-8e (IT firmware). Not the first scrub. No, no real issues (but I have been questioning the sequential read/write performance for a while).
-
SiXoS about 11 years: I don't see any indicator that any resource is scarce.
-
Nex7 about 11 years: I should also read the entire question before responding - and stop browsing ServerFault while on conference calls. :)
-
Nex7 about 11 years: The %b and asvc_t reported imply a very, very random read workload (spinning disks should do better than that if it were truly sequential). First I'd turn on zfs_no_scrub_prefetch, to disable prefetch on scrub operations, just to see if that helps. If no joy, depending on the version of Nexenta you're on, you may be running 30/5, 5/1 or 10/5 (zfs_txg_timeout & zfs_txg_synctime_ms/1000). Change zfs_txg_timeout to 10 and zfs_txg_synctime_ms to 5000, then try upping zfs_scan_min_time_ms to 3000 or 4000. This tells ZFS it can spend a lot longer on scrubs, but may starve normal I/O!
-
SiXoS about 11 years: I think you provide very valuable input, but it would be much more helpful if you could add the comments into one good answer.
-
Nex7 about 11 years: This is pretty much completely false. CPU & RAM have effectively zero impact on scrub operations (assuming there's any free at all). Having lots of free RAM & CPU will not 'speed up' scrub operations. Scrub is limited by watching incoming I/O to the pool, not by checking for 'available system downtime', whatever that is.
-
Nex7 about 11 years: Fair enough - added it all into the answer.
-
SiXoS about 11 years: With zfs_top_maxinflight=127, zfs_scan_idle=0 and zfs_scrub_delay=0 I effectively starved all other I/O, but got the scrub to reach 18MB/s so that it could finish in a day or so. Next time I'll be sure to pay close attention to your answer, which contains many parameters I hadn't even heard of.
-
Nex7 about 11 years: More tuning may have helped, but not necessarily. It is important to note that a ZFS scrub walks the data structure, NOT sector by sector on the disks. Which is to say, depending on how the ZFS data structure looks on your disks, a scrub operation may look incredibly random: your disks may be capable of >100 MB/s of sequential read, but a completely random read workload is another story entirely. Average block size also matters here.
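A quick back-of-envelope makes the point (the IOPS and block-size figures are assumptions for illustration, not measurements from this system):

```shell
#!/bin/sh
# Random-read scrub throughput is roughly IOPS x average block size,
# regardless of how fast the disk can read sequentially.
IOPS=120        # assumed random-read IOPS for a 7.2k rpm SAS disk
BLOCK_KB=8      # assumed average ZFS block size in KB
echo "$((IOPS * BLOCK_KB)) KB/s per disk"
```

which lands in the hundreds of KB/s per disk, versus >100 MB/s sequential.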
-
lschweiss over 6 years: Scrubs are important to run periodically. Waiting until you have a problem to run a scrub is asking for that problem to blow up into data loss. Scrubs are there to catch silent data corruption (bitrot). A slow-running scrub is not a sign of a system problem, just of a pool that is kept busy enough not to let the scrub accelerate.