ZFS: good read but poor write speeds
You have set ashift=0, which causes slow write speeds on hard drives that use 4096-byte sectors. Without a correct ashift, ZFS does not align writes to sector boundaries, so the disk has to read-modify-write a whole 4096-byte sector every time ZFS writes a 512-byte block.
Use ashift=12 to make ZFS align writes to 4096-byte sectors.
You also need to check that your partition is correctly aligned with respect to the actual hard disk in use.
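As a rough sketch of how to check and apply this (the device path is taken from the question; ashift cannot be changed on an existing pool, and destroying the pool erases its data, so back everything up first):

# Check which ashift the existing pool was created with
zdb -C tank | grep ashift

# Recreate the pool with 4096-byte alignment (DESTROYS all data in the pool)
zpool destroy tank
zpool create -o ashift=12 tank /dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d

# Check that the partition start is a multiple of 8 sectors (8 x 512 B = 4096 B)
parted /dev/sdb unit s print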
BayerSe
Updated on September 18, 2022

Comments
-
BayerSe almost 2 years
I'm in charge of downloading and processing large amounts of financial data. Each trading day, we have to add around 100GB.
To handle this amount of data, we rent a virtual server (3 cores, 12 GB ram) and a 30 TB block device from our university's data center.
On the virtual machine I installed Ubuntu 16.04 and ZFS on Linux. Then I created a ZFS pool on the 30 TB block device. The main reason for using ZFS is the compression feature, as the data is nicely compressible (~10%). Please don't be too hard on me for not following the golden rule that ZFS wants to see bare metal; I am forced to use the infrastructure as it is.
The reason for posting is that I am facing poor write speeds. The server can read data from the block device at about 50 MB/s, but writing is painfully slow at about 2-4 MB/s.
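A simple way to reproduce numbers like these is a sequential dd test against the mounted dataset (a sketch; the file path and sizes are placeholders, not the exact commands used):

# Sequential write; conv=fdatasync forces the data to disk before dd reports a rate
dd if=/dev/zero of=/tank/test/ddtest bs=1M count=4096 conv=fdatasync

# Sequential read of the same file (may be served from the ARC if run right after the write)
dd if=/tank/test/ddtest of=/dev/null bs=1M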
Here is some information on the pool and the dataset:
zdb
tank:
    version: 5000
    name: 'tank'
    state: 0
    txg: 872307
    pool_guid: 8319810251081423408
    errata: 0
    hostname: 'TAQ-Server'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 8319810251081423408
        children[0]:
            type: 'disk'
            id: 0
            guid: 13934768780705769781
            path: '/dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d-part1'
            phys_path: '/iscsi/[email protected]%3Asn.606f4c46fd740001,0:a'
            whole_disk: 1
            metaslab_array: 30
            metaslab_shift: 38
            ashift: 9
            asize: 34909494181888
            is_log: 0
            DTL: 126
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
zpool get all
NAME  PROPERTY                    VALUE                SOURCE
tank  size                        31,8T                -
tank  capacity                    33%                  -
tank  altroot                     -                    default
tank  health                      ONLINE               -
tank  guid                        8319810251081423408  default
tank  version                     -                    default
tank  bootfs                      -                    default
tank  delegation                  on                   default
tank  autoreplace                 off                  default
tank  cachefile                   -                    default
tank  failmode                    wait                 default
tank  listsnapshots               off                  default
tank  autoexpand                  off                  default
tank  dedupditto                  0                    default
tank  dedupratio                  1.00x                -
tank  free                        21,1T                -
tank  allocated                   10,6T                -
tank  readonly                    off                  -
tank  ashift                      0                    default
tank  comment                     -                    default
tank  expandsize                  255G                 -
tank  freeing                     0                    default
tank  fragmentation               12%                  -
tank  leaked                      0                    default
tank  feature@async_destroy       enabled              local
tank  feature@empty_bpobj         active               local
tank  feature@lz4_compress        active               local
tank  feature@spacemap_histogram  active               local
tank  feature@enabled_txg         active               local
tank  feature@hole_birth          active               local
tank  feature@extensible_dataset  enabled              local
tank  feature@embedded_data       active               local
tank  feature@bookmarks           enabled              local
tank  feature@filesystem_limits   enabled              local
tank  feature@large_blocks        enabled              local
zfs get all tank/test
NAME       PROPERTY               VALUE                 SOURCE
tank/test  type                   filesystem            -
tank/test  creation               Do Jul 21 10:04 2016  -
tank/test  used                   19K                   -
tank/test  available              17,0T                 -
tank/test  referenced             19K                   -
tank/test  compressratio          1.00x                 -
tank/test  mounted                yes                   -
tank/test  quota                  none                  default
tank/test  reservation            none                  default
tank/test  recordsize             128K                  default
tank/test  mountpoint             /tank/test            inherited from tank
tank/test  sharenfs               off                   default
tank/test  checksum               on                    default
tank/test  compression            off                   default
tank/test  atime                  off                   local
tank/test  devices                on                    default
tank/test  exec                   on                    default
tank/test  setuid                 on                    default
tank/test  readonly               off                   default
tank/test  zoned                  off                   default
tank/test  snapdir                hidden                default
tank/test  aclinherit             restricted            default
tank/test  canmount               on                    default
tank/test  xattr                  on                    default
tank/test  copies                 1                     default
tank/test  version                5                     -
tank/test  utf8only               off                   -
tank/test  normalization          none                  -
tank/test  casesensitivity        mixed                 -
tank/test  vscan                  off                   default
tank/test  nbmand                 off                   default
tank/test  sharesmb               off                   default
tank/test  refquota               none                  default
tank/test  refreservation         none                  default
tank/test  primarycache           all                   default
tank/test  secondarycache         all                   default
tank/test  usedbysnapshots        0                     -
tank/test  usedbydataset          19K                   -
tank/test  usedbychildren         0                     -
tank/test  usedbyrefreservation   0                     -
tank/test  logbias                latency               default
tank/test  dedup                  off                   default
tank/test  mlslabel               none                  default
tank/test  sync                   disabled              local
tank/test  refcompressratio       1.00x                 -
tank/test  written                19K                   -
tank/test  logicalused            9,50K                 -
tank/test  logicalreferenced      9,50K                 -
tank/test  filesystem_limit       none                  default
tank/test  snapshot_limit         none                  default
tank/test  filesystem_count       none                  default
tank/test  snapshot_count         none                  default
tank/test  snapdev                hidden                default
tank/test  acltype                off                   default
tank/test  context                none                  default
tank/test  fscontext              none                  default
tank/test  defcontext             none                  default
tank/test  rootcontext            none                  default
tank/test  relatime               off                   default
tank/test  redundant_metadata     all                   default
tank/test  overlay                off                   default
tank/test  com.sun:auto-snapshot  true                  inherited from tank
Can you give me a hint what I could do to improve the write speeds?
Update 1
After your comments about the storage system, I went to the IT department. They told me that the logical block size the vdev exports is actually 512 B.
This is the output of dmesg:

[    8.948835] sd 3:0:0:0: [sdb] 68717412272 512-byte logical blocks: (35.2 TB/32.0 TiB)
[    8.948839] sd 3:0:0:0: [sdb] 4096-byte physical blocks
[    8.950145] sd 3:0:0:0: [sdb] Write Protect is off
[    8.950149] sd 3:0:0:0: [sdb] Mode Sense: 43 00 10 08
[    8.950731] sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
[    8.985168] sdb: sdb1 sdb9
[    8.987957] sd 3:0:0:0: [sdb] Attached SCSI disk
So 512 B logical blocks but 4096 B physical block?!
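One way to confirm what the block device reports (a sketch, assuming the device is /dev/sdb as in the dmesg output above):

# Logical and physical sector sizes as seen by the kernel
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdb
cat /sys/block/sdb/queue/logical_block_size    # 512 here
cat /sys/block/sdb/queue/physical_block_size   # 4096 here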
They provided me with a temporary file system to which I can back up the data. I will then test the speed of the raw device before setting up the pool from scratch, and post an update.
Update 2
I destroyed the original pool. Then I ran some speed tests using dd; the results are OK, around 80 MB/s in both directions.

As a further check I created an ext4 partition on the device and copied a large zip file to it; the average write speed is around 40 MB/s. Not great, but enough for my purposes.
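The raw-device dd tests may have looked roughly like the following (a sketch; the write test is destructive and was only possible here because the data had been backed up and the pool destroyed):

# Sequential read from the raw block device, bypassing the page cache
dd if=/dev/sdb of=/dev/null bs=1M count=4096 iflag=direct

# Sequential write to the raw block device (DESTRUCTIVE: overwrites existing data)
dd if=/dev/zero of=/dev/sdb bs=1M count=4096 oflag=direct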
I continued by creating a new storage pool with the following commands:

zpool create -o ashift=12 tank /dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d
zfs set compression=on tank
zfs set atime=off tank
zfs create tank/test
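As a sanity check after recreating the pool, the effective ashift can be read back (a sketch using the pool name from above):

zpool get ashift tank         # should now report 12
zdb -C tank | grep ashift     # ashift of the vdev in the cached pool config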
Then I again copied a zip file to the newly created test file system. The write speed is still poor, just around 2-5 MB/s.

Any ideas?
Update 3
txg_sync is blocked when I copy the files. I opened a ticket on the GitHub repository of ZoL.
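For reference, the hung-task messages and ZFS transaction-group state that point at txg_sync can be inspected like this (a sketch; the kstat path assumes ZFS on Linux and the pool name tank):

# Kernel reports such as "task txg_sync:<pid> blocked for more than 120 seconds"
dmesg | grep -A 5 "blocked for more than"

# Per-pool transaction group statistics exposed by ZoL
cat /proc/spl/kstat/zfs/tank/txgs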
-
ewwhite almost 8 years
Do we know anything about how the storage device is connected to the VM? Also, you don't appear to have compression enabled.
-
BayerSe almost 8 years
They say it is 10 GbE. On the test file system I disabled compression on purpose so as not to be CPU-bound. However, the results are approximately the same whether compression is enabled or not.
-
user121391 almost 8 years
Network throughput would only be of concern if you did not get more than 110 MB/s, which is far beyond your current speed. You need to ask them about the kind of storage subsystem, the maximum, average and minimum expected performance for random and sequential access, and the block size it is aligned on.
-
Andrew Henle almost 8 years
What's the raw disk write performance? Can you test that? Because if the raw disk can't meet your performance requirements, there's no file system in the universe that will save you.
-
BayerSe almost 8 years
@AndrewHenle In the IT department they tested the read speed of the raw disk using dd. It is about 90 MB/s (as opposed to about 40-50 MB/s on the file system). I'll add write speed results.
-
Andrew Henle almost 8 years
@BayerSe dd testing will test sequential read/write performance. Sequential operations like that are often coalesced into large blocks to/from the actual disk(s) via caching and the use of either read-ahead or write-behind. File system access can be extremely random and in small blocks, which doesn't lend itself to caching or read-ahead. A disk system can give good large-block sequential performance while still having abysmal random, small-block performance - especially write performance. dd testing is an easy start, because if it's poor, everything else will also be poor.
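To complement sequential dd figures with a random small-block write test, a tool such as fio could be used (a sketch; fio is not mentioned in the thread, and the file name and sizes are placeholders):

# 4 KiB random writes to a file on the ZFS dataset; end_fsync flushes the data to disk
# before the result is reported (O_DIRECT is avoided since ZoL of that era did not support it)
fio --name=randwrite --filename=/tank/test/fio.dat --rw=randwrite \
    --bs=4k --size=1G --ioengine=libaio --iodepth=4 --numjobs=1 --end_fsync=1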
-
user121391 almost 8 years
"So 512 B logical blocks but 4096 B physical block?!" That is (was) not that uncommon - newer disks used 4k sectors internally but presented 512-byte sectors to the operating system, known as "4k/512e" ("4k emulated"), as opposed to the older 512/512 ("512 native") or the newer 4k/4k ("4k native").
-
not-a-user almost 7 years
Any progress on this? I have the same issue on arch/armv7. Somehow it seems neither CPU bound (the frequency governor does not scale up) nor IO bound (the same crappy 4 MB/s write speed for both an hdd and an emmc-backed loop device). Is your Ubuntu guest 32 bits or 64 bits? (What does uname -a say?)
-
BayerSe almost 7 years
@not-a-user Linux TAQ-Server 4.4.0-92-generic #115-Ubuntu SMP Thu Aug 10 09:04:33 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux is the output. I never directly solved the problem, but after re-creating the block device and applying some settings discussed here: list.zfsonlinux.org/pipermail/zfs-discuss/2016-July/025979.html, the problem vanished.
-
BayerSe almost 7 years
@not-a-user This is what I used in the end: gist.github.com/BayerSe/393b4664d42b85ade63660fb1f357482
-
Rob Pearson about 6 years
For 30 TB, ZFS needs far more RAM than 12 GB. You should be at 48 GB minimum. The math: 8 GB baseline + 30 GB (1 GB per managed TB) = 38 GB, and since 38 GB isn't a sensible size, your next stop is 48 GB. serverfault.com/questions/569354/…
-
ewwhite almost 8 years
The storage is abstracted. It's probably an export from a SAN. The ashift may not make a difference here.
-
BayerSe almost 8 years
I'm confused. The zdb command says ashift=0, zpool get all says it's 9. What is the correct value? And what could I ask the IT guys to figure out whether ashift=12 would be the correct value?
-
Tero Kilkanen almost 8 years
Actually zdb tells that the iSCSI device has ashift=9 and zpool get all says it is 0. I don't actually know what the minimum write block is when ashift is 0. You can try both ashift=9 and ashift=12. You need to ask what the minimum block size is for the storage system that doesn't trigger read-modify-write during writes.
-
Andrew Henle almost 8 years
No real answers are possible without detailed knowledge of what that iSCSI disk actually is. For all we know it's a 37-disk RAID5 array of mixed 5400 and 7200 RPM SATA drives with a per-disk segment size of 1 MB that's been partitioned into 137 LUNs that are all utterly misaligned. If something like that is true (and I've seen incompetent SAN setups like that all too often), the OP's task is likely hopeless. If the system is still in test and the ZFS file system can be safely destroyed, raw disk write performance using something like Bonnie or even just dd would be a good data point to have.