ZFS: good read but poor write speeds


You have set ashift=0, which causes slow write speeds on hard drives that use 4096-byte sectors. Without a suitable ashift, ZFS does not align its writes to the physical sector boundaries, so the disk has to read-modify-write a whole 4096-byte sector whenever ZFS writes a 512-byte block.

Use ashift=12 to make ZFS align writes to 4096-byte sectors. Note that ashift is fixed when a vdev is created, so the pool has to be recreated to change it.
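
For example, a minimal sketch (substitute your own pool name and device path for tank and <your-device>):

    # create the pool with 4096-byte alignment
    zpool create -o ashift=12 tank /dev/disk/by-id/<your-device>

    # verify the ashift value actually in use
    zdb -C tank | grep ashift
    zpool get ashift tank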

You also need to check that your partition is aligned correctly with respect to the physical sectors of the actual hard disk in use.
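
One quick check, as a sketch (sdb/sdb1 are taken from the dmesg output quoted further down and may differ on your system):

    # the partition's start sector (in 512-byte units) should be a multiple of 8, i.e. 4096-byte aligned
    cat /sys/block/sdb/sdb1/start

    # or let parted verify the alignment of partition 1
    parted /dev/sdb align-check optimal 1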


Comments

  • BayerSe
    BayerSe almost 2 years

    I'm in charge of downloading and processing large amounts of financial data. Each trading day, we have to add around 100GB.

    To handle this amount of data, we rent a virtual server (3 cores, 12 GB RAM) and a 30 TB block device from our university's data center.

    On the virtual machine I installed Ubuntu 16.04 and ZFS on Linux. Then, I created a ZFS pool on the 30 TB block device. The main reason for using ZFS is the compression feature, as the data is nicely compressible (~10%). Please don't be too hard on me for not following the golden rule that ZFS wants to see bare metal; I am forced to use the infrastructure as it is.

    The reason for posting is that I am facing poor write speeds. The server can read data from the block device at about 50 MB/s, but writing is painfully slow at about 2-4 MB/s.

    Here is some information on the pool and the dataset:

    zdb

    tank:
    version: 5000
    name: 'tank'
    state: 0
    txg: 872307
    pool_guid: 8319810251081423408
    errata: 0
    hostname: 'TAQ-Server'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 8319810251081423408
        children[0]:
            type: 'disk'
            id: 0
            guid: 13934768780705769781
            path: '/dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d-part1'
            phys_path: '/iscsi/[email protected]%3Asn.606f4c46fd740001,0:a'
            whole_disk: 1
            metaslab_array: 30
            metaslab_shift: 38
            ashift: 9
            asize: 34909494181888
            is_log: 0
            DTL: 126
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    

    zpool get all

    NAME  PROPERTY                    VALUE                       SOURCE
    tank  size                        31,8T                       -
    tank  capacity                    33%                         -
    tank  altroot                     -                           default
    tank  health                      ONLINE                      -
    tank  guid                        8319810251081423408         default
    tank  version                     -                           default
    tank  bootfs                      -                           default
    tank  delegation                  on                          default
    tank  autoreplace                 off                         default
    tank  cachefile                   -                           default
    tank  failmode                    wait                        default
    tank  listsnapshots               off                         default
    tank  autoexpand                  off                         default
    tank  dedupditto                  0                           default
    tank  dedupratio                  1.00x                       -
    tank  free                        21,1T                       -
    tank  allocated                   10,6T                       -
    tank  readonly                    off                         -
    tank  ashift                      0                           default
    tank  comment                     -                           default
    tank  expandsize                  255G                        -
    tank  freeing                     0                           default
    tank  fragmentation               12%                         -
    tank  leaked                      0                           default
    tank  feature@async_destroy       enabled                     local
    tank  feature@empty_bpobj         active                      local
    tank  feature@lz4_compress        active                      local
    tank  feature@spacemap_histogram  active                      local
    tank  feature@enabled_txg         active                      local
    tank  feature@hole_birth          active                      local
    tank  feature@extensible_dataset  enabled                     local
    tank  feature@embedded_data       active                      local
    tank  feature@bookmarks           enabled                     local
    tank  feature@filesystem_limits   enabled                     local
    tank  feature@large_blocks        enabled                     local
    

    zfs get all tank/test

    NAME       PROPERTY               VALUE                  SOURCE
    tank/test  type                   filesystem             -
    tank/test  creation               Do Jul 21 10:04 2016   -
    tank/test  used                   19K                    -
    tank/test  available              17,0T                  -
    tank/test  referenced             19K                    -
    tank/test  compressratio          1.00x                  -
    tank/test  mounted                yes                    -
    tank/test  quota                  none                   default
    tank/test  reservation            none                   default
    tank/test  recordsize             128K                   default
    tank/test  mountpoint             /tank/test             inherited from tank
    tank/test  sharenfs               off                    default
    tank/test  checksum               on                     default
    tank/test  compression            off                    default
    tank/test  atime                  off                    local
    tank/test  devices                on                     default
    tank/test  exec                   on                     default
    tank/test  setuid                 on                     default
    tank/test  readonly               off                    default
    tank/test  zoned                  off                    default
    tank/test  snapdir                hidden                 default
    tank/test  aclinherit             restricted             default
    tank/test  canmount               on                     default
    tank/test  xattr                  on                     default
    tank/test  copies                 1                      default
    tank/test  version                5                      -
    tank/test  utf8only               off                    -
    tank/test  normalization          none                   -
    tank/test  casesensitivity        mixed                  -
    tank/test  vscan                  off                    default
    tank/test  nbmand                 off                    default
    tank/test  sharesmb               off                    default
    tank/test  refquota               none                   default
    tank/test  refreservation         none                   default
    tank/test  primarycache           all                    default
    tank/test  secondarycache         all                    default
    tank/test  usedbysnapshots        0                      -
    tank/test  usedbydataset          19K                    -
    tank/test  usedbychildren         0                      -
    tank/test  usedbyrefreservation   0                      -
    tank/test  logbias                latency                default
    tank/test  dedup                  off                    default
    tank/test  mlslabel               none                   default
    tank/test  sync                   disabled               local
    tank/test  refcompressratio       1.00x                  -
    tank/test  written                19K                    -
    tank/test  logicalused            9,50K                  -
    tank/test  logicalreferenced      9,50K                  -
    tank/test  filesystem_limit       none                   default
    tank/test  snapshot_limit         none                   default
    tank/test  filesystem_count       none                   default
    tank/test  snapshot_count         none                   default
    tank/test  snapdev                hidden                 default
    tank/test  acltype                off                    default
    tank/test  context                none                   default
    tank/test  fscontext              none                   default
    tank/test  defcontext             none                   default
    tank/test  rootcontext            none                   default
    tank/test  relatime               off                    default
    tank/test  redundant_metadata     all                    default
    tank/test  overlay                off                    default
    tank/test  com.sun:auto-snapshot  true                   inherited from tank
    

    Can you give me a hint what I could do to improve the write speeds?

    Update 1

    After your comments about the storage system, I went to the IT department. The guy there told me that the logical block size the device exports is actually 512 B.

    This is the output of dmesg:

    [    8.948835] sd 3:0:0:0: [sdb] 68717412272 512-byte logical blocks: (35.2 TB/32.0 TiB)
    [    8.948839] sd 3:0:0:0: [sdb] 4096-byte physical blocks
    [    8.950145] sd 3:0:0:0: [sdb] Write Protect is off
    [    8.950149] sd 3:0:0:0: [sdb] Mode Sense: 43 00 10 08
    [    8.950731] sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
    [    8.985168]  sdb: sdb1 sdb9
    [    8.987957] sd 3:0:0:0: [sdb] Attached SCSI disk
    

    So 512 B logical blocks but 4096 B physical block?!
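
    A quick way to cross-check the sector sizes the kernel reports (a sketch; the sdb device name is taken from the dmesg output above):

    # logical vs. physical sector size as seen by the kernel
    lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdb
    cat /sys/block/sdb/queue/logical_block_size
    cat /sys/block/sdb/queue/physical_block_size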

    They are providing me with a temporary file system to which I can back up the data. Then I will first test the speed of the raw device before setting up the pool from scratch. I will post an update.

    Update 2

    I destroyed the original pool. Then I ran some speed tests using dd; the results are OK, around 80 MB/s in both directions.
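
    The raw-device tests were along these lines (a sketch, not the exact commands; block size and count are arbitrary, and the write direction destroys data on the device):

    # sequential read from the raw block device
    dd if=/dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d of=/dev/null bs=1M count=4096

    # sequential write to the raw block device (destructive!)
    dd if=/dev/zero of=/dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d bs=1M count=4096 oflag=direct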

    As a further check I created an ext4 partition on the device. I copied a large zip file to this ext4 partition and the average write speed is around 40 MB/s. Not great, but enough for my purposes.

    I continued by creating a new storage pool with the following commands

    zpool create -o ashift=12 tank /dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d
    zfs set compression=on tank
    zfs set atime=off tank
    zfs create tank/test
    

    Then, I again copied a zip file to the newly created test file system. The write speed is poor, just around 2-5 MB/s.

    Any ideas?

    Update 3

    txg_sync is blocked when I copy the files. I opened an issue on the GitHub repository of ZoL.
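
    The stall can be seen with checks along these lines (a sketch; the pool name tank matches the pool created above, and the kstat path applies to ZFS on Linux):

    # hung-task warnings for the ZFS sync thread show up in the kernel log
    dmesg | grep -i -A 5 txg_sync

    # per-transaction-group timing from the ZoL kstat interface
    cat /proc/spl/kstat/zfs/tank/txgs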

    • ewwhite
      ewwhite almost 8 years
      Do we know anything about how the storage device is connected to the VM? Also, you don't appear to have compression enabled.
    • BayerSe
      BayerSe almost 8 years
      They say it is 10 GbE. On the test file system I disabled compression on purpose so as not to be CPU-bound. However, the results are approximately the same whether compression is enabled or not.
    • user121391
      user121391 almost 8 years
      Network throughput would only be of concern if you do not get more than 110 MB/s, which is far beyond your current speed. You need to ask them about the kind of storage subsystem, the maximum, average and minimum expected performance for random and sequential access, and the blocksize on which it is aligned.
    • Andrew Henle
      Andrew Henle almost 8 years
      What's the raw disk write performance? Can you test that? Because if the raw disk can't meet your performance requirements, there's no file system in the universe that will save you.
    • BayerSe
      BayerSe almost 8 years
      @AndrewHenle In the IT department they tested the read speed of the raw disk using dd. It is about 90 MB/s (as opposed to about 40-50 MB/s on the file system). I'll add write speed results.
    • Andrew Henle
      Andrew Henle almost 8 years
      @BayerSe dd testing will test sequential read/write performance. Sequential operations like that are often coalesced into large blocks to/from the actual disk(s) via caching and the use of either read-ahead or write-behind. File system access can be extremely random and in small blocks, which doesn't lend itself to caching or read-ahead. A disk system can give good large-block sequential performance while still having abysmal random, small-block performance - especially write performance. dd testing is an easy start, because if it's poor, everything else will also be poor.
    • user121391
      user121391 almost 8 years
      "So 512 B logical blocks but 4096 B physical block?!" That is (was) not that uncommon - newer disks used 4k bytes sectors internally, but presented 512 bytes to the operating system, known as "4k/512e" ("4k emulated") as opposed to the older 512/512 ("512 native") or the newer 4k/4k ("4k native").
    • not-a-user
      not-a-user almost 7 years
      Any progress on this? I have the same issue on arch/armv7. Somehow it seems neither CPU-bound (the frequency governor does not scale up) nor IO-bound (the same crappy 4 MB/s write speed for both an HDD and an eMMC-backed loop device). Is your Ubuntu guest 32-bit or 64-bit? (What does uname -a say?)
    • BayerSe
      BayerSe almost 7 years
      @not-a-user Linux TAQ-Server 4.4.0-92-generic #115-Ubuntu SMP Thu Aug 10 09:04:33 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux is the output. I never directly solved the problem, but after re-creating the block device and applying some settings discussed here: list.zfsonlinux.org/pipermail/zfs-discuss/2016-July/025979.html, the problem vanished.
    • BayerSe
      BayerSe almost 7 years
      @not-a-user This is what I used in the end: gist.github.com/BayerSe/393b4664d42b85ade63660fb1f357482
    • Rob Pearson
      Rob Pearson about 6 years
      For 30 TB, ZFS needs FAR more RAM than 12 GB. You should be at 48 GB minimum. Math for it: 8 GB baseline + 30 GB (1 GB per managed TB) = 38 GB. But 38 isn't a sensible size, so your next stop is 48 GB. serverfault.com/questions/569354/…
  • ewwhite
    ewwhite almost 8 years
    The storage is abstracted. It's probably an export from a SAN. The ashift may not make a difference here.
  • BayerSe
    BayerSe almost 8 years
    I'm confused. The zdb command says ashift=0, zpool get all says it's 9. Which is the correct value? And what should I ask the IT guys in order to figure out whether ashift=12 would be the correct value?
  • Tero Kilkanen
    Tero Kilkanen almost 8 years
    Actually, zdb tells us that the iSCSI device has ashift=9, while zpool get all says it is 0. I don't actually know what minimum write block size is used when ashift is 0. You can try both ashift=9 and ashift=12. You need to ask what the smallest block size is that the storage system can write without triggering a read-modify-write cycle.
  • Andrew Henle
    Andrew Henle almost 8 years
    No real answers are possible without detailed knowledge of what that iSCSI disk actually is. For all we know it's a 37-disk RAID5 array of mixed 5400 and 7200 RPM SATA drives with a per-disk segment size of 1 MB that's been partitioned into 137 LUNs that are all utterly misaligned. If something like that is true (and I've seen incompetent SAN setups like that all too often), the OP's task is likely hopeless. If the system is still in test and the ZFS file system can be safely destroyed, raw disk write performance using something like Bonnie or even just dd would be a good data point to have.