ZFS: Configuration advice for 1x NVMe as ARC and ZIL and 4x SSDs for zvols for virtualization


Solution 1

In case anyone wonders: we think the main problem is RAM (our ARC is limited to 4GB and everything else is eaten by the system). The deal with ZFS at the moment is that it is not ready for SSDs and/or NVMe. It was made for HDDs: slow and bulky, with their silly heads, mechanics and predictable issues.

With SSDs and NVMe, ZFS performs silly things they don't need and does not do the things they actually need. Back when ZFS was invented, nothing beyond non-volatile RAM had been thought of as cache.

Now we can put 4x PCIe SSDs, with 4TB of space, into a system.

There are two ways to handle this in such a case: either give ZFS enough memory to perform properly on your SSDs despite the overhead it brings, or don't use ZFS.

It's a shame, because its structural benefits are pretty good. But it can't handle SSDs properly without higher RAM usage than with HDDs, because all the settings and defaults tell it "the underlying storage is slow, cache is necessary, read small, write big and sequential", while SSDs are fast, don't need cache, can read big, write big and handle random I/O properly. With Optane these problems will be even more obvious.

Things that are more or less not needed: extensive caching, and checksumming of every record (it doesn't make sense, because if you have bit rot at the SSD level you should throw away the whole drive; such a drive may have a broken controller ruining all your data, similar to bad RAM). A separate ZIL (SLOG) is not needed at all. The ARC is also of little use, especially with Optane drives (it just adds CPU and RAM overhead). Record size should be limited to exactly the transaction sizes the drive understands.
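
For reference, a minimal sketch of the knobs being referred to here, using a hypothetical dataset name (tank/vm); note that other answers here disagree with turning some of these off:

    # Hypothetical dataset "tank/vm"; shown only to illustrate which properties the points above refer to.
    zfs set primarycache=metadata tank/vm   # reduce ARC usage for this dataset (or =none to bypass it)
    zfs set checksum=off tank/vm            # record-level checksumming off (generally discouraged)
    zfs set recordsize=16K tank/vm          # match record size to the write sizes the drive handles well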

Or just use LVM for KVM provisioning. Thin provisioning is not perfect there, but at least you don't need to waste extremely valuable RAM to make your SSDs perform at the level they are supposed to.
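
As a rough sketch of that route, with a hypothetical volume group name (vg0) and placeholder sizes:

    # Thin pool on the SSD volume group, then a thin volume handed to KVM as a block device.
    lvcreate --type thin-pool -L 900G -n thinpool vg0
    lvcreate --thin -V 100G -n vm101-disk0 vg0/thinpool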

Solution 2

First, ZFS checksumming is not redundant: it is an end-to-end (RAM to physical media) checksum, while the HDD/SSD checksum is only "inside the media" error control. To get something similar with a classical filesystem, you would have to use T10 DIF-compatible disks and controllers, which SATA devices lack (you would be forced to use SAS SSDs, which are much more expensive).

That said, low write performance with zvols is generally due to the very small default 8K block size, which is small enough to greatly increase metadata overhead but not small enough to prevent read-modify-write cycles for 4K writes.
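
A larger volblocksize can only be chosen at creation time; a sketch with hypothetical names and sizes:

    # Recreate the zvol with a bigger block size (pool/volume name and size are placeholders).
    zfs create -V 100G -o volblocksize=64K tank/vm101-disk0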

Another problem with consumer SATA SSDs (such as your Samsung 850 EVO) is that they do not have a power-loss-protected write cache, so ZFS constantly flushes them for metadata writeout and synchronous data writes.
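
You can check the properties that govern this flush-heavy behaviour on a given zvol; the name below is a placeholder:

    # Show whether synchronous writes and log bias are at their defaults.
    zfs get sync,logbias tank/vm101-disk0
    # sync=disabled would avoid the flushes but trades away crash consistency.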

Anyway, you should really give us more details on your benchmark methodology and real-world expected workload to get an accurate answer.
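
For instance, a typical fio run against a zvol, with the block size matched to its volblocksize (the path, sizes and the 64K figure are placeholders):

    fio --name=zvoltest --filename=/dev/zvol/tank/vm101-disk0 \
        --rw=randwrite --bs=64k --direct=1 --ioengine=libaio \
        --iodepth=16 --runtime=60 --time_based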

Solution 3

The performance is poor because the ZFS defaults are not ideal for what you're doing. Do you have anything in /etc/modprobe.d/zfs.conf? If not, it requires some tuning.
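
For example, a minimal /etc/modprobe.d/zfs.conf that only caps the ARC; the 4GB figure is just an illustration matching the limit mentioned in the question:

    # Cap the ARC at 4 GiB (value is in bytes); adjust to your RAM budget.
    options zfs zfs_arc_max=4294967296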

  • Will the VMs be running on the same server as the ZFS installation?
  • If so, a separate ZIL (SLOG) device is not necessary; it only helps synchronous write activity, like presenting NFS to VMware and some databases.
  • I use 128K block size for ZFS storage on native disks.
  • For Linux, zvols need to be created with volblocksize=128K (see the sketch after this list).
  • I use ashift=13 for all-SSD ZFS zpools, ashift=12 for everything else.
  • Don't disable ARC. Limit it if necessary, but it sounds like you don't have much RAM.
  • Don't disable checksumming.
  • DO enable LZ4 compression! No reason not to.
  • What are you intending to do with NVMe + 4xSSD?
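
Putting those points together, a sketch with hypothetical pool, device and zvol names:

    # All-SSD pool with ashift=13 and LZ4 compression set at creation time.
    zpool create -o ashift=13 -O compression=lz4 tank sdb sdc sdd sde

    # 128K-volblocksize zvol for a KVM guest.
    zfs create -V 200G -o volblocksize=128K tank/vm101-disk0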

Solution 4

Specifically, if someone uses Docker (as I do), UFS is not a real production solution if you build regularly or have many containers and volumes (as I do :)).

Since Docker is able to use a ZFS storage backend, there will still be people who want to use SSDs & Optane on a system running ZFS.
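
Switching Docker to its ZFS storage driver is a small config change; this sketch assumes /var/lib/docker already sits on a ZFS dataset:

    # Write a minimal daemon.json selecting the ZFS storage driver, then restart Docker.
    cat > /etc/docker/daemon.json <<'EOF'
    {
      "storage-driver": "zfs"
    }
    EOF
    systemctl restart docker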

@Andrew I ran into some of the same problems you did, and had to fix them with massive RAM (32GB for the ARC). The server now has 128GB of RAM overall, but it delivers performance few systems can.

Another set of people will be those running ZFS stripes on AWS to work around burst IO limits: essentially, all your EBS SSD volumes are just waiting to start showing 5.4K-RPM-SATA-like performance as soon as your burst balance declines. In that kind of situation I see ZFS suddenly switching to large sequential IO to keep up, so as long as my applications monitor the burst balance and reduce IO, ZFS will try to keep things sane.
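
As a sketch of that monitoring, the EBS burst balance is exposed as a CloudWatch metric; the volume ID and time window below are placeholders:

    aws cloudwatch get-metric-statistics \
        --namespace AWS/EBS --metric-name BurstBalance \
        --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
        --start-time 2022-09-18T00:00:00Z --end-time 2022-09-18T01:00:00Z \
        --period 300 --statistics Average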

I expect something very similar is experienced by VMware folks when their multi-tiered, hypervirtualized-beyond-sanity storage array tries to dynamically manage performance in dire times of heavy IO and rising latency.

I am aware of storage system designs in which a large RAM cache is essentially used as the write pool; this basically means the storage reports all writes as cache hits and the staging to disk happens later.

At least with ZFS I know real programmers made it.

So there is still some value left with ZFS on SSDs :) - it depends on the kinds of problems you run into.


Comments

  • Ajay
    Ajay almost 2 years

    So recently, upon testing a ZoL system, we found poor performance for random and sequential reads and poor performance for random writes on our SSDs.

    Our test system is a stripe of 2x Samsung 1TB 850 Evo SSDs, and ZFS performance was abysmal compared to LVM: reads are slower than HDDs and writes are not on par with the expected 1.7GB/s we get on LVM. It's weird, because our FreeBSD backup server has slow HDDs and an older type of SSD and performs better in the same test.

    The system is somewhat deprived of RAM though (ZFS gets 4GB for the ARC and everything else is taken by VMs); however, even with no cache and no sync, the performance is still not even close.

    So we are looking into buying newer systems based on AMD EPYC and setting up either full NVMe, or NVMe plus SSDs with the cache disabled, to free up RAM from ZFS at least a bit (we want it to use 10GB max for everything). We don't really need all the safety features of ZFS apart from checksumming (but with SSDs it seems redundant, as SSDs run their own internal checksumming), so the SSDs will be a stripe of vdevs.

    We prefer ZFS for zle compression on thin-provisioned zvols and for the ease of snapshotting and incremental backups to a remote system (which runs ZFS as well).

    However the struggle for performance is hard...

    Would greatly appreciate any advice

    • ewwhite
      ewwhite over 6 years
      Also, what do you intend to do with NVMe+4xSSD?
  • Ajay
    Ajay over 6 years
    So, regarding ZFS checksumming: as I understand it, it is mostly there to protect the integrity of the data. It is not that important for zvols, though, because the filesystem inside will perform its own integrity checks, which is why checksumming is not necessary for us (the main dataset has checksums). Yes, we found out only after setup that Samsung 850 Evos are not 4K drives, but we see 4x to 20x read and write amplification; 4x is the lowest, when we disable checksumming, metadata and syncing. logbias is already set to throughput, by the way.
  • Ajay
    Ajay over 6 years
    So we found out that 512 bytes (the default ashift) should have been used. ashift=12 might increase our read/write amplification 4x, but 8K still performed best in our benchmarks compared to 4K, 16K, 64K and 128K. The 20x amplification was also sustained because of the Debian default of an ext journal; disabling journaling in a VM's filesystem greatly improved performance. With ZFS inside a VM the overhead is incredible. So yeah, the 512-byte sector size sucks. But as I said, our next build is going to be either NVMe only or NVMe + 4x SSDs, and boy are we going to prepare. What matters here is whether the NVMe + 4x SSD setup will do any good.
  • shodanshok
    shodanshok over 6 years
    @Andrew checksumming is important, unless you use another checksumming filesystem (e.g. ZFS or BTRFS) inside your virtual machine (and even in that case, checksums at the raw block level enable smarter error recovery). That said, try running fio with an 8K read/write size and you should see much higher write performance. Even better, recreate your zvols with a bigger volblocksize (e.g. 32/64/128K) and run fio with a block size matching your volblocksize. Regarding ashift: ashift=12 (4K) was the right choice (it strikes the right balance between space, performance and write amplification).
  • Ajay
    Ajay over 6 years
    The VMs run on the same server. A 128K block size adds 4x more amplification to the numbers we get (so the total is 12x to 24x). ashift=13 is bad for the SSDs I listed but good for any NVMe or PCIe drive. Yes, as I said, we try to limit the ARC as much as possible. But anyway, LVM is doing fine without caching, so what's wrong with that? Same for checksumming: LVM has none and is OK. LZ4 is on. We intend to provide the fastest VPS services on a budget. Within our range of hardware turnover it is completely reasonable not to set up redundancy at the system level.
  • Ajay
    Ajay over 6 years
    We did try different sizes and, as I mentioned already, 8K performed the best in all categories. But it still falls far short of ext4 on LVM on the host (like 10 times slower) or even ZFS on the host (2-4x slower).
  • Ajay
    Ajay over 6 years
    Here's our rc.local (pastebin.com/AkNCrYWM), which applies everything on boot.
  • krad
    krad almost 6 years
    One thing to consider: are you using the best virtualization technique? I.e. if you are mainly running Linux VMs, could they not be containerised with LXC (not Docker)? Most stuff will just work in LXC/LXD without much modification. A lot of the IO issues you have will go away, as it's all down to block alignment through the multiple layers of filesystems, disk image formats and volume management. Containerize where you can, then leave the rest for KVM, QEMU etc.
  • Daniel Dinnyes
    Daniel Dinnyes over 4 years
    I would be interested to hear an update regarding your upgrades to NVMe SSDs and an EPYC rig. Also, I would like to understand better why both you and @demorphica say that more RAM can alleviate these issues with NVMe SSDs on ZFS. Also, did the recommendations by demorphica matter to your case?
  • Ajay
    Ajay over 4 years
    Hello, we went with lvmthin. ZFS is not really helpful when used in a cloud environment; it's a waste of RAM most of the time. For a home server it's a godsend, though.