Optimal ARC and L2ARC settings for a purpose-specific storage application

First, I really suggest you reconsider the layout of pools #2 and #3: a 3-way mirror is not going to give you low latency or high bandwidth. Rather than an expensive 1 TB NVMe disk for L2ARC (which, by the way, is unbalanced relative to the small 32 GB ARC), I would use more 7200 RPM disks in a RAID10 fashion, or even cheaper but reliable SSDs (e.g. Samsung 850 Pro/Evo or Crucial MX500).

At the very least, you could put all disks into a single RAID10 pool (with an SSD L2ARC) and segment that single pool by means of multiple datasets.
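For example, a minimal sketch of such a single RAID10 pool with an SSD L2ARC and one dataset per workload (the pool name, device names and dataset names are only placeholders for illustration):

    zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf mirror sdg sdh
    zpool add tank cache nvme0n1          # SSD/NVMe device used as L2ARC
    zfs create tank/video                 # bulk raw video and backups
    zfs create tank/frames                # compressed video frames
    zfs create tank/sysimages             # system storage for the client machines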

That said, you can control how ARC/L2ARC are used on a dataset-by-dataset basis with the primarycache and secondarycache properties (a consolidated example follows the list below):

  • zfs set primarycache=none <dataset1> ; zfs set secondarycache=none <dataset1> will disable any ARC/L2ARC caching for the dataset. You can also issue zfs set logbias=throughput <dataset1> to favor throughput rather than latency during write operations;
  • zfs set primarycache=metadata <dataset2> will enable metadata-only caching for the second dataset. Please note that the L2ARC is fed from the ARC; this means that if the ARC is caching metadata only, the same will be true for the L2ARC;
  • leave the default ARC/L2ARC options for the third dataset.
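Applied to the hypothetical datasets of the sketch above, this would look like the following (again, names are placeholders):

    # dataset 1: sequential bulk data, no ARC/L2ARC caching, throughput-oriented writes
    zfs set primarycache=none tank/video
    zfs set secondarycache=none tank/video
    zfs set logbias=throughput tank/video
    # dataset 2: metadata-only ARC; as the L2ARC is fed from the ARC, it will also hold metadata only
    zfs set primarycache=metadata tank/frames
    # dataset 3: keep the defaults (primarycache=all, secondarycache=all)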

Finally, you can allow your ZFS instance to use more than the default 50% of your RAM for the ARC (look for the zfs_arc_max parameter in the zfs module man page).
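As a minimal sketch for ZFS on Linux, assuming you want to let the ARC grow to 24 GB (the value is purely illustrative):

    # persistent, applied at module load time
    echo "options zfs zfs_arc_max=25769803776" >> /etc/modprobe.d/zfs.conf
    # or immediately, at runtime
    echo 25769803776 > /sys/module/zfs/parameters/zfs_arc_max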



Comments

  • dtech
    dtech over 1 year

    I am configuring a server that runs 3 ZFS pools, 2 of which are rather purpose-specific, and I feel like the default recommendations are simply not optimized for them. Networking is provided by dual 10 Gbit adapters.

    Pool 1 is big-file storage: it contains raw video data that is rarely written and read, plus occasional backups. There is absolutely no point in caching anything from that pool, as it is high-bandwidth data that is read through in one sweep from beginning to end; caching anything from it would be a complete waste of memory. Latency is not much of an issue, and bandwidth is ample due to highly compressible data. The pool is made of 8 HDDs in z2 mode, with 24 TB of usable capacity.

    Pool 2 is compressed video frame storage. Portions of this content are frequently read when compositing video projects. The portion of frequently used data is usually larger than the total amount of RAM the server has; there is a low-latency requirement, but not ultra-low, and bandwidth is more important. The pool is made of 3 HDDs in z1, with 8 TB of usable capacity, plus a 1 TB NVMe SSD for L2ARC.

    Pool 3 is general storage for several computer systems that boot and run software from it rather than from local storage. Since it has to serve several machines as their primary system storage, the latency and bandwidth requirements here are the highest. This pool is mostly read from; writes are limited to what the client systems do. The pool is made of 3 SATA SSDs in z1 mode, with 1 TB of usable capacity.

    My optimization goal is to minimize the ARC size for the first two pools in order to maximize the ARC size for the third one.

    Pool 1 has no benefit from caching whatsoever, so what's the minimum safe amount of ARC I can set for it?

    Pool 2 can benefit from ARC but it is not really worth it, as the L2ARC is fast enough for the purpose and the drive has 1 TB of capacity. Ideally, I would be happy if I could get away without using any ARC for this volume and using the full terabyte of L2ARC, but it seems that at least some ARC is needed for L2ARC header data.

    So considering an L2ARC capacity of 1 TB and a pool record size of 64 KB, 1 TB / 64 KB * 70 B gives me ~0.995 GB (see the quick check at the end of this comment). Does this mean I can safely cap the ARC for that pool at 1 GB? Or maybe it needs more?

    It seems that the ARC contains both the read cache and the information needed to manage the L2ARC, so it looks like what I need is some option to emphasize managing a larger L2ARC rather than caching actual data in RAM. And, if necessary, to mandate that anything evicted from the ARC is moved to the L2ARC, in case the eviction policies do not follow the usual caching hierarchy behavior.

    The general recommendations I've read suggest about 1 GB of RAM per 1 TB of storage, which I am almost dead on with 32 GB of RAM for 33 TB of storage, but also an L2ARC-to-ARC ratio of 4:1 or 5:1, which I am far from meeting. The goal is to cut pool 1's ARC as low as possible, and to cut pool 2's ARC to only as much as it needs to utilize the whole 1 TB of L2ARC, in order to maximize the RAM available for pool 3's ARC.
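    As a quick check of the header math above (1 TB of L2ARC, 64 KB records, ~70 bytes of ARC header per cached record, the 70-byte figure being the commonly quoted approximation):

        echo $(( 1000000000000 / 65536 * 70 ))        # ~1068115230 bytes, just under 1 GiB
        grep l2_hdr_size /proc/spl/kstat/zfs/arcstats # actual ARC memory spent on L2ARC headers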

  • dtech
    dtech about 6 years
    The idea behind pool 2 is that the low latency and high bandwidth will come from the L2ARC, applied to the files that are in recent use. The HDDs are there for bulk storage of cold frames and some basic redundancy. RAM caching is overkill for that pool, which is why I want to make it a pseudo-tier between the NVMe SSD for hot files and the HDDs for cold files. I am not that keen on RAID10, as it sacrifices quite a lot of capacity and redundancy.
  • shodanshok
    shodanshok about 6 years
    In ZFS, the L2ARC is not a magical "low latency/high bandwidth" bullet. As it is only fed from the ARC (and never from the disks), you can quite often end up with data that is evicted from the ARC before it is pushed to the L2ARC. Moreover, the L2ARC is not persistent across reboots (albeit work is in progress for that), so a reboot will drop anything that was on the L2ARC (with much slower access to user data as a result).
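    One way to see how much data actually ends up on the L2ARC is to watch the relevant counters (a minimal sketch for ZFS on Linux, assuming the usual kstat location):

        grep -E '^l2_(size|hits|misses)' /proc/spl/kstat/zfs/arcstats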
  • dtech
    dtech about 6 years
    Reboots are not a problem; with my setup a reboot would only take place in the event of a hardware failure that cannot be addressed via hotswap. I don't know any implementation details, but it seems unreasonable that cache might be evicted from the ARC without being pushed to the L2ARC. Why would something like that happen? Shouldn't cache be moved from L1 to L2? That's how cache hierarchies usually work. Is that not the case with ZFS?
  • dtech
    dtech about 6 years
    I mean, from what I know about caches, accessing a sector or line or whatever caches it at L1, and it is usually FIFO, so as soon as cache capacity runs out, the first cached entity is pushed to a higher level of cache, or purged if there isn't one. So everything that is read should end up in the L2 cache until it fills up and is displaced.
  • shodanshok
    shodanshok about 6 years
    @dtech because the L2ARC is not a cache in the stricter sense. It is fed by the ARC, but it does not directly receive evicted ARC entries. This is by design, to keep the slower L2ARC off the critical path. You can read more here. For reboots: how are you planning to patch kernel- or libc-level bugs/CVEs? You have to plan for reboots.
  • dtech
    dtech about 6 years
    I am a fan of "if it works, don't touch it". That machine is presumed to work as expected and do only what it is intended to. Vulnerability is a non-issue since that network segment is offline, which naturally takes any patching or updating out of the equation. There is also an on-site power generator that lasts for 8 hours unsupervised, and indefinitely as long as there is someone to pour diesel. And even if that runs out, the system is hibernated rather than shut down; RAM content is saved, so the L2ARC should remain relevant, right?
  • dtech
    dtech about 6 years
    For the cache eviction: if "the current algorithm puts segments that are hit (an L2ARC cache hit) at the top of the list, such that a segment with no hits gets evicted sooner" is indeed implemented, that would be even better than FIFO, although for my use case even FIFO, which is pretty much the dumbest caching policy, would cut it. I can increase the record size even further; the overhead will not be significant because files on that pool are no smaller than several megabytes. It would be nice to have an option to configure what the ARC is used for and to give priority to L2ARC header data.
  • shodanshok
    shodanshok about 6 years
    @dtech you are missing the fact that, under memory pressure, a block cached in the ARC can be evicted before the feeding thread puts it in the L2ARC. In short: do not underplay the importance of main-pool IOPS. The L2ARC is of great help, but a 3-way mirror will not give you outstanding performance. If you can live with that, great. But you should really profile/simulate your expected IOPS load.
  • dtech
    dtech about 6 years
    The whole point of the question is to alleviate memory pressure by allocating it where it matters the most. Maybe there is a way to enforce that evicted cache is always moved to the L2ARC, even at the cost of a stall and an IOPS hit for that pool?
  • shodanshok
    shodanshok about 6 years
    @dtech no, what you are asking for is not possible. To maximize the L2ARC you could enable caching of prefetched data (l2arc_noprefetch=0), but it has its drawbacks. To "allocate cache where it matters", you have to use the primarycache/secondarycache parameters described above.
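    If you want to experiment with it anyway, a minimal sketch for ZFS on Linux (runtime and persistent forms):

        echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch              # also cache prefetched/streaming buffers in L2ARC
        echo "options zfs l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf # persists across module reloads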
  • dtech
    dtech about 6 years
    Drawbacks aside, I don't see how this may be beneficial in my case. Moving pre-fetches that didn't get a hit to the L2ARC doesn't seem beneficial at all. Sounds like more flash wear and cache pollution. Disabling prefetching might be useful though, in terms of keeping data that might not be useful out of the ARC. On a side note, if there exists an option to go as far as to cache prefetches that don't get a hit, there should be a guarantee or at least an option to ensure the caching of the data from the ARC that actually gets hit.
  • shodanshok
    shodanshok about 6 years
    @dtech what you are asking for (direct ARC -> L2ARC eviction) simply does not exist in ZFS. Feel free to ask on the mailing list or even check the source code.
  • dtech
    dtech about 6 years
    Too bad if true; it is not an unreasonable thing to expect.