Do I really need a ZIL SLOG?

Solution 1

Thanks go to user121391 for the answer. I wanted to follow up on one point though in rather more detail than a comment would allow.

My understanding is that any non-PLP device is worse than useless [...]

The purpose of the ZIL is to provide the assurances of synchronous writes without the overhead (and other problems) of actually doing synchronous writes.

Actual synchronous writes are potentially slow (due to writing to the "slower" main storage pool) and less efficient due to not being able to wait and batch writes together. The former isn't as much of an issue with an SSD pool, but the latter can matter, especially if applications are doing something pathological such as writing a few bytes at a time to a file and then issuing a sync.
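That pathological pattern is easy to reproduce outside ZFS. The sketch below (plain temp files, GNU coreutils assumed, no ZFS involved) forces a flush after every single byte, versus one flush for a batched write:

```shell
# Worst case: 100 one-byte writes, each forced to stable storage individually.
f=$(mktemp)
for i in $(seq 1 100); do
  printf 'x' | dd of="$f" bs=1 seek=$((i - 1)) conv=notrunc oflag=sync status=none
done

# Batched equivalent: the same 100 bytes, flushed to stable storage once.
g=$(mktemp)
printf 'x%.0s' $(seq 1 100) | dd of="$g" conv=fsync status=none
```

Both files end up identical, but the first variant issued 100 separate synchronous flushes; on a real device that flush count, not bandwidth, is the cost the ZIL exists to absorb.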

The ZIL is intended to be a "safe" write buffer that records a log of pending writes, so that in the event of a failure (loss of power, kernel panic, etc.), writes are not lost. Without a separate ZIL device ("SLOG"), the ZIL normally lives on the primary storage pool itself, which is of course slow. It also means data is written twice (once to the ZIL, once at the regular transaction commit), which for an SSD pool is bad.

This leaves us with four options, which mostly resemble the ones in my question:

  • Disable the ZIL and write directly to the pool.
  • Use a non-PLP SLOG.
  • Use a PLP SLOG.
  • Keep the ZIL in RAM only.

Here, I'm only concerned with why #2 is worse than any other option... More accurately, it is strictly worse than #4. (Except it isn't, but we'll get back to that.)
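For concreteness, here is how I understand these options map onto ZFS knobs. The pool name tank and the device path are placeholders, and the mapping (especially for #1) is my approximation rather than anything official:

```shell
# Option 2 or 3: attach a dedicated log device (PLP or not) to the pool.
zpool add tank log /dev/nvme0n1

# Option 1 (roughly): bias sync data away from the log, straight to the pool.
zfs set logbias=throughput tank

# Option 4: "RAM-only ZIL" -- sync requests are acknowledged but not logged.
zfs set sync=disabled tank

# Default behavior, for reference: honor sync writes via the (in-pool) ZIL.
zfs set sync=standard tank
```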

The ZIL is a write-often, read-almost-never buffer. The only case in which the ZIL is read is if the system goes down hard (e.g. power loss, kernel panic, etc.). In such an instance, the ZIL is used to reconstruct writes that the user/application was assured are safely committed to disk, but actually weren't.

From a safety standpoint, it is critical that, once the ZIL SLOG reports that data has been written, it really has been written. However, non-PLP devices often use internal (to the hardware) caching and lie about this. Thus, if the system goes down suddenly, a non-PLP device is at risk of corrupting the ZIL, just as the in-RAM ZIL would be lost. Thus, in the event of a power failure, a non-PLP SLOG is at serious risk of offering no advantage over only keeping the ZIL in RAM.

OTOH, if the system goes down for other reasons, a non-PLP ZIL will hopefully still finish writing before the system loses power (if it ever does; I'm not sure a soft reset is even noticed by the drive) and so would still protect the pool. A cheap SSD is therefore better than a RAM-only ZIL, but only for cases where the system goes down without losing power. In the event of a power loss, a non-PLP ZIL is about as dangerous as a RAM-only ZIL, yet it is also slower; all of the drawbacks of an SLOG, with only part of the advantages of one with PLP.

Intel Optane presents an interesting case; while it doesn't technically have PLP, in theory it is just as safe without it. The reason goes back to why we need PLP in the first place: drives have internal caches and lie about synchronous writes to improve performance. As I understand it, Optane has no cache, so when ZFS does its synchronous writes to the ZIL (only then reporting back to the user/application that the sync is completed), the writes really are safely committed to the device.

Optane is interesting (read: makes a really good choice for an SLOG) in some other ways. While it may not be the fastest on paper, it has low latency, which I've seen reported as the most important factor in ZIL performance. The 900p also has absolutely stunning write endurance (5.1 PBW is about 3.5× the closest competitor I've seen, around 8× most "traditional" SSDs in its price range, and a whopping 24× that of the 800P, which is a third the price). Combine that with the way it tends to blow everything else out of the water in benchmarks designed to simulate the ZIL, and the choices really come down to those in my original question: a) 900p, b) RAM-only, or c) disable the ZIL.

...The last is probably fine performance-wise, but the drive endurance aspects make me jittery.

Solution 2

First things first: great post, well written and thoroughly researched! You have already found many of the answers yourself, so I will just add some thoughts.


My suggestion: just try it out. Buy the minimal viable configuration (just your pool disks) and set it up, then use it and check whether it is fast enough (my guess, if you don't serve VMs directly off it, would be "yes").

If not, then add a separate device. You can add and remove as many as you like (mirrored) without destroying the pool.
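Assuming a pool named tank and placeholder device paths, adding and later removing a mirrored log device looks like this; log vdevs are removable without harming the pool:

```shell
# Attach a mirrored SLOG to an existing pool:
zpool add tank log mirror /dev/disk-a /dev/disk-b

# See how it shows up (the log vdev gets a name like "mirror-1"):
zpool status tank

# Remove it again later, using the vdev name from "zpool status":
zpool remove tank mirror-1
```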

You should also evaluate why you want your specific all-SSD design. It will cost you more per GB and is most likely overkill for your needs (for details, see the last paragraph of this answer). On the other hand, if your goal is silence and simplicity in a small form factor, it would be a viable alternative, in which case I would personally not invest in a separate log device because of cost (performance would be fine anyway).


Add an SLOG. My understanding is that any non-PLP device is worse than useless (at least, worse than the two other options to follow), but in theory I'd be okay with an Intel 900p, which seems to be the cheapest option.

It depends on the device and the specific case... in general, PLP is recommended because on a $5,000 storage array it does not pay to save $50 by choosing inferior SSDs; the tiny probability that something will go wrong is just not worth it, especially as you don't need large SSDs (a few GB are plenty). If one disk fails and the other does not, or they do not fail at the same sectors (this depends on the disk and its on-device controller), you would be fine. The thing is, no one can guarantee that, so you add safety measures and reduce your risk.

You can compare it with pool design. A single disk is cheap, but if something goes wrong you are screwed. Now you can either buy a better disk, or two cheap ones, or two better ones, or three cheap ones, and so on... in the end you must calculate your risk and balance it against your budget. Extremes are bad in most cases (a single cheap disk vs. a 32-way mirror of the best disks available), so a healthy middle ground is often advised (a two-way mirror for small deployments and a three-way mirror for large ones).

Your SLOG device just needs three things:

  1. Better sustained IOPS than your weakest pool disk. If the SLOG device is slower, you'd be better off just using the pool without one, or even adding it to the pool as another mirror (slower, but safer).

  2. High write endurance. If it is low, you must replace it sooner. In your case this should not matter much, because write volume would be very low either way and replacement can be staggered by adding the second disk a few days later.

  3. Reliability, which determines the likelihood of critical faults. It can be improved by using multiple disks (mirrors), by using PLP, or by using quality disks (or all three at the same time, see above). Again, how much this matters depends on the speed (random IOPS) of your pool disks.
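To check point 1 empirically, the usual tool is fio; the invocation below is a rough sketch of the workload an SLOG actually sees (small synchronous writes at queue depth 1). The file path is a placeholder, and note that this writes real data to wherever you point it:

```shell
fio --name=slog-probe --filename=/mnt/scratch/fio-test --size=256m \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 --sync=1 \
    --runtime=30 --time_based --group_reporting
```

Compare the reported completion latency for the candidate SLOG against your pool disks; if the "SLOG" isn't clearly faster, it isn't earning its keep.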

Use RAM-only ZIL. (Basically, lie about sync.) While this "sounds" bad, AFAIU it won't affect the integrity of my pool, and for my usage, I'm not sure it's worth the extra cost. Any sort of failure is going to risk that I lose data, just because the NAS is offline. Most likely I'll know right away that something went wrong and will be able to take remedial steps of some sort (e.g. save to some other location temporarily until I can get the NAS back up).

I would only do this if you have a specialized application running on it that knows it can't trust the storage completely - for example, in a multi-node mirror setup where periodic synchronization detects any errors on nodes that were down temporarily.

If you just have a single machine, you would get cheap performance with greatly added risk. A single unclean accidental shutdown (button pressed or cable to UPS yanked by kid, etc) could pose risks and certainly takes a toll on your peace of mind.
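If you do go down this road anyway, note that the setting is per-dataset, so the risk can at least be confined to data you can afford to lose; the dataset names here are placeholders:

```shell
# Lie about sync only where losing in-flight writes is acceptable:
zfs set sync=disabled tank/vm-scratch

# Keep honoring sync requests for anything irreplaceable (the default):
zfs set sync=standard tank/family-archive

# Review the resulting settings across the pool:
zfs get -r sync tank
```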

Don't use a ZIL; force all writes to go straight to disk. While this would be a performance killer on spinning rust, it's not clear that the ZIL is even useful on an all-SSD system, since the main concerns that make it useful with spinning rust (slower drives, seek latency) don't apply.

It only affects performance for those special sync random writes... you did not state what the machine would be used for, but if I imagine typical home use, then it would be:

  • Regular client backups (assuming Windows/SMB/CIFS): not affected, because these are sequential writes to async datasets
  • Long-term archival of family data: not affected, because this is mostly data at rest and will be copied the same way as backups
  • Serving media libraries to clients, including transcoding: not affected, because this is read-only and transient transcoding output does not go back to storage
  • Running a few virtual machines for services like mail, cloud access, etc.: would theoretically be affected, but you would not notice the performance hit if just a few clients access it (and not, say, a thousand or more per second)

Even normal spinning disks (as mirrors) would satisfy those constraints easily; SSDs will just blaze through them. Your machine will be idle most of the time, and your money would be better invested in more storage or a better offsite backup system.


Author: Matthew
Updated on September 18, 2022

Comments

  • Matthew (almost 2 years ago)

    I'm building a NAS for home/personal use that will use ZFS (probably running FreeNAS) over a SATA SSD array. In terms of usage, I expect the system to be "idle" (not counting ZFS background stuff e.g. scrubs) more often than not, and I expect the main performance bottleneck to be the 1Gb ethernet.

    I'm vaguely familiar with the ZIL, but confused about the use of a SLOG / secondary ZIL storage device.

    The system is going to be on a UPS, so it's probably more likely to experience something like a kernel panic than a sudden loss of power. Regardless, as long as it doesn't eat my pool, I'm not particularly bothered that a catastrophic event might cause the last few minutes' data to be lost. (Remember, this is a home system, not something mission critical.)

    In particular, is it possible to have only the primary ZIL in RAM (and what would be the "safety" impact of that)? If I don't have a dedicated SLOG device, does that mean I am forced to use the storage pool for SLOG, and what is the actual impact (both performance and wear) of that? If I do need a dedicated device, is a high-performing NVMe SSD (e.g. a modest-sized WD Black SN750) sufficient or do I really need to spend $250 on an Intel 900p? (see update)

    Most of what I've been able to find says "yes, that $250 900p is absolutely vital", but doesn't really explain what situation I'd be in if I omit it.

    Update:

    So, most of what I've read is that a) having the ZIL on the primary pool is horrible (it halves performance and, even worse for SSDs, doubles writes), and b) the main reason a ZIL is "needed" is to reduce latency for sync writes. Given that my pool is all-SSD (albeit SATA, but OTOH my users are all bottlenecked by, at best, 1Gb LAN), it seems like I have three options:

    • Add an SLOG. My understanding is that any non-PLP device is worse than useless (at least, worse than the two other options to follow), but in theory I'd be okay with an Intel 900p, which seems to be the cheapest option.
    • Use RAM-only ZIL. (Basically, lie about sync.) While this "sounds" bad, AFAIU it won't affect the integrity of my pool, and for my usage, I'm not sure it's worth the extra cost. Any sort of failure is going to risk that I lose data, just because the NAS is offline. Most likely I'll know right away that something went wrong and will be able to take remedial steps of some sort (e.g. save to some other location temporarily until I can get the NAS back up).
    • Don't use a ZIL; force all writes to go straight to disk. While this would be a performance killer on spinning rust, it's not clear that the ZIL is even useful on an all-SSD system, since the main concerns that make it useful with spinning rust (slower drives, seek latency) don't apply.

    As far as option #2 goes, I've seen threads where folks note that "data integrity is data integrity"... except, really, it isn't. There is a significant difference between losing whatever file I just tried to write (and likely knowing immediately that something went wrong), where I stand an excellent chance of being able to recover manually somehow, and losing files that were created months or years ago. I can see how, in general, this could go either way, but in my case, I'm more concerned with archival integrity.

  • Matthew (about 5 years ago)
    Thanks for the answer! Reasons for SSDs are much as you've mentioned: small size, lower power use, silence... which ties into your observation "Your machine will be idle most of the time". Yes, it will, and SSDs just sitting there make me much less twitchy than HDDs spinning constantly. Maybe that's not rational, but then, humans aren't very rational. (And, yes, I also need to work on my offline backup!) Anyway, the main use is data storage... long term, not much rewriting but a fair mix of reads and additions, plus some private git hosting.
  • Matthew (about 5 years ago)
    As I understand it, skipping the ZIL makes writes less efficient... on an SSD array, that's sub-optimal for endurance, and since longevity is part of the point of this exercise, I'm not looking fondly at that option. So I'm at "is it worth $250 to guard against the possibility of losing a few seconds' worth of writes if the system goes down hard?".