Pros and cons of software Parity-RAID (e.g. RAID5)

Solution 1

I assume Linux's software RAID is as reliable as a hardware RAID card without a BBU and with write-back caching enabled. After all, uncommitted data in a software RAID system resides in the kernel's buffer cache, which is a form of write-back caching without battery backup.

Since every hardware RAID-5 card I have ever used allows you to enable write-back caching without having a BBU, I expect software RAID-5 can work okay for people with a certain level of risk tolerance.
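
If you want to see how much unflushed data is sitting in that kernel-level write-back cache, or push it out before a risky moment, the usual Linux knobs apply to an md array just as they do to a single disk. A minimal sketch (the sysctl values are illustrative, not recommendations):

    # How much dirty (not yet written back) data the kernel is currently holding
    grep -E 'Dirty|Writeback' /proc/meminfo

    # Flush everything in the page cache out to the array now
    sync

    # Limit how much the kernel buffers before writing back (percent of RAM)
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=10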

ObWarStory:

That having been said, I have personally experienced serious data loss due to having no BBU installed on a RAID-5 card even though write-back caching was enabled. (No UPS, either. Don't yell at me; it wasn't my call.)

My boss called me in a panic while I was on vacation because one of our production systems wouldn't come back up after a power outage. He'd run out of things to try. I had to pull off to the side of the road, pull out the laptop, turn on WiFi tethering on my phone, ssh into the stricken system, and fix it, while my family sat there with me on the side of the road until I finished restoring a roached database table from backup. (We were about a mile away from losing cell reception at the time.)

So tell me: how much would you pay for a RAID card + BBU now?

Solution 2

Just a warning: RAID-5/6 write operations take significant CPU time while your array is degraded. If your server is already fully loaded when a disk fails, it may drop into an abyss of unresponsiveness. Such a problem won't happen with a hardware RAID controller. So I'd strongly advise against using software RAID-5/6 on a production server. For a workstation or lightly loaded server, it's OK, though.
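
If you do use md RAID-5/6, you can at least keep a rebuild from monopolising the machine: the kernel exposes throttles for the resync rate. A rough sketch (speeds are in KB/s per device; the numbers are only examples):

    # Check array state and rebuild progress
    cat /proc/mdstat

    # Cap the resync/rebuild rate so normal I/O and the CPU aren't starved
    sysctl -w dev.raid.speed_limit_min=5000
    sysctl -w dev.raid.speed_limit_max=20000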

Solution 3

SW RAID does have a failure mode - if the server goes down halfway through a write, you can get a corrupted stripe. A HW RAID controller with a BBU isn't all that expensive, and it will retain dirty blocks until you can restart the disks.

The BBU on the cache does not guarantee writes in the event of power failure (i.e. it does not power the disks). It powers the cache for a few days until you can re-start the disks. Then the controller will flush any dirty buffers to disk.

Some notes about SW vs. HW RAID-5

  1. Writes on a SW RAID-5 volume can be slow if write-through caching is used with blocking I/O, as the call doesn't return until all the I/O has completed. A HW RAID controller with a BBWC can optimise this considerably, so you can see substantially better performance.

  2. The last time I looked you couldn't do direct I/O (i.e. zero-copy DMA) on a SW RAID volume. This may have changed and is really only relevant to applications like database managers using raw partitions.

  3. A modern SAS RAID controller can pull or push 1 GB/sec or more of data off a disk array, particularly if formatted with a large (say 256 KB) stripe size. I've even benchmarked an older Adaptec ASR-2200s at speeds that indicated it was pretty much saturating both its SCSI channels at 600 MB/sec+ in aggregate (10x 15k disks) with very little CPU load on the host machine. I'm not sure you could get that out of software RAID-5 without a lot of CPU load, if at all, even on a modern machine. Maybe you could read that quickly. (A rough way to measure this yourself is sketched after this list.)

  4. Configuration for booting off a HW RAID volume is simple - the RAID volume is transparent to the O/S.
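
For the throughput question in point 3, a crude way to compare arrays is a sequential dd with direct I/O, so the page cache doesn't flatter the numbers. A rough sketch (the device name and sizes are placeholders; the write test destroys data, so only run it on an array you can wipe):

    # Sequential read from the array, bypassing the page cache
    dd if=/dev/md0 of=/dev/null iflag=direct bs=1M count=4096

    # Sequential write -- DESTRUCTIVE, only on an array with nothing you care about
    dd if=/dev/zero of=/dev/md0 oflag=direct bs=1M count=4096

    # Watch the CPU cost of the parity work while the test runs
    top -b -n 1 | grep -E 'md|%Cpu'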

A low-end RAID controller from a tier-1 vendor such as Adaptec is not that expensive at retail street prices and can be purchased for peanuts off eBay. But remember, if you buy secondhand, stick to tier-1 vendors, make sure you know the model, and verify the availability of drivers from their website.

Edit: From @psusi's comment, make sure you don't get a fakeraid (transparent SW RAID hidden in the driver) controller, but most of the offerings from the bigger names (Adaptec, 3Ware or LSI) aren't fakeraid units. Anything that can take a BBU won't be fakeraid.
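
If you're unsure whether a cheap card is real hardware RAID or fakeraid, the operating system's view usually tells you: a fakeraid card exposes the individual disks plus vendor metadata, while a real controller presents only its logical volumes. A rough way to check under Linux (illustrative only):

    # Fakeraid cards typically show up as an ordinary SATA/AHCI controller
    lspci | grep -i raid

    # mdadm can spot vendor fakeraid metadata (e.g. Intel IMSM or DDF) on member disks
    mdadm --examine --scan

    # dmraid, where installed, lists BIOS-managed fakeraid sets
    dmraid -r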

Solution 4

Linux mdadm software RAID is designed to be just as reliable as a hardware RAID with a battery-backed cache. There are no problems with sudden loss of power beyond those that also apply to sudden power loss on a single disk.

When the system comes back up after a power failure, the array will be resynchronized, which basically means that the parity is recomputed to match the data that was written before the power failure. It takes some time, but really, no big deal. The resynchronization time can be greatly reduced by enabling the write-intent bitmap.
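
Enabling the write-intent bitmap is a one-liner on an existing array. A minimal sketch, assuming the array is /dev/md0:

    # Add an internal write-intent bitmap to a running array
    mdadm --grow --bitmap=internal /dev/md0

    # Confirm it (look for a "bitmap:" line)
    cat /proc/mdstat
    mdadm --detail /dev/md0

    # Remove it again if the small write-performance cost matters to you
    mdadm --grow --bitmap=none /dev/md0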

Solution 5

Here is a blog post explaining the issue with RAID-5 and how ZFS RAID-Z resolves it.

Its key points are:

RAID-5 (and other data/parity schemes such as RAID-4, RAID-6, even-odd, and Row Diagonal Parity) never quite delivered on the RAID promise -- and can't -- due to a fatal flaw known as the RAID-5 write hole. Whenever you update the data in a RAID stripe you must also update the parity, so that all disks XOR to zero -- it's that equation that allows you to reconstruct data when a disk fails. The problem is that there's no way to update two or more disks atomically, so RAID stripes can become damaged during a crash or power outage.

and

RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe width. Every block is its own RAID-Z stripe, regardless of blocksize. This means that every RAID-Z write is a full-stripe write. This, when combined with the copy-on-write transactional semantics of ZFS, completely eliminates the RAID write hole.
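
If that approach appeals to you, setting up a single-parity RAID-Z pool takes only a couple of commands. A minimal sketch, assuming ZFS is installed and the disks sdb, sdc and sdd hold nothing you need:

    # Create a RAID-Z (single parity) pool named "tank" from three whole disks
    zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd

    # Check pool health and layout
    zpool status tank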


Comments

  • user773568
    user773568 almost 2 years

    I was recently told about some problems concerning Parity-RAIDs without a non-volatile cache. More expensive HW controllers do have battery-powered caches to finish write operations in case of power failure. Now, some people say that such a failure, perhaps in combination with a degraded array, may kill your whole filesystem. Others claim that those issues are outdated and/or misconceptions.

    Unfortunately, nobody gives hard references, and neither a search for md RAID and non-volatile cache nor one for bitmap caching gives a reliable answer about whether md-RAID5 is advisable or not.

    Any information about that?

    • Gilles 'SO- stop being evil'
      Gilles 'SO- stop being evil' almost 13 years
      Linux's mdraid does have a journal, which at least is safe in case of sudden halt (i.e. system crash, or all disks power down at once), or in the case of RAID-1. I don't know if its RAID-[56] copes with all forms of staged power down.
    • psusi
      psusi almost 13 years
      @Gilles, it does not have a journal, but it does have an optional write-intent bitmap. This just speeds up the process of resynchronization when the system comes back up though, because it identifies which areas need to be resynchronized and which can be skipped.
    • Gilles 'SO- stop being evil'
      Gilles 'SO- stop being evil' almost 13 years
      @psusi I went back and checked what I'd researched a few months back. As far as I understand, the write-intent bitmap (not a journal, my mistake) does more than speed up resynchronization, it indicates which of the components have a dirty block, so that's enough to ensure block consistency on RAID-1 but not on RAID-5 (where you might not have enough disks in both the old and the new state to restore either state).
    • psusi
      psusi almost 13 years
      @Gilles I'm not sure what you mean by "block consistency". You seem to be hung up on the entire stripe being in either the state before the write started, or the state after. This is never guaranteed and is entirely likely not to be the case if the power fails in the middle of the write. It is up to the filesystem to handle this just like it does on a single disk: using the journal.
  • user773568
    user773568 almost 13 years
    What do you think about a journal on a fast SSD?
  • user773568
    user773568 almost 13 years
    That sounds a bit optimistic. How can a pure software solution be as reliable as a battery backed cache?
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 13 years
    There are bad things that can happen to a RAID array that can't happen to a single disk. With a single disk, every sector is in either the old or the new state. With e.g. RAID-5 over 4+1 disks, what if sector 42 of disks 1 and 2 are in the old state and sector 42 of disks 3, 4 and 5 are in the new state? Neither the old state nor the new state are recoverable. I don't know if Linux takes measures to avoid this, and this is what the question is about.
  • BrettRobi
    BrettRobi almost 13 years
    That'd do it, but at that point you're paying more than a decent controller would cost in the first place ;) Also, speed goes up, but reliability goes down, because most SSDs die very, very quickly.
  • psusi
    psusi almost 13 years
    @user773568 umm... I just explained how?
  • psusi
    psusi almost 13 years
    @Gilles you have just restated the same case as the single disk. Some sectors are in the old state, and some are in the new state. It doesn't matter which disk they are on. Filesystems deal with incomplete writes during a crash with the journal.
  • psusi
    psusi almost 13 years
    @Marcin what do you base that on? They don't seem to have a shorter specified design lifetime, and I have had one for over a year and have only used 5% of its write cycles.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 13 years
    @psusi No: with a single disk, each sector is in either the new state or the old state. With multiple disks, if the driver uses the naive approach of overwriting the sector on each disk without storing information elsewhere, a sector that was in a transitory state (old state on some disks, new state on others) cannot be recovered at all. The error can possibly be detected (if you're lucky: the parity could match by accident), but it cannot be corrected.
  • psusi
    psusi almost 13 years
    @Gilles generally speaking, yes, disks tend to either completely write the sector, or not, but filesystems can not, and do not rely on this behavior. The fs must assume that if the write did not complete successfully, the contents of the sector are completely lost, and so it recovers using the journal. In the event that a disk fails during the power outage, and that disk contained data ( not parity ) on the in flight stripe, then the data would be regenerated randomly from the out of date parity. The fs journal recovery will handle this though.
  • user773568
    user773568 almost 13 years
    @Marcin Well, when you say reliability, is that safety or availability? I did not expect that losing my journal threatens the overall data. Does it? Anyway, I plan to put the OS and swap on a smaller SSD for noise and power-saving reasons. The RAID can go to sleep that way.
  • Alen Milakovic
    Alen Milakovic almost 13 years
    I use mdadm software raid. A reasonable related question is - are there settings I can use that will make sw raid safer? Also, how about a good link for "write-intent bitmap"? According to this there are performance issues with enabling it.
  • Nils
    Nils over 12 years
    Right - I deleted my comment. But a RAID without a BBU should write through, shouldn't it? This is at least what the PERC controllers do when the battery is in a learn cycle and falls below the threshold.
  • Warren Young
    Warren Young over 12 years
    Yes, without a BBU or with a dead BBU, the RAID card still writes data. What it doesn't do is remember what was in the write buffer when power fails to the server. Since RAID depends on consistency among the redundant bits, it gets confused when it becomes inconsistent. Therefore, a power failure during RAID writes risks corrupting something on the RAID, because the controller is forced to pick one of the two-or-more copies of the data, not knowing which is correct.
  • psusi
    psusi over 12 years
    If power fails in the middle of a write, then you get a stripe that is out of sync, not corrupted. An out-of-sync stripe just means that the parity is not up to date, so when the array is mounted, the parity must be updated. Also, those "raid" controllers that can be had for peanuts are often fakeraid; they have BIOS ROM extensions and Windows drivers that do the RAID in software.
  • ConcernedOfTunbridgeWells
    ConcernedOfTunbridgeWells over 12 years
    @psusi - Most of the ASR-2200s controllers I bought a few years ago were under 100 USD and they're pukka HW RAID controllers. I don't think Adaptec actually make fakeraid controllers at all. You can quite readily get 4 or 8 port Adaptec, 3Ware or LSI SAS RAID controllers off ebay for a few hundred dollars.
  • ConcernedOfTunbridgeWells
    ConcernedOfTunbridgeWells over 12 years
    @psusi - You appear to make the assumption that a journalled file system is in use. There are applications (e.g. databases, which do their own journalling) where you might not want to use a journalled file system.
  • psusi
    psusi over 12 years
    @ConcernedOfTunbridgeWells, whether the journaling is done by the filesystem or a database makes no difference; the point is that it needs doing just the same whether you are using a single disk or a raid.
  • psusi
    psusi over 12 years
    I wouldn't call a few hundred bucks for a used product from an unknown source "pennies"; that indicates more along the lines of $50-$100 for a new product. Devices in that class are usually fakeraid.
  • ConcernedOfTunbridgeWells
    ConcernedOfTunbridgeWells over 12 years
    @psusi - It seems that the assertion is that SW RAID is as reliable as a system with a BBU because the journal will ensure consistency. That's technically correct if you have a journalling mechanism, but it also implies write-through caching. Write-back caching - even with a small cache like 64MB - gives quite a substantial performance win unless your application is specifically designed to use write-through I/O, which it probably isn't. I think the statement of equivalence is a bit over-reaching in practice.
  • ConcernedOfTunbridgeWells
    ConcernedOfTunbridgeWells over 12 years
    @psusi - You are trying to rebuff an argument that I never made; I never used the word 'pennies' at all. Please do not resort to straw man arguments - the examples I used are not fakeraid controllers.
  • ConcernedOfTunbridgeWells
    ConcernedOfTunbridgeWells over 12 years
    MLC SSDs have a track record of reliability issues. SLC units are much more reliable, but also much more expensive. A white paper about SSD reliability can be found here
  • psusi
    psusi over 12 years
    @ConcernedOfTunbridgeWells, no, it does not assume write-through caching. Just as they do with a single disk, filesystems make use of the write-back cache and then explicitly flush at key points to ensure consistency. In other words, you write several MB to the journal where it may sit in the write-back cache for a while, but when you later get around to writing the data to the main area of the disk, you flush the writes to the journal first.
  • ConcernedOfTunbridgeWells
    ConcernedOfTunbridgeWells over 12 years
    @psusi - in that case it has a failure mode that a HW RAID controller is not susceptible to and therefore it is not as reliable as a HW RAID controller with a BB cache. If you lose power on a host with SW RAID then cached data will be lost. The BBU on the controller will keep the cache powered so it can retain cached data if the host loses power.
  • psusi
    psusi over 12 years
    @ConcernedOfTunbridgeWells, if you lose power on a single disk, the cache is lost as well; that is why the filesystem uses a journal, which handles the problem just as well on a SW RAID.
  • ConcernedOfTunbridgeWells
    ConcernedOfTunbridgeWells over 12 years
    @psusi - OK, so we all understand that file systems periodically checkpoint and flush buffers. Flushing more than one disk block is not an atomic operation. If the data is cached in volatile memory then it is still possible to lose data and consistency in a power failure. If the cache is powered you can flush to the HW cache on the controller (much quicker than directly to disk) and have it resume the flush when the system comes back up. Now, perhaps you're right and I'm wrong, but if you want to back up your position it might be better to explain it in more detail in the posting.
  • psusi
    psusi over 12 years
    @ConcernedOfTunbridgeWells, I'm not sure why you aren't getting this. Either the write completed or it did not. It makes no difference whether the write is to a single disk, or involves multiple disks -- either it completed ( in its entirety ) or it did not. The journal ensures that incomplete writes are either rolled back or completed when the system comes back up. The journal does not require any kind of atomicity to do this either.
  • ConcernedOfTunbridgeWells
    ConcernedOfTunbridgeWells over 12 years
    @psusi - you are technically correct except for your statement that write-through caching need not be assumed. A journal needs to guarantee the order of writes so it will typically force I/O completion (i.e. write-through strategy on the journal). Nobody's disputing that a journalling mechanism will ensure consistent writes. Given that this question is about mdadm, I guess we can assume that the entire world is a modern linux distro with a journalled file system, so your argument is certainly valid in practice, at least for the purposes of the question.
  • Alexandr Priymak
    Alexandr Priymak almost 11 years
    How does SW RAID know which blocks are out of date? Or does it just recompute all parity blocks after every reboot (that sounds strange)?
  • psusi
    psusi almost 11 years
    @AlexandrPriymak, if you enable the write-intent bitmap, then it keeps track of which parts of the disk may be out of date (by flagging them in the write-intent bitmap before actually writing to them), so only those areas need to be recomputed. Otherwise, the whole array has its parity blocks recomputed after a crash.
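
For reference, the resynchronization and parity-check machinery discussed in the comments above can also be driven by hand through sysfs. A minimal sketch, assuming the array is /dev/md0 (the device name is a placeholder):

    # Ask md to verify parity across the whole array and report mismatches
    echo check > /sys/block/md0/md/sync_action

    # Watch progress
    cat /proc/mdstat

    # Count of mismatches found by the last check
    cat /sys/block/md0/md/mismatch_cnt

    # Rewrite parity where it is inconsistent
    echo repair > /sys/block/md0/md/sync_action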