Is bit rot on hard drives a real problem? What can be done about it?


Solution 1

First off: your file system may not have checksums, but your hard drive itself has them: every sector is stored with error-correcting codes, and S.M.A.R.T. exposes the resulting error statistics. Once one bit too many gets flipped, the error can no longer be corrected, of course. And if you're really unlucky, bits can change in such a way that the checksum still comes out valid; then the error won't even be detected. So nasty things can happen; but the claim that a random bit flip will instantly corrupt your data is bogus.

However, yes, when you put trillions of bits on a hard drive, they won't stay like that forever; that's a real problem! ZFS can do integrity checking every time data is read; this is similar to what your hard drive already does itself, but it's another safeguard for which you're sacrificing some space, so you're increasing resilience against data corruption.

If your file system is good enough, the probability of an error going undetected becomes so low that you no longer have to care about it, and you might decide that checksums built into the data storage format you're using are unnecessary.

Either way: no, it's not impossible to detect.
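
If your file system doesn't do this for you, the same idea can be approximated by hand. Below is a minimal sketch in Python (the scan root and the checksums.json file name are made-up examples): it records a SHA-256 hash per file and, on a later run, flags any file whose contents changed while its modification time did not, which is the typical signature of silent corruption rather than a deliberate edit.

    import hashlib
    import json
    import os

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def scrub(root, db_path="checksums.json"):
        # Load the hashes recorded by the previous run, if any.
        old = json.load(open(db_path)) if os.path.exists(db_path) else {}
        new = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                st = os.stat(p)
                new[p] = {"sha256": sha256_of(p), "mtime": st.st_mtime}
                prev = old.get(p)
                # Same mtime but a different hash: nobody wrote the file,
                # yet its bits changed -- a candidate for bit rot.
                if prev and prev["mtime"] == st.st_mtime and prev["sha256"] != new[p]["sha256"]:
                    print("possible silent corruption:", p)
        with open(db_path, "w") as f:
            json.dump(new, f)

    # scrub("/srv/files")   # example invocation

This is essentially what a ZFS scrub does continuously at the block level, with the added advantage that ZFS can repair from a redundant copy instead of just reporting the mismatch.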

But a file system, by itself, can never be a guarantee that every failure can be recovered from; it's not a silver bullet. You still must have backups and a plan/algorithm for what to do when an error has been detected.

Solution 2

Yes, it is a problem, mainly as drive sizes go up. Most SATA drives have a URE (uncorrectable read error) rate of 1 in 10^14 bits read. In other words, for roughly every 12 TB of data read, the drive vendor says the drive will statistically return a read failure (you can normally look the rates up on the drive spec sheets). The drive will continue to work just fine for all other parts of the drive. Enterprise FC and SCSI drives generally have a URE rate of 1 in 10^15 bits (about 120 TB), as do a small number of SATA drives, which helps reduce the issue.
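
To put numbers on that figure, here is a quick back-of-the-envelope calculation in Python (a sketch: the 1-in-10^14 and 1-in-10^15 figures are the spec-sheet rates quoted above, and treating bit errors as independent is a simplification, since in practice UREs tend to come in batches):

    import math

    BITS_PER_TB = 8 * 10**12   # bits in a decimal terabyte

    def expected_ures(tb_read, bits_per_error):
        """Expected number of unrecoverable read errors while reading tb_read TB."""
        return tb_read * BITS_PER_TB / bits_per_error

    def p_at_least_one_ure(tb_read, bits_per_error):
        """Probability of at least one URE (Poisson approximation)."""
        return 1.0 - math.exp(-expected_ures(tb_read, bits_per_error))

    for bits_per_error, label in [(10**14, "consumer SATA (1 in 10^14)"),
                                  (10**15, "enterprise    (1 in 10^15)")]:
        print(f"{label}: 12 TB read -> "
              f"{expected_ures(12, bits_per_error):.2f} expected UREs, "
              f"P(>=1) ~ {p_at_least_one_ure(12, bits_per_error):.0%}")

Reading 12 TB against the consumer rate gives roughly 0.96 expected errors and about a 60% chance of hitting at least one; the enterprise rate brings that down to under 10%.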

I've never seen two disks stop rotating at the exact same time, but I have had a RAID 5 volume hit this issue (5 years ago with 5400 RPM consumer PATA drives). A drive fails, it's marked dead, and a rebuild onto the spare drive begins. The problem is that during the rebuild a second drive is unable to read that one little block of data. Depending on who's doing the RAID, the entire volume might be dead or just that little block may be dead. Assuming only that one block is dead: if you try to read it you'll get an error, but if you write to it the drive will remap it to another location.

There are multiple methods to protect against this:

- RAID 6 (or equivalent), which protects against a double disk failure, is best.
- A URE-aware file system such as ZFS helps as well.
- Using smaller RAID groups lowers the statistical chance of hitting the URE limits during a rebuild (mirror large drives, or use RAID 5 with smaller drives); see the sketch below.
- Disk scrubbing and SMART also help, but they are not really a protection in themselves; use them in addition to one of the methods above.
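
Extending the same arithmetic to the smaller-RAID-groups point: during a RAID 5 rebuild every surviving drive in the group has to be read end to end, so the wider the group, the more bits you read and the higher the chance of tripping over a URE before the rebuild finishes. A rough sketch (the 4 TB drive size and the group widths are made-up examples, and real-world URE rates are usually better than the spec sheet):

    import math

    BITS_PER_TB = 8 * 10**12

    def p_rebuild_hits_ure(drives_in_group, drive_tb, bits_per_error):
        # A RAID 5 rebuild re-reads all surviving members in full.
        bits_read = (drives_in_group - 1) * drive_tb * BITS_PER_TB
        return 1.0 - math.exp(-bits_read / bits_per_error)

    for width in (4, 8, 12):
        p = p_rebuild_hits_ure(width, drive_tb=4, bits_per_error=10**14)
        print(f"{width}-drive RAID 5 of 4 TB disks: P(URE during rebuild) ~ {p:.0%}")

By this math a wide group is almost guaranteed to hit a URE during a rebuild at the 1-in-10^14 rate, which is exactly why a second parity (RAID 6) or a URE-aware file system matters.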

I manage close to 3,000 spindles in arrays, and the arrays are constantly scrubbing the drives looking for latent UREs. I receive a fairly constant stream of them (every time a scrub finds one, it fixes it ahead of a drive failure and alerts me). If I were using RAID 5 instead of RAID 6 and one of the drives went completely dead... I'd be in trouble if the URE hit certain locations.

Solution 3

Hard drives do not generally encode data bits as single magnetic domains -- hard drive manufacturers have always been aware that magnetic domains can flip, and they build error detection and correction into their drives.

If a bit flips, the drive contains enough redundant data that it can and will be corrected the next time that sector is read. You can see this if you check the SMART stats on the drive, as the 'Correctable error rate'.
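
If you want to poke at those counters yourself, smartmontools exposes them. A small sketch (assumptions: smartmontools is installed, /dev/sda is the disk you care about, and you have permission to query it; the exact attribute names vary by vendor, so the filter below is only a starting point):

    import subprocess

    # "-A" prints the vendor-specific SMART attribute table.
    out = subprocess.run(
        ["smartctl", "-A", "/dev/sda"],
        capture_output=True, text=True, check=False,
    ).stdout

    for line in out.splitlines():
        # Typical names include Raw_Read_Error_Rate, Hardware_ECC_Recovered
        # and Reported_Uncorrect, but they differ between manufacturers.
        if any(key in line for key in ("Error", "ECC", "Uncorrect")):
            print(line)

The raw values are often encoded in vendor-specific ways, so treat them as trends rather than absolute counts.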

Depending on the details of the drive, it should even be able to recover from more than one flipped bit in a sector. There will be a limit to the number of flipped bits that can be silently corrected, and probably another limit to the number of flipped bits that can be detected as an error (even if there is no longer enough reliable data to correct them).
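
To make the correct-versus-detect distinction concrete, here is a toy SECDED (single-error-correcting, double-error-detecting) Hamming code for a 4-bit value, in Python. Real drives use far stronger sector-level codes, so this is purely illustrative of the trade-off, not of what the drive actually stores:

    def encode(nibble):
        """Encode 4 data bits (0-15) into an 8-bit SECDED codeword."""
        d = [(nibble >> i) & 1 for i in range(4)]       # d0..d3
        p1 = d[0] ^ d[1] ^ d[3]                         # covers positions 1,3,5,7
        p2 = d[0] ^ d[2] ^ d[3]                         # covers positions 2,3,6,7
        p3 = d[1] ^ d[2] ^ d[3]                         # covers positions 4,5,6,7
        bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]     # codeword positions 1..7
        p0 = 0
        for b in bits:                                  # overall parity bit (position 0)
            p0 ^= b
        return [p0] + bits

    def decode(code):
        """Return (nibble, status): 'ok', 'corrected' or 'uncorrectable'."""
        bits = list(code)
        syndrome = 0
        for pos in range(1, 8):                         # XOR the positions holding a 1
            if bits[pos]:
                syndrome ^= pos
        overall = 0
        for b in bits:
            overall ^= b
        if syndrome and overall:                        # one flipped bit: fix it
            bits[syndrome] ^= 1
            status = "corrected"
        elif syndrome:                                  # two flipped bits: detected only
            return None, "uncorrectable"
        elif overall:                                   # the extra parity bit itself flipped
            status = "corrected"
        else:
            status = "ok"
        data = [bits[3], bits[5], bits[6], bits[7]]
        return sum(b << i for i, b in enumerate(data)), status

    word = encode(0b1011)
    word[5] ^= 1                  # one flip    -> (11, 'corrected')
    print(decode(word))
    word[2] ^= 1                  # second flip -> (None, 'uncorrectable')
    print(decode(word))

Three or more flips in the same codeword can be miscorrected or slip through entirely, which is the same kind of limit, at toy scale, as described above.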

This all adds up to the fact that hard drives can automatically correct most errors as they happen, and can reliably detect most of the rest. You would have to have a large number of bit errors in a single sector, all occurring before that sector was next read, and the errors would have to be such that the internal error-detection codes see the result as valid data, before you would ever get a silent failure. It's not impossible, and I'm sure that companies operating very large data centres do see it happen (or rather, it occurs and they don't see it happen), but it's certainly not as big a problem as you might think.

Solution 4

Modern hard drives (since 199x) have not only checksums but also ECC, which can detect and correct quite a bit of "random" bit rot. See: http://en.wikipedia.org/wiki/S.M.A.R.T.

On the other hand, certain bugs in firmware and device drivers can also corrupt data on rare occasions (otherwise QA would catch the bugs), which would be hard to detect if you don't have higher-level checksums. Early device drivers for SATA and NICs corrupted data on both Linux and Solaris.

ZFS checksums mostly aim at bugs in lower-level software. Newer storage/database systems like Hypertable also have checksums for every update to guard against bugs in filesystems :)
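
The application-level version of that guard is simple: checksum each record before it goes down through the driver/filesystem stack, and verify it on the way back up. A minimal sketch (nothing here is the actual Hypertable or ZFS on-disk format; it just illustrates the idea):

    import struct
    import zlib

    def pack_record(payload: bytes) -> bytes:
        """Prefix a record with its length and CRC32 before writing it out."""
        return struct.pack("<II", len(payload), zlib.crc32(payload)) + payload

    def unpack_record(blob: bytes) -> bytes:
        """Verify the CRC32 before trusting anything the lower layers returned."""
        length, crc = struct.unpack_from("<II", blob)
        payload = blob[8:8 + length]
        if zlib.crc32(payload) != crc:
            raise IOError("checksum mismatch: record corrupted below the application")
        return payload

    rec = pack_record(b"user=42 balance=100")
    assert unpack_record(rec) == b"user=42 balance=100"

Corruption introduced anywhere between pack and unpack (driver, controller, filesystem, cache) then shows up as a checksum mismatch instead of silently wrong data.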

Solution 5

Theoretically, this is cause for concern. Practically speaking, this is part of the reason that we keep child/parent/grandparent backups. Annual backups need to be kept for at least 5 years, IMO, and if you've got a case of this going back farther than that, the file is obviously not that important.

Unless you're dealing with bits that could potentially liquify someone's brain, I'm not sure the risk vs. reward is quite up to the point of changing file systems.

Comments

  • scobi
    scobi over 1 year

    A friend is talking with me about the problem of bit rot - bits on drives randomly flipping, corrupting data. Incredibly rare, but with enough time it could be a problem, and it's impossible to detect.

    The drive wouldn't consider it to be a bad sector, and backups would just think the file has changed. There's no checksum involved to validate integrity. Even in a RAID setup, the difference would be detected but there would be no way to know which mirror copy is correct.

    Is this a real problem? And if so, what can be done about it? My friend is recommending zfs as a solution, but I can't imagine flattening our file servers at work, putting on Solaris and zfs..

  • scobi
    scobi over 14 years
    I don't see how child/parent/grandparent backups help. There's no way to know with that system if a bit is flipped because a user intended to change it or if the drive did it on its own. Not without a checksum of some kind.
  • Amok
    Amok over 14 years
    Having multiple backups won't help if you don't know that the data in them is good. You can manually checksum your files, but ZFS does so much more automatically and makes filesystem management easy.
  • Kara Marfia
    Kara Marfia over 14 years
    Having backups that go back farther than a week/month increases your chance of having a good copy of the file. I probably could've been clearer about that.
  • scobi
    scobi over 14 years
    The problem is: how do you know you have a bad copy? And how do you know which copy that is backed up is the good one? In an automated way.
  • Kara Marfia
    Kara Marfia over 14 years
    I've seen maybe one file every few years fall to corruption that may be a result of bit rot, but I may be suffering from Small Fish Syndrome. I could understand talk of backups being useless, and I'll delete if it's offensive. It was time well spent reading the other answers, regardless. ;)
  • duffbeer703
    duffbeer703 over 14 years
    Be careful, data integrity is not a feature of all RAID systems.
  • scobi
    scobi over 14 years
    Oh, no reason to delete, this is a good discussion. The problem I've got is the "seen" in "seen maybe one file". It requires going back and validating by hand every file, in order to notice that something is wrong. The backups will definitely help if you do have the multi level backup, and you've noticed a bad file. You can go back until you find one that's good. I'm just concerned about these increasingly enormous data stores we're building having hidden demons that show up right when it's time to ship a product. That's when everything feels like it goes wrong.
  • scobi
    scobi over 14 years
    Ok, according to wikipedia (en.wikipedia.org/wiki/Error_detection_and_correction) modern hard drives use CRC's to detect errors and try to recover using compact disc style error recovery. That's good enough for me.
  • Amok
    Amok over 14 years
    RAID can't tell which data is good and which isn't so it can't fix errors, it can just detect them.
  • Matt Rogish
    Matt Rogish over 14 years
    Amuck: Not as part of the "RAID Standard", per se, but advanced RAID systems (firmwares, etc.) do that
  • Angiosperm
    Angiosperm over 14 years
    But if the CRC is stored in the same location (sector) as the data this won't help for all error cases. E.g. if there is a head positioning error data could be written to a wrong sector - but with a correct checksum => you wouldn't be able to detect the problem. That's why checksums in ZFS are stored separately from the data they protect.
  • Alex
    Alex over 9 years
    Actually, I regularly have bit-rot errors (in parts I don't read much), which the system silently recovers from (incorrectly). If at least it notified me there was bit-rot, I could re-read the data to recover it before it became unrecoverable; and if unrecoverable, I'd be able to compare it to the other hard drive.
  • Jay Sullivan
    Jay Sullivan over 9 years
    What units are you speaking in? "10^14" is not a "rate".
  • Jo Liss
    Jo Liss almost 9 years
    The unit would be e.g. "10^14 bits read per error", which equals 12 TB read per error.
  • user
    user over 8 years
    And of course, keeping in mind that the error rate is normally quoted in terms of full sector errors per bits read. So when a manufacturer states URE rates at 10^-14, what they really mean is that the probability of any random sector read hitting a URE is 10^-14 and if it does, then the whole sector comes back as unreadable. That and the fact that this is statistics; in the real world, UREs tend to come in batches.
  • Brian D.
    Brian D. almost 8 years
    @ Michael Dillion - RAID6 reliability does not increase as you increase the number of drives. For all data there is only the original data + 2 parity. Increasing drive number is worse for reliability as it increases the possible drive failure rate without increasing redundancy of any data. The only reason to increase drive numbers, is to increase your available storage size.
  • Brian D.
    Brian D. almost 8 years
    @ Amok - RAID6 has 2 parity locations, so data is represented in 3 locations. Every controller I have seen can easily detect and fix volume data integrity issues when using RAID6 due to this fact.
  • Brian D.
    Brian D. almost 8 years
    Alex, please check your HDD SMART data, and system RAM to verify there is not another issue causing the corruption. Bit rot/random corruption is extremely rare, so there may be something else going on with your machine.
  • TomTom
    TomTom over 7 years
    Does ZFS have a maintenance task like Windows has now? Something that basically rewrites the data regularly to refresh the magnetic coding.
  • Alex
    Alex over 7 years
    @BrianD. One issue was, I kept the hard drives inside their (insulated) packing material; this was causing hard drives to heat over 60°C while working, for days on end. Does that sound like a legitimate reason why bit rot might have occurred?
  • Jody Bruchon
    Jody Bruchon about 7 years
    Modern hard drives do not use CRCs, they use Hamming code which is very different. It's the same thing that ECC memory uses. One-bit flip errors can be corrected, two-bit flip errors can be detected but not corrected, three or more bits flipping and the data is actually damaged. In any case, there is no replacement for data backups. ZFS and other filesystems do not provide any better protection than the Hamming code on a drive's platters does. If the data is damaged then ZFS won't save you.
  • Erin Schoonover
    Erin Schoonover about 5 years
    @JodyLeeBruchon You got a source on Hamming code being used predominantly now? What info gathering I've been doing lately has indicated that drive makers are still using CRC-RS. 1 2
  • Jody Bruchon
    Jody Bruchon about 5 years
    @IanSchoonover No, and now that you mention it, I don't know where I got that info from anymore. It's been over two years since I wrote that. It is not quite correct, but I can no longer edit it to correct it.