When is fsck dangerous?

linux filesystems mount fsck data-consistency

11,615

Solution 1

fsck definitely causes more harm than good if the underlying hardware is damaged in some way: a bad CPU, bad RAM, a dying hard drive, a failed disk controller. In those cases, further corruption is inevitable.

If in doubt, it's a good idea to take an image of the corrupted disk with dd_rescue or a similar tool, and then see whether you can successfully repair that image. That way you still have the original setup available.
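
As a rough sketch of that image-first workflow: the snippet below uses plain `dd` as a stand-in for dd_rescue, and `e2fsck -n` on the assumption that the filesystem is ext2/3/4. The function names, the device name and the image path are all hypothetical placeholders, so adapt them to your setup.

```shell
#!/bin/sh
# Sketch only: image a suspect disk first, then check the copy.
# Assumptions: plain dd instead of dd_rescue, and an ext2/3/4 filesystem.

# make_image SRC IMG: copy SRC to IMG in 64 KiB blocks.
# conv=noerror keeps going after read errors; conv=sync pads short or
# failed reads with zeros so later data stays at the right offset
# (the image may end up slightly longer than the source).
make_image() {
    src=$1
    img=$2
    dd if="$src" of="$img" bs=65536 conv=noerror,sync
}

# check_image IMG: read-only check of the copy.
# -f forces a check even if the filesystem looks clean;
# -n opens it read-only and answers "no" to every repair question.
check_image() {
    e2fsck -f -n "$1"
}

# Hypothetical usage (placeholders, not real paths):
#   make_image /dev/sdX1 /mnt/backup/sdX1.img
#   check_image /mnt/backup/sdX1.img
```

Only once the read-only check looks sane would you try an actual repair, and then on a second copy of the image rather than on the original disk.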

Solution 2

You have seen one example where fsck worked, but I've seen more than enough damaged file systems where it did not work at all. If it ran fully automatically, you would have no chance to take a dd dump of the disk first, which in many cases is an excellent idea before attempting a repair.

It's never, ever a good idea to attempt a repair like that automatically.

Oh, and modern servers should have remote consoles, or at least independent rescue systems, so that you can recover from something like this without lugging a KVM rack to the server.

scristalli

Updated on September 18, 2022

Comments

scristalli over 1 year
Recently I've seen the root filesystem of a machine in a remote datacenter get remounted read-only, as a result of consistency issues.

On reboot, this error was shown:
```
UNEXPECTED INCONSISTENCY: RUN fsck MANUALLY (i.e., without -a or -p options)
```
After running fsck as suggested, and accepting the corrections manually with Y, the errors were corrected and the system is now fine.

Now, I think it would be interesting if fsck were configured to run and repair everything automatically, since the only alternative in some cases (like this one) is going to the remote datacenter in person and attaching a console to the affected machine.

My question is: why does fsck ask for manual intervention by default? How and when would a correction performed by such a program be unsafe? In which cases might the sysadmin want to set a suggested correction aside for some time (to perform some other operations first) or abort it altogether?
jorfus almost 8 years

I've worked a lot with failing hardware and I agree with this. The last thing I want to do is fsck if there's suspected bad hardware of any sort. I've also seen a low power event and subsequent recovery which was greatly delayed by automatic fsck.
Eric Towers almost 8 years

To give a concrete example: I have worked on a machine with a disk controller that "randomly" (about 1 time in 10^5) would turn a read or a write to block XXXXXXYY on any device into a write to block 000000YY on the first device. I.e., it frequently blasted wrong data, both structured and unstructured, onto the boot sector and various critical filesystem structures of the boot disk. Running fsck in such a situation (millions of reads) can eliminate any remaining chance of recovering data.
Nelson almost 8 years

1 in 10^5 is a lot... that's 10 bytes every MB.
Eric Towers almost 8 years

@Nelson : It sort of is... The unit there is "single block transfers", not "bytes". So ten bad block writes per million blocks (and blocks are significantly larger than bytes).
TOOGAM almost 8 years

Actually, what's not a good idea is to say "never, ever" like that, when it isn't true. A use case where it is a good idea: the server's main partitions can be re-created from scratch rather quickly in case of a problem, and the actually important data gets accessed via a remote filesystem, with appropriate redundancy in place for that data. I'd much rather take the chance of fsck -p / and fsck -p /var, etc., working fine and getting the server up without manual intervention, and risk the small, non-zero % chance of major catastrophe to those partitions, which I can just re-create if needed.
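
For what it's worth, here is a minimal sketch of how such an unattended preen could decide whether to carry on, based on the exit status bits documented in fsck(8) (1 = errors corrected, 2 = reboot needed, 4 and above = errors left uncorrected or the check itself failed). The function name `fsck_verdict` and the device name are made up for illustration.

```shell
#!/bin/sh
# Sketch only: turn an fsck exit status into a coarse verdict.
# fsck_verdict is a hypothetical helper, not part of any fsck package.
fsck_verdict() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "clean"
    elif [ $((status & ~3)) -ne 0 ]; then
        # Any bit other than 1 or 2: uncorrected errors, operational
        # error, usage error, cancellation or library error.
        echo "needs-manual-attention"
    elif [ $((status & 2)) -ne 0 ]; then
        echo "repaired-reboot-required"
    else
        echo "repaired"
    fi
}

# Hypothetical usage (placeholder device):
#   fsck -p /dev/sdX1
#   fsck_verdict $?
```

A boot script following this idea would continue on "clean" or "repaired", and stop for a human (or trigger the re-create-from-scratch path) on "needs-manual-attention".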
FooBee almost 8 years

If the system can be easily reinstalled, I just do that ...
TOOGAM almost 8 years

That would take longer. Options are: A) Risk doing it automatically. B) Have someone tell fsck to preen, and then everything works fine. Takes about 2 minutes, if that. Downtime until this happens. C) Have someone re-install the operating system. Takes 30+ minutes. You're choosing option C? Maybe a key difference between us is that I've had fsck work a greater percentage of the time than what you quote in your answer. My main point wasn't the system design (this cheap-o system doesn't use a remote console), but just that saying "never, ever" was too strong a phrase to be accurate.
FooBee almost 8 years

Let's just agree to disagree.