How to monitor BTRFS filesystem raid for errors?

Solution 1

There doesn't appear to be a daemon or utility that officially reports BTRFS events for user handling. The closest alternative is to monitor the system log for messages from BTRFS and react accordingly.

http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html

The above link explains how to configure a general-purpose log-monitoring script (the sec package on Debian, or SEC upstream) to act on unexpected log messages concerning BTRFS. It also depends on a regularly scheduled scrub of the filesystem, which checks for bit-rot and emits log entries as a preemptive measure. Below is an excerpt specific to the SEC script:

How to configure sec, event correlator to report btrfs filesystem errors or warnings

After installing sec.pl (apt-get install sec on Debian, or from http://simple-evcorr.sourceforge.net/), install the two config files below.

This is not foolproof: it relies on a regex of known messages that are OK and reports all unknown ones. You can extend the negative-lookahead regex as needed.

polgara:~# cat /etc/default/sec  
#Defaults for sec  
RUN_DAEMON="yes"  
DAEMON_ARGS="-conf=/etc/sec.conf -input=/var/log/syslog -pid=/var/run/sec.pid -detach -log=/var/log/sec.log"

polgara:~# cat /etc/sec.conf  
# http://simple-evcorr.sourceforge.net/man.html  
# http://sixshooter.v6.thrupoint.net/SEC-examples/article.html  
# http://sixshooter.v6.thrupoint.net/SEC-examples/article-part2.html  
type=SingleWithSuppress  
ptype=RegExp  
pattern=(?i)kernel.*btrfs: (?!disk space caching is enabled|use ssd allocation|use .* compression|unlinked .* orphans|turning on discard|device label .* devid .* transid|detected SSD devices, enabling SSD mode|has skinny extents|device label|creating UUID tree|checking UUID tree|setting .* feature flag|bdev.* flush 0, corrupt 0, gen 0)  
window=60  
desc=Btrfs unexpected log  
action=pipe '%t: $0' /usr/bin/mail -s "sec: %s" root
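On systemd-based systems, a similar effect can be achieved without SEC by periodically scanning the kernel journal. The following is a minimal sketch under that assumption; the exclusion list and the mail recipient are examples and would need to be extended to match the harmless messages your kernel emits:

```shell
#!/bin/sh
# Sketch: mail root any BTRFS kernel message from the last hour that is
# not on the known-harmless list. Assumes a systemd journal and a
# working local mailer; run it hourly from cron.
OK='disk space caching|use ssd|has skinny extents|checking UUID tree'
msgs=$(journalctl -k --since '-1 hour' 2>/dev/null \
    | grep -i 'btrfs' | grep -ivE "$OK")
if [ -n "$msgs" ]; then
    printf '%s\n' "$msgs" | mail -s 'btrfs: unexpected log messages' root || :
fi
```

Like the SEC rule above, this whitelists known-good messages and alerts on everything else, so new harmless messages will produce false positives until added to the list.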

Solution 2

In addition to the regular logging system, BTRFS does have a stats command, which keeps track of errors (including read, write and corruption/checksum errors) per drive:

# btrfs device stats /
[/dev/mapper/luks-123].write_io_errs   0
[/dev/mapper/luks-123].read_io_errs    0
[/dev/mapper/luks-123].flush_io_errs   0
[/dev/mapper/luks-123].corruption_errs 0
[/dev/mapper/luks-123].generation_errs 0

So you could create a simple root cronjob:

MAILTO=admin@example.com
@hourly /sbin/btrfs device stats /data | grep -vE ' 0$'

This will check for positive error counts every hour and send you an email. Obviously, you would test such a scenario (for example by causing corruption or removing the grep) to verify that the email notification works.
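Building on that, the grep can be wrapped in a small script whose exit status reflects whether any counter is positive, so the same check can feed both cron mail and monitoring tools that act on exit codes. This is a sketch; the function name and the /data path are illustrative:

```shell
#!/bin/sh
# Sketch: print only the non-zero BTRFS error counters for the given
# mount point and return 1 if any were found. Cron mails whatever is
# printed; monitoring tools can key off the return value.
check_btrfs_stats() {
    errs=$(/sbin/btrfs device stats "$1" 2>/dev/null | grep -vE ' 0$')
    [ -z "$errs" ] && return 0
    printf '%s\n' "$errs"
    return 1
}

check_btrfs_stats /data
```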

In addition, with advanced filesystems like BTRFS (that have checksumming) it's often recommended to schedule a scrub every couple of weeks to detect silent corruption caused by a bad drive.

@monthly /sbin/btrfs scrub start -Bq /data

The -B option will keep the scrub in the foreground, so that you will see the results in the email cron sends you. Otherwise, it'll run in the background and you would have to remember to check the results manually as they would not be in the email.

Update: Improved grep as suggested by Michael Kjörling, thanks.

Update 2: Additional notes on scrubbing vs. regular read operations (this doesn't apply only to BTRFS):

As pointed out by Ioan, a scrub can take many hours, depending on the size and type of the array (and other factors) - in some cases more than a day. It is also an active, point-in-time scan: it won't detect future errors; the goal of a scrub is to find and fix the errors present on your drives at that moment. As with other RAID systems, periodic scrubs are recommended.

It's true that a typical I/O operation, like reading a file, does check whether the data that was read is actually correct. But consider a simple mirror: if the first copy of a file is damaged, perhaps by a drive that's about to die, but BTRFS actually reads the second, correct copy, then BTRFS won't know that there is corruption on the other drive. The requested data has been received and matches the checksum BTRFS has stored for this file, so there's no need for BTRFS to read the other copy. This means that even if you specifically read a file you know is corrupted on one drive, there is no guarantee that this read operation will detect the corruption.

Now assume that BTRFS only ever reads from the good drive, no scrub is run that would detect the damage on the bad drive, and then the good drive goes bad as well - the result would be data loss (at least BTRFS would know which files are still correct and would still allow you to read those). Of course, this is a simplified example; in reality, BTRFS won't always read from one drive and ignore the other. But the point is that periodic scrubs are important because they find (and fix) errors that regular read operations won't necessarily detect.

Faulted drives: Since this question is quite popular, I'd like to point out that this "monitoring solution" is for detecting problems with possibly bad drives (e.g., dying drive causing errors but still accessible).

On the other hand, if a drive is suddenly gone (disconnected or completely dead rather than dying and producing errors), it would be a faulted drive (ZFS would mark such a drive as FAULTED). Unfortunately, BTRFS may not realize that a drive is gone while the filesystem is mounted, as pointed out in this mailing list entry from 09/2015 (it's possible that this has been patched):

The difference is that we have code to detect a device not being present at mount, we don't have code (yet) to detect it dropping on a mounted filesystem. Why having proper detection for a device disappearing does not appear to be a priority, I have no idea, but that is a separate issue from mount behavior.

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg46598.html

By that time there would be tons of error messages in dmesg, so grepping dmesg might not be reliable.
For a server using BTRFS, it might be worth having a custom check (cron job) that sends an alert if at least one of the drives in the RAID array is gone, i.e., no longer accessible.
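Such a check could be sketched as follows. The assumptions here are a two-device RAID1 mounted at /data, and that a vanished device shows up either as "missing" in the output of btrfs filesystem show or as a reduced devid count:

```shell
#!/bin/sh
# Sketch: alert if the array at /data reports a missing device or
# fewer devices than expected. EXPECTED=2 and the /data path are
# assumptions for a two-drive RAID1 array.
EXPECTED=2
if command -v btrfs >/dev/null 2>&1; then
    show=$(btrfs filesystem show /data 2>/dev/null)
    found=$(printf '%s\n' "$show" | grep -c 'devid')
    if printf '%s\n' "$show" | grep -qi 'missing' || [ "$found" -lt "$EXPECTED" ]; then
        printf '%s\n' "$show" | mail -s 'btrfs: device problem on /data' root || :
    fi
fi
```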

Solution 3

As of btrfs-progs v4.11.1, stats has a --check option that makes it return a non-zero exit status if any of the values are not zero, removing the need for the grep.

# btrfs device stats --check /
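A cron entry using that flag could look like the sketch below (requires btrfs-progs >= 4.11.1; /data is an example path). Since --check still prints the counters, the output is captured and only emitted when the exit status is non-zero, so cron only sends mail on errors:

```shell
# root crontab sketch: mail the counters only when --check fails
@hourly out=$(/sbin/btrfs device stats --check /data) || printf '%s\n' "$out"
```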

Solution 4

I would not rely on the stats command for error notification, because this command returns no error if a drive suddenly goes away. You can test it by disconnecting a sata cable or pulling a drive - not recommended with an important file system.

btrfs device stats /

After a reboot, btrfs shows missing drive(s), but that may be too late.

btrfs fi show

Solution 5

Sounds like a task for system monitoring. There is a check implementing the Nagios plugin API called check_btrfs. As you can see in the source code, it has a function called check_dev_stats which reads the device stats and goes critical if any of the values are non-zero. It also checks for allocation issues. What remains unclear is how the check behaves if one disk is absent or goes offline.

PS: The plugin is packaged in Debian: monitoring-plugins-btrfs

Author: Ioan

Updated on September 18, 2022

Comments

  • Ioan
    Ioan almost 2 years

    I saw some documentation on a daemon that can execute a program/script for various BTRFS events, but I cannot find it anymore.

    How can I have a script/program be executed on a drive failure for a BTRFS raid1 array? I would like to run a script on any error to act as an early warning for a potentially failing drive, but the actual drive failure is most important. I would like to unmount the filesystem at that point (if that's not what BTRFS does anyway) and set an alarm.

    • Ioan
      Ioan about 9 years
I had a RAID5 lose two drives within a short time of each other. I was looking to set up a new system using BTRFS' mirror raid capability and quickly react to drive problems (not necessarily drive failure) to reduce the chance of further damage while providing time to deal with the original cause. I'm hoping BTRFS' N-way mirror will someday work well.
    • basic6
      basic6 over 8 years
@Ioan: This is why RAID-5 is not always recommended and RAID-6 should be used instead. A resilver puts a lot of stress on all drives, which can cause a second drive that may be about to go bad to fail during the process. Unlike RAID-5, RAID-6 can handle that (you could also remount it read-only and update your backup before replacing the second drive).
  • user
    user over 8 years
    Wouldn't something like grep -vE ' 0$' be better?
  • basic6
    basic6 over 8 years
    @MichaelKjörling: Good idea, I've updated my answer, thank you!
  • Ioan
    Ioan over 8 years
    This is a nice idea, and I do it as a regular integrity check. However, it can take much longer than an hour to checksum all the data. Not to mention the wear on hardware if running it continuously to pick up on the errors. BTRFS does checksumming of all normal filesystem operations and that would be a more efficient way to immediately react to them.
  • basic6
    basic6 over 8 years
@Ioan: You are correct, a scrub can run for many hours, so it obviously puts a lot of stress on the drives. But it's done to detect silent corruption, so you can replace a bad drive before another one goes bad too. Silent corruption doesn't happen during normal fs operations, so you won't be informed automatically.
  • Ioan
    Ioan over 8 years
    @basic6: Absolutely, and this is great for that. However, it does nothing for detecting errors during normal operation, such as a degraded BTRFS array, until the next scrub. Silent corruption can be dealt with using a monthly scrub for efficiency, but that's too long for other errors.
  • basic6
    basic6 over 8 years
@Ioan: Again, you are correct. I have added a hypothetical example to show the difference and highlight the importance of scrubbing.