Prevent data corruption on ext4/Linux drive on power loss

Solution 1

The write cache usually has nothing to do with the BIOS; most BIOSes offer no option for switching disk cache settings. On Linux, hdparm -W 0 /dev/sda (substituting your disk's device name) should disable the drive's write cache.

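The setting is persistent on many drives, so if you don't have hdparm available on your production systems, you should be able to disable the disk write cache on a different system and then replug the disk. Some drives revert to their default after a power cycle, so check the state with hdparm -W /dev/sda after a cold boot before relying on this.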
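As a rough sketch of that check, the helpers below disable the cache and report the resulting state. The device name /dev/sda comes from the question; the function names and the DISK variable are made up for illustration, and the parser assumes the usual " write-caching =  1 (on)" report line from hdparm.

```shell
#!/bin/sh
# Sketch only. DISK defaults to /dev/sda, the flash module named in the question.
DISK="${DISK:-/dev/sda}"

# Read `hdparm -W` output on stdin and print "on", "off" or "unknown".
# Assumes the usual report line, e.g. " write-caching =  1 (on)".
write_cache_state() {
    awk -F'=' '/write-caching/ {
        split($2, field, " ")
        if (field[1] == "1") print "on"
        else if (field[1] == "0") print "off"
        else print "unknown"
    }'
}

# Disable the drive write cache, then print the state the drive reports (needs root).
disable_write_cache() {
    hdparm -W 0 "$DISK" >/dev/null || return 1
    hdparm -W "$DISK" | write_cache_state
}
```

Running disable_write_cache once, power-cycling the board, and then running hdparm -W /dev/sda | write_cache_state again shows whether this particular module keeps the setting.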

BTW: I'd second the idea of a non-writable root filesystem (so your system could boot in a kind of "recovery mode" and allow remote access even if the writable filesystem cannot be mounted for some reason). And if you can change the hardware design, consider using MTD devices with a flash-aware filesystem like jffs2 instead of IDE/SATA disks. We've been using this combination in several embedded devices (mostly VPN router solutions in the field) for several years with good results.

Update: the root of your problem seems to be that you are running an ext4 filesystem with journaling disabled: has_journal is missing from the "Filesystem features" list. To fix this, shut down all services, check whether anything still has files open using lsof +f -- /, and remount your root partition read-only with mount -o remount,ro /. Then enable the journal with tune2fs -O has_journal /dev/sda1 and set the "ordered" journal mode as the default mount option with tune2fs -o journal_data_ordered /dev/sda1. After this operation you will have to re-run fsck (preferably from a rescue system) and then remount root or reboot.
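The steps above can be sketched as a pair of shell helpers. This is only an outline under the question's assumptions (root filesystem on /dev/sda1, run as root with services already stopped); the function names and the DEV variable are illustrative, and the fsck pass from a rescue system still has to follow.

```shell
#!/bin/sh
# Sketch only. DEV defaults to /dev/sda1, the root partition named in the question.
DEV="${DEV:-/dev/sda1}"

# Succeeds if the `dumpe2fs -h` output on stdin lists the has_journal feature.
has_journal() {
    grep '^Filesystem features:' | grep -qw 'has_journal'
}

# Apply the commands from the answer in order (needs root, services stopped).
enable_ordered_journal() {
    lsof +f -- /                               # show anything still holding files open on /
    mount -o remount,ro / || return 1          # remount root read-only first
    tune2fs -O has_journal "$DEV" || return 1  # add the journal
    tune2fs -o journal_data_ordered "$DEV"     # make "ordered" the default journal mode
}
```

Before and after the change, dumpe2fs -h /dev/sda1 | has_journal && echo journaled gives a quick confirmation that the feature flag is now present.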

With these settings in place, the metadata is guaranteed to be recoverable from the journal even after a sudden power failure. The file data is also written to disk consistently, although on bootup you may find that data written in the last several seconds before the outage has been lost. If this is not acceptable, consider making full data journaling the default mount option with tune2fs -o journal_data /dev/sda1. This puts all data written to disk through the journal, which gives you better data consistency at the cost of a performance penalty and more wear on your SSD.

Solution 2

The write cache suggestion is a good start, but this sounds like an architectural design flaw. On an embedded system the internal flash should probably NOT be mounted read-write except in rare circumstances. You should really be doing most of the work in a memory filesystem and syncing changes back to the flash on some user command or at a regular interval. It is really uncommon for an embedded system to use a regular filesystem (like ext4) in read-write mode during normal operation. If some application requirement means you need lots of storage space, consider keeping your system partition separate from the data partition and designing the system so that the data partition can be fsck -y'ed as part of startup.
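A startup check along those lines might look like the sketch below. DATA_DEV and DATA_MNT are hypothetical names for a separate data partition and its mount point, not something the original setup defines; the only fact relied on is that fsck returns a bit-mask exit status in which values below 4 mean the filesystem is clean or was repaired.

```shell
#!/bin/sh
# Sketch only. DATA_DEV and DATA_MNT are assumed names for a separate data partition.
DATA_DEV="${DATA_DEV:-/dev/sda3}"
DATA_MNT="${DATA_MNT:-/var}"

# fsck exit status is a bit mask: 1 = errors corrected, 2 = reboot suggested,
# 4 and above = errors left uncorrected or an operational failure.
fsck_status_ok() {
    [ "$1" -lt 4 ]
}

# Repair the data partition unattended, then mount it only if fsck coped.
check_and_mount_data() {
    fsck -y "$DATA_DEV"
    rc=$?
    if fsck_status_ok "$rc"; then
        mount "$DATA_DEV" "$DATA_MNT"
    else
        echo "fsck left errors on $DATA_DEV (status $rc); not mounting" >&2
        return 1
    fi
}
```

Because the system partition stays read-only, a failure here leaves the box bootable and reachable, which is the point of splitting the partitions.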

If you need some starting points, I would look at how people set up diskless Linux systems:

http://frank.harvard.edu/~coldwell/diskless/

and start from there. The general idea is that your system binaries and data can be mounted read-only, so your filesystem won't be corrupted. However, you need to be able to write to certain areas, so you need something writable there, usually a memory filesystem for /tmp and /var/tmp. Even if certain things on the flash need to be writable, you just create a script that remounts the partition read-write, commits the changes, and then goes back to read-only.
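A minimal sketch of such a script, assuming the read-only partition is the root filesystem. The helper name with_rw_root and the MOUNT_CMD variable are hypothetical; MOUNT_CMD exists only so the sequence can be dry-run without root.

```shell
#!/bin/sh
# Sketch only. MOUNT_CMD is an assumption added so the sequence can be dry-run.
MOUNT_CMD="${MOUNT_CMD:-mount}"

# Run a command with / temporarily writable, then drop back to read-only.
# Returns the exit status of the command that made the change.
with_rw_root() {
    "$MOUNT_CMD" -o remount,rw / || return 1
    "$@"                        # the change itself, e.g. copying a new config file
    status=$?
    sync                        # flush pending writes before going read-only again
    "$MOUNT_CMD" -o remount,ro / || return 1
    return "$status"
}
```

For example, with_rw_root cp /tmp/hostname.new /etc/hostname (the file names are placeholders) keeps the window in which the flash is writable as short as the copy itself.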

A really great example of this is the Cyclades hardware. It's embedded Linux, and whenever you make configuration changes you have to execute a save script, which rebundles the configs and writes them out to the flash.

Author: Jonathan Henson

Updated on September 18, 2022

Comments

  • Jonathan Henson
    Jonathan Henson over 1 year

    I have some embedded boards running an American Megatrends BIOS with embedded Linux as the OS. The problem I have is that the industrial IDE flash modules get corrupted on power loss. I have them formatted as ext4. Whenever this happens, I can usually fix the flash with fsck, but this will not be possible in our deployments. I have heard that disabling write caching should help, but I can't figure out how to do it. Also, is there anything else I should do?

    More Info

    The drive is a 4gb ide flash module. I have one partition which is ext4. The O.S. is installed on that partition and grub is my bootloader.

    fdisk -l shows /dev/sda as my flash module with /dev/sda1 as my primary partition.

    After a power loss I usually cannot make it entirely through the boot init scripts.

    When I mount the drive on another P.C. I run fsck /dev/sda1. It always shows messages like

    "zero datetime on node 1553 ... fix (y)?"
    

    I fix them and it boots fine until the next power loss.

    When I get to the office tomorrow, I will post the actual output of fdisk -l

    This is all I know about how the system works. I am not a systems guy, I am a Software Engineer that has a habit of getting into predicaments that are outside of his job description. I know how to format drives, install a bootloader, write software, and hack on an operating system.

    Here is the output from dumpe2fs

    #sudo dumpe2fs /dev/sda1
    dumpe2fs 1.41.12 (17-May-2010)
    Filesystem volume name:   VideoServer
    Last mounted on:          /
    Filesystem UUID:          9cba62b0-8038-4913-be30-8eb211b23d78
    Filesystem magic number:  0xEF53
    Filesystem revision #:    1 (dynamic)
    Filesystem features:      ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
    Filesystem flags:         signed_directory_hash 
    Default mount options:    (none)
    Filesystem state:         not clean
    Errors behavior:          Continue
    Filesystem OS type:       Linux
    Inode count:              245760
    Block count:              977949
    Reserved block count:     48896
    Free blocks:              158584
    Free inodes:              102920
    First block:              0
    Block size:               4096
    Fragment size:            4096
    Reserved GDT blocks:      239
    Blocks per group:         32768
    Fragments per group:      32768
    Inodes per group:         8192
    Inode blocks per group:   512
    Flex block group size:    16
    Filesystem created:       Fri Feb  4 15:12:00 2011
    Last mount time:          Sun Oct  2 23:48:37 2011
    Last write time:          Mon Oct  3 16:34:01 2011
    Mount count:              2
    Maximum mount count:      26
    Last checked:             Tue Oct  4 07:44:50 2011
    Check interval:           15552000 (6 months)
    Next check after:         Sun Apr  1 07:44:50 2012
    Lifetime writes:          21 GB
    Reserved blocks uid:      0 (user root)
    Reserved blocks gid:      0 (group root)
    First inode:              11
    Inode size:           256
    Required extra isize:     28
    Desired extra isize:      28
    Default directory hash:   half_md4
    Directory Hash Seed:      249d2b79-1e20-49a3-b324-6cb631294a63
    Journal backup:           inode blocks
    
  • Jonathan Henson
    Jonathan Henson over 12 years
    So is the write cache my problem or something else?
  • the-wabbit
    the-wabbit over 12 years
    Well, how should I know, it's your system after all :-) You should give some details on the file system mount options used (did you enable extents? what kind of data / journal mode?) and the kind of corruption you're seeing (fsck output would be best) for a more detailed analysis.
  • Jonathan Henson
    Jonathan Henson over 12 years
    OK, thanks. I am a helpless software engineer, you know :). I'll get some details. I am adding some details within the minute.
  • Jonathan Henson
    Jonathan Henson over 12 years
    I don't know what extents are and I am not sure what a Journal mode is.
  • Jonathan Henson
    Jonathan Henson over 12 years
    There are configuration files that need to be edited by the application as well as the /etc/networks and the hostname file. Could you give me a recommendation i.e. something like, you need one partition with such and such type and another for your config files of another type and so on? I really have no idea about these things. I write software and am magically expected to know exactly (not that I don't know enough to write *nix software, but I certainly don't know as much as a dedicated systems guy) how the hardware should work by my employer.
  • David Schwartz
    David Schwartz over 12 years
    Worst case, you can use a system partition (never writable) and two configuration partitions. If the primary partition is unreadable or incomplete, boot from the secondary, reformat the primary, and copy the secondary into it. Update the primary and secondary in non-overlapping operations.
  • the-wabbit
    the-wabbit over 12 years
    Ah, I see. Just post the first lines of the output of dumpe2fs /dev/sda1 (or whatever your device/partition name for this system would be) - they should contain all relevant information. And the mount options for the root filesystem from /etc/fstab should help as well.
  • Jonathan Henson
    Jonathan Henson over 12 years
    Ok, I updated my answer. I will probably take your advice and take this to an old professor of mine from my graduate program. In the meantime, is there a quick and dirty that will at least get me in a better position that doesn't include my ass in a frying pan?
  • polynomial
    polynomial over 12 years
    Turning off write caching or running 'sync' on a regular basis would probably help in the short term.
  • the-wabbit
    the-wabbit over 12 years
    The answer's updated as well.
  • Jonathan Henson
    Jonathan Henson over 12 years
    Thank you! That works beautifully! Will these settings be transferred on a dd?
  • Jonathan Henson
    Jonathan Henson over 12 years
    I am still occasionally getting some orphaned nodes which causes the system not to boot. Any ideas?
  • the-wabbit
    the-wabbit about 12 years
    @JonathanHenson what journal settings did you end up using?
  • Jonathan Henson
    Jonathan Henson about 12 years
    I re-partitioned everything to put /dev/sda1 as /, /dev/sda3 as /var, /dev/sda4 as /usr, and so on. I then set up / to be read-only, with all filesystems as ext3 in ordered mode. Most of my config files are in /var, but when I need to write to /etc, I run mount -o remount,rw /, make my changes, then run mount -o remount,ro /. I run fsck -y on all non-read-only partitions at startup. Since then, I haven't had any more problems. If you'd like, I can show you the output of mount and fdisk -l.
  • Robert Calhoun
    Robert Calhoun over 10 years
    you said fsck seems to fix your problem: you can tell ext4 to fsck every boot by setting "Maximum mount count" to one. (I'm a little unclear on whether tune2fs -c0 always runs e2fsck or never runs e2fsck, but it's one or the other.)