fallocate vs posix_fallocate


Solution 1

Files that occupy more storage space than their reported length are unusual, so unless you have a good reason for that (e.g. you want to use the file length to keep track of how far a download got, for the purpose of resuming it), it's best to use the default fallocate(2) behaviour (without FALLOC_FL_KEEP_SIZE). These are the same semantics as posix_fallocate(3).

The man page for fallocate(2) even says that its default behaviour (no flags) is intended as an optimal way of implementing posix_fallocate(3), and points to that as a portable way to allocate space.

The original question says something about writing zeros to the file. None of these calls write anything but metadata. If you read from space that's been preallocated but not yet written, you'll get zeros (not whatever was in that disk space previously, that would be a big security hole). You can only read up to the end of a file (the length, set by fallocate, ftruncate, or various other ways), so if you have a zero-length file and fallocate with FALLOC_FL_KEEP_SIZE, then you can't read anything. Nothing to do with preallocation, just file size semantics.
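To make the file size semantics concrete, here is a minimal, Linux-only sketch. The helper name size_after_fallocate and the temporary path in the usage note are made up for illustration; the only real APIs used are open, fallocate, fstat, close and unlink. It preallocates space and reports the st_size that results.

```c
#define _GNU_SOURCE 1   /* glibc: exposes fallocate() and FALLOC_FL_KEEP_SIZE */
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) `path`, preallocate `len` bytes
 * with the given fallocate mode, and return the resulting file length.
 * Returns -1 if any step fails, e.g. EOPNOTSUPP on a filesystem that
 * cannot preallocate. The file is removed again before returning. */
long long size_after_fallocate(const char *path, int mode, off_t len)
{
    struct stat st;
    long long result = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;
    if (fallocate(fd, mode, 0, len) == 0 && fstat(fd, &st) == 0)
        result = (long long)st.st_size;
    close(fd);
    unlink(path);
    return result;
}
```

On a filesystem that supports preallocation, size_after_fallocate("/tmp/demo", 0, 400000) should report 400000, while the same call with FALLOC_FL_KEEP_SIZE should report 0: the blocks are reserved, but there is still nothing to read because the length did not change.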

So if you're fine with the POSIX semantics, use it, because it's more portable. Every GNU/Linux system will support posix_fallocate(3), but so will some other systems.
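One detail worth getting right when using the portable call: posix_fallocate returns the error number directly instead of returning -1 and setting errno. A minimal sketch follows; the helper name reserve_space is hypothetical and not from the original answer.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: make sure the first `len` bytes of the file at `path`
 * have storage allocated, creating the file if needed.
 * Returns 0 on success, or a positive errno-style code on failure. */
int reserve_space(const char *path, off_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0666);
    if (fd < 0)
        return errno;               /* open() reports errors through errno */

    /* posix_fallocate() hands back the error number as its return value,
     * so do not test for -1 and do not consult errno here. */
    int err = posix_fallocate(fd, 0, len);
    close(fd);
    return err;
}
```

A caller can pass a non-zero result straight to strerror() to get a readable message.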

However, thanks to POSIX semantics, it's not that simple. If you use it on a filesystem that doesn't support preallocation, it will still succeed, but do so by falling back to actually writing a zero in every block of the file.

Test program:

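#include <fcntl.h>   /* open(), posix_fallocate() */
int main(void) {
    int fd = open("foo", O_RDWR|O_CREAT, 0666);
    if (fd < 0) return 1;
    /* posix_fallocate() returns 0 on success or an error number on failure */
    return posix_fallocate(fd, 0, 400000);
}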

On XFS:

$ strace ~/src/c/falloc
...
open("foo", O_RDWR|O_CREAT, 0666) = 3
fallocate(3, 0, 0, 400000)              = 0
exit_group(0)                           = ?

On a FAT32 flash drive:

open("foo", O_RDWR|O_CREAT, 0666) = 3
fallocate(3, 0, 0, 400000)              = -1 EOPNOTSUPP (Operation not supported)
fstat(3, {st_mode=S_IFREG|0755, st_size=400000, ...}) = 0
fstatfs(3, {f_type="MSDOS_SUPER_MAGIC", f_bsize=65536, f_blocks=122113, f_bfree=38274, f_bavail=38274, f_files=0, f_ffree=0, f_fsid={2145, 0}, f_namelen=1530, f_frsize=65536}) = 0
pread(3, "\0", 1, 6783)                 = 1
pwrite(3, "\0", 1, 6783)                = 1
pread(3, "\0", 1, 72319)                = 1
pwrite(3, "\0", 1, 72319)               = 1
pread(3, "\0", 1, 137855)               = 1
pwrite(3, "\0", 1, 137855)              = 1
pread(3, "\0", 1, 203391)               = 1
pwrite(3, "\0", 1, 203391)              = 1
pread(3, "\0", 1, 268927)               = 1
pwrite(3, "\0", 1, 268927)              = 1
pread(3, "\0", 1, 334463)               = 1
pwrite(3, "\0", 1, 334463)              = 1
pread(3, "\0", 1, 399999)               = 1
pwrite(3, "\0", 1, 399999)              = 1
exit_group(0)                           = ?

It does avoid the reads if the file wasn't yet that long, but writing every block is still horrible.

If you want something simple, I'd still just go with posix_fallocate. There's a FreeBSD man page for it, and it's specified by POSIX, so every POSIX-compliant system provides it. The one drawback is that it will be horribly slow with glibc on a filesystem that doesn't support preallocation. See for example https://plus.google.com/+AaronSeigo/posts/FGtXM13QuhQ. For a program that works with large files (e.g. torrents), this could be really bad.

You can thank POSIX semantics for requiring glibc to do this, as POSIX doesn't define an error code for "the filesystem doesn't support preallocation" (see http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html). It also guarantees that if the call succeeds, subsequent writes into the allocated region won't fail due to lack of disk space. So the POSIX design doesn't provide a way to handle the case where the caller cares about efficiency / performance / fragmentation, rather than disk space guarantees. This forces the POSIX implementation to do the read-write loop, rather than leaving that as an option for callers that need a disk-space guarantee. Thanks POSIX...
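If all you want is the performance hint, one option on Linux is to call fallocate(2) yourself and treat "not supported" as a non-fatal outcome, so nothing is ever written as a fallback. This is only a sketch: the helper name preallocate_hint and its return convention are invented for illustration.

```c
#define _GNU_SOURCE 1   /* glibc: exposes fallocate() */
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper: request preallocation purely as a hint.
 * Returns  1 if the space was preallocated,
 *          0 if the filesystem (or kernel) cannot preallocate; nothing
 *            is written and the file is left untouched,
 *         -1 on any other error, with errno set by fallocate(). */
int preallocate_hint(int fd, off_t len)
{
    if (fallocate(fd, 0, 0, len) == 0)
        return 1;
    if (errno == EOPNOTSUPP || errno == ENOSYS)
        return 0;
    return -1;
}
```

The caller gives up the disk-space guarantee in the 0 case, which is exactly the trade-off that posix_fallocate(3) does not let you make.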

I don't know whether non-GNU implementations of posix_fallocate (FreeBSD, Solaris?) similarly fall back to extremely slow read-write behaviour when the filesystem doesn't support preallocation. Apparently OS X (Darwin) doesn't implement posix_fallocate, unless it's very recent.

If you're looking to support preallocation across a lot of platforms, but without falling back to read-then-write if the OS has a way to just attempt preallocation, you have to use whatever platform-specific method is available. For example, check out https://github.com/arvidn/libtorrent/blob/master/src/file.cpp and search for file::set_size. It has several ifdeffed blocks depending on what the compile target supports, starting with Windows code to load DLLs and do stuff there, then fcntl F_PREALLOCATE, or fcntl F_ALLOCSP64, then Linux fallocate(2), and finally it falls back to using posix_fallocate. I also found this 2007 list post for OS X Darwin: http://lists.apple.com/archives/darwin-dev/2007/Dec/msg00040.html
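The general shape of such a platform switch can be sketched as below. This is not the libtorrent code: the function name preallocate_portable is hypothetical, the Windows and F_ALLOCSP64 branches are left out, and the macOS branch assumes the usual fstore_t / F_PREALLOCATE pattern followed by ftruncate to set the length.

```c
#if defined(__linux__) && !defined(_GNU_SOURCE)
#define _GNU_SOURCE 1   /* glibc: exposes fallocate() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: allocate storage for the first `len` bytes of `fd`
 * using the most direct mechanism the compile target offers.
 * Returns 0 on success, or an errno-style code on failure. */
int preallocate_portable(int fd, off_t len)
{
#if defined(__APPLE__)
    /* macOS: F_PREALLOCATE reserves space past the physical end of file
     * but does not change the reported length. */
    fstore_t store = { F_ALLOCATECONTIG, F_PEOFPOSMODE, 0, len, 0 };
    if (fcntl(fd, F_PREALLOCATE, &store) == -1) {
        store.fst_flags = F_ALLOCATEALL;   /* retry without contiguity */
        if (fcntl(fd, F_PREALLOCATE, &store) == -1)
            return errno;
    }
    /* Assumes the file is not already longer than len. */
    return ftruncate(fd, len) == 0 ? 0 : errno;
#elif defined(__linux__)
    /* Linux: fails with EOPNOTSUPP instead of writing zeros. */
    return fallocate(fd, 0, 0, len) == 0 ? 0 : errno;
#else
    /* Everything else: the portable call, which may fall back to writes. */
    return posix_fallocate(fd, 0, len);
#endif
}
```

Callers that only want a hint can ignore an EOPNOTSUPP result; callers that need the disk-space guarantee can fall back to posix_fallocate themselves.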

Solution 2

I take it you didn't look at the documentation that says

   The mode argument determines the operation to be performed on the given range.
   Currently only one flag is supported for mode:

   FALLOC_FL_KEEP_SIZE
          This flag allocates and initializes to zero the disk space within the
          range specified by offset and len.  After a successful call, subsequent
          writes into this range are guaranteed not to fail because of lack of
          disk space.  Preallocating zeroed blocks beyond the end of the file is
          useful for optimizing append workloads.  Preallocating blocks does not
          change the file size (as reported by stat(2)) even if it is less than
          offset+len.

   If FALLOC_FL_KEEP_SIZE flag is not specified in mode, the default behavior is
   almost same as when this flag is specified.  The only difference is that on
   success, the file size will be changed if offset + len is greater than the
   file size.  This default behavior closely resembles the behavior of the
   posix_fallocate(3) library function, and is intended as a method of optimally
   implementing that function.

The man page for posix_fallocate() doesn't appear to have the same thing mentioned, but instead, looking at the source here, it seems to write each block of the file (line 88).

See: man fallocate, man posix_fallocate

Solution 3

At least one bit of information is from the fallocate(2) man page:

int fallocate(int fd, int mode, off_t offset, off_t len);

DESCRIPTION
   This is a nonportable, Linux-specific system call.

Though the system call documentation does not say it, the fallocate(1) program man page says:

As of the Linux Kernel v2.6.31, the fallocate system call is supported
by the btrfs, ext4, ocfs2, and xfs filesystems.

This makes sense to me, as the NTFS, FAT, CDFS, and most other common file systems do not have an internal mechanism on disk to support the call. I presume support for those would be buffered by the kernel and the setting would not persist across system boots.

Author: Bill

Updated on July 10, 2022

Comments

  • Bill
    Bill almost 2 years

    I am debating which function to use between posix_fallocate and fallocate. posix_fallocate writes a file right away (initializes the characters to NULL). However, fallocate does not change the file size (when using FALLOC_FL_KEEP_SIZE flag). Based on my experimentation, it seems that fallocate does not write NULL or zero characters to the file.

    Can someone please comment based on your experience? Thanks for your time.

  • Bill
    Bill over 11 years
    "Currently only one flag is supported for mode" -- this may not be technically correct. There are some more flags which are supported -- man7.org/linux/man-pages/man2/fallocate.2.html
  • Mats Petersson
    Mats Petersson over 11 years
    Hmm, so I guess one of the problems with a gazillion copies of man pages all over the internet is that you never know when you have found the correct one... However, my Fedora Core 16 man-page says the same one flag as the quoted text, so I'm not sure which is more accurate. It would appear that two are available here: lxr.linux.no/linux+*/fs/open.c#L228 - it would appear it was introduced in 2.6.38 and later kernels.
  • Peter Cordes
    Peter Cordes about 10 years
    glibc's posix_fallocate(3) only falls back to the read-then-write behaviour if fallocate(2) fails with EOPNOTSUPP, e.g. on a fat32 fs, or aufs. There is no way to disable the fallback behaviour in the case where you just want a performance hint, not the disk space guarantee, without using fallocate(2) or fcntl directly, rather than the portable posix_fallocate(3). No idea what posix_fallocate implementations on other systems do (FreeBSD, OS X, Solaris?). See my answer for some links, including code that checks for availability of several methods.
  • Nemo
    Nemo about 9 years
    Actually Linux just returns -1 with errno set to ENOSYS when the file system does not support the operation. This is documented in at least some versions of the man page.
  • Nemo
    Nemo about 9 years
    The latest version of POSIX has added "...or the underlying file system does not support this operation" to the meaning of EINVAL, so glibc and others could now actually implement posix_fallocate in a reasonable way. In theory.
  • Peter Cordes
    Peter Cordes about 9 years
    Nice! It will be many years before linux distros with a write-zeros fallback fall out of use, though, so even for new software it's not safe to assume nice behaviour. The maintainers might also decide that a GNU system should always have a working posix_fallocate, and keep their fallback.
  • Anon
    Anon almost 7 years
    Full fat Linux distros are unlikely to switch to returning EINVAL for posix_fallocate because glibc will keep the current behaviour for compatibility purposes indefinitely. See the threads containing sourceware.org/ml/libc-alpha/2015-10/msg00138.html and sourceware.org/ml/libc-alpha/2015-05/msg00062.html for details.
  • Peter Cordes
    Peter Cordes almost 7 years
    NFS4.2 supports fallocate? Nice, I hadn't realized. Reading from a hole or an unwritten extent still sends the zeros over the wire, though. :(