How can I speed up operations on sparse files with tar, gzip, rsync?

rsync tar sparse-files

7,665

Solution 1

bsdtar (at least as of libarchive 3.1.2) can detect sparse sections using the FS_IOC_FIEMAP ioctl on file systems that support it (it also supports a number of other APIs). Strangely, at least in my test, it could not extract the tar files it had generated itself, which looks like a bug.

Extracting those archives with GNU tar does work, but GNU tar can't handle some of the extended attributes that bsdtar supports. So

bsdtar cf - sparse-files | (cd elsewhere && tar xpf -)

works as long as the files don't have extended attributes or flags.

It still doesn't work for files that are fully sparse (all zeros): the FS_IOC_FIEMAP ioctl then returns zero extents, and bsdtar doesn't seem to handle that properly (another bug?).

star (Schily tar) is another open-source tar implementation that can detect sparse files (use the -sparse option) and doesn't have those bsdtar bugs, though few systems package it.
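To check whether a given pipeline actually preserved the holes, you can compare the allocated size with the apparent size, the same two numbers that du -h and du -h --apparent-size report. A minimal sketch, assuming GNU du; the function name is_sparse is made up for illustration, and file systems with transparent compression can make a dense file look sparse to this test:

```shell
# is_sparse FILE: succeed if FILE occupies fewer 1K blocks on disk than
# its apparent length. Assumes GNU du (--apparent-size is not POSIX).
is_sparse() {
    allocated=$(du -k -- "$1" | cut -f1)
    apparent=$(du -k --apparent-size -- "$1" | cut -f1)
    [ "$allocated" -lt "$apparent" ]
}
```

For example, is_sparse elsewhere/disk.img && echo "still sparse" after extracting (the path is only a placeholder).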

Solution 2

This article has some useful suggestions for rsync at least:

Problems

Using rsync --sparse works, but causes a huge amount of unnecessary disk writes. Changing 10 bytes in a 50GB file (1GB used) should cause only one or two blocks to be written; instead, it causes 1GB to be written. This is slow, and possibly not good for the disks' longevity.

Using rsync --inplace works, but creates non-sparse files.

You cannot use --sparse and --inplace at the same time :-( rsync disallows this with the error "rsync: --sparse cannot be used with --inplace".

Solution

If you use --inplace to update a pre-existing sparse file, the file will remain sparse and only have a small number of blocks written. It's only when rsync --inplace creates a file that it makes it non-sparse.

So the solution is to create a corresponding, correctly sized, empty, sparse file on the target machine for every file on the source machine that isn't yet present on the target machine.

Then rsync --inplace will work as intended, leaving sparse files sparse, and only writing the changed blocks to disk.

So, if I read that correctly, you first need to create an empty sparse file of the right size on the target. You can do this with:

truncate -s 3G filename

You can then use rsync --inplace to copy the files over. Pre-creating the file should only be necessary once, since later --inplace runs update the existing sparse file.
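For more than one file, the pre-creation step can be scripted. The following is only a sketch under some assumptions: both trees are reachable as local paths (for a remote target, the equivalent truncate calls would have to run on that machine), GNU truncate is available (its -r option copies the length of a reference file), and the function name precreate_sparse is made up for illustration.

```shell
# precreate_sparse SRC DST: for every regular file under SRC that does not
# yet exist under DST, create an empty sparse file of the same length.
# Existing files in DST are left untouched. Assumes GNU truncate.
precreate_sparse() {
    src=$1
    mkdir -p "$2" || return 1
    dst=$(cd "$2" && pwd) || return 1   # absolute path, since we cd below
    (
        cd "$src" || exit 1
        find . -type f -exec sh -c '
            dst=$1; shift
            for rel in "$@"; do
                [ -e "$dst/$rel" ] && continue
                mkdir -p "$dst/$(dirname "$rel")"
                # -r: take the length from the source file; no data is written
                truncate -r "$rel" "$dst/$rel"
            done
        ' sh "$dst" {} +
    )
}
```

After precreate_sparse src dst, a run such as rsync -a --inplace src/ dst/ then fills in only the blocks that actually contain data.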

The same article suggests using Virtsync, which is

a $49 commercial Linux command-line tool for synchronizing the contents of huge files (such as virtual machine disk images and databases).

This might be the best solution if you're willing to pay for it since it seems to be written specifically for this type of situation.


adrelanos

A maintainer of Whonix.

Updated on September 18, 2022

Comments

adrelanos almost 2 years

I have a sparse file. (du -h reports 3G and du -h --apparent-size reports 100G.) So far, so good.

Now, when I want to compress the file using tar or send it over the network using rsync, it takes about as long as it would for the full 100G. It seems these tools read all the zeros.

I thought the holes were somehow marked, so that these tools could just skip them?

I assume there is nothing wrong with my file?

Is it a missing feature that tar and rsync don't look for the holes in sparse files? I used tar's --sparse option, but that didn't speed things up. Neither did rsync's --sparse option.

Is there any way to speed these tools up on sparse files?