Fastest way to combine many files into one (tar czf is too slow)


Solution 1

You should check whether most of your time is being spent on the CPU or on I/O. Either way, there are ways to improve it:

A: don't compress

You didn't mention "compression" in your list of requirements, so try dropping the "z" from your argument list: tar cf. This might speed things up a bit.

There are other techniques to speed up the process, like using "-N" (--newer) to skip files you have already backed up.
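For example, a minimal sketch assuming GNU tar (the archive path, source path, and cutoff date are placeholders):

tar cf /backup/files.tar --newer="2022-09-17" /path/to/files

With --newer (-N), only files changed since the given date are added, so each run picks up just the new material.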

B: back up the whole partition with dd

Alternatively, if you're backing up an entire partition, take a copy of the whole disk image instead. This saves processing and a lot of disk head seek time. tar and any other program working at a higher level have the overhead of reading and processing directory entries and inodes to find where the file content lives, and of doing extra disk head seeks to read each file from a different place on the disk.

To backup the underlying data much faster, use:

dd bs=16M if=/dev/sda1 of=/another/filesystem

(This assumes you're not using RAID, which may change things a bit)
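If you still want the image compressed, one sketch is to pipe dd through a fast parallel compressor such as pigz, which also comes up in Solution 4 and the comments below (the output path is just an example):

dd bs=16M if=/dev/sda1 | pigz > /another/filesystem/sda1.img.gz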

Solution 2

To repeat what others have said: we need to know more about the files that are being backed up. I'll go with some assumptions here.

Append to the tar file

If files are only being added to the directories (that is, no file is being deleted), make sure you are appending to the existing tar file rather than re-creating it every time. You can do this by running tar in append/update mode (r or u) against the existing archive filename instead of creating a new archive (or deleting the old one and starting over).
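A minimal sketch, assuming GNU tar and an uncompressed archive (paths are placeholders); u/--update appends only files newer than the copies already stored in the archive, and it does not work on compressed archives:

tar uf /backup/files.tar /path/to/files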

Write to a different disk

Reading from the same disk you are writing to may be killing performance. Try writing to a different disk to spread the I/O load. If the archive file needs to be on the same disk as the original files, move it afterwards.
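For example (the mount point and paths are illustrative):

tar cf /mnt/otherdisk/files.tar /path/to/files
mv /mnt/otherdisk/files.tar /path/to/     # only if the archive must end up on the original disk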

Don't compress

Just repeating what @Yves said. If your backup files are already compressed, there's not much need to compress again. You'll just be wasting CPU cycles.

Solution 3

Using tar with lz4 compression, as in

tar cvf - myFolder | lz4 > myFolder.tar.lz4

gives you the best of both worlds (rather good compression AND speed). Expect a compression ratio of about 3 even if your data contains binary files.
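To extract such an archive, reverse the pipe (a sketch; -d decompresses and -c writes to stdout):

lz4 -dc myFolder.tar.lz4 | tar xvf -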

Further reading: comparison of compression algorithms; How to tar with lz4.

Solution 4

I'm surprised that no one has mentioned dump and restore. They will be a lot faster than dd if you have free space in the filesystem.

Note that depending on the filesystem in question you may need different tools:

  • ext2/3/4 - dump and restore (package dump in RH/Debian)
  • XFS - xfsdump and xfsrestore (package xfsdump in RH/Debian)
  • ZFS - zfs send and zfs recv
  • BTRFS - btrfs send and btrfs receive

Note that most of these programs have no built-in compression (all of them except dump, in fact): pipe to stdout and use pigz as needed. ;-)
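For instance, a sketch for an ext4 filesystem (the device and output path are placeholders): write a level-0 dump to stdout and compress it with pigz, then decompress and pipe into restore when you need the data back:

dump -0 -f - /dev/sda1 | pigz > /another/filesystem/root.dump.gz   # level-0 dump, compressed on the fly
pigz -dc /another/filesystem/root.dump.gz | restore -rf -          # run from inside the target directory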

Author: Najib-botak Chin, updated on September 18, 2022

Comments

  • Najib-botak Chin, over 1 year

    Currently I'm running tar czf to combine backup files. The files are in a specific directory.

    But the number of files is growing. Using tar czf takes too much time (more than 20 minutes and counting).

    I need to combine the files more quickly and in a scalable fashion.

    I've found genisoimage, readom and mkisofs. But I don't know which is fastest and what the limitations are for each of them.

    • Gilles 'SO- stop being evil', over 12 years
      I doubt that tar introduces any significant overhead, reading the files is the expensive operation here. You should either modify the way your files are stored, or use a radically different approach (copy the filesystem as a whole). We can't help you much without knowing how your files are organized.
    • Rufo El Magufo, over 12 years
      Mount your FS with the "noatime" option; it may speed up the I/O operations.
    • J. M. Becker, over 10 years
      +1 for noatime, it really does make a significant difference, especially for regular hard drives, and also just for reducing extraneous writes.
    • Aleksandr Dubinsky, about 2 years
      tar is indeed slow because it doesn’t make full use of async I/O.
  • Rufo El Magufo, over 12 years
    Don't compress, or use pigz if the system has more than one processor.
  • LiveWireBT, over 6 years
    LZ4/zstd and similarly fast compression algorithms may still be worth checking: they can speed up the process simply by writing less data (if the data is compressible at all) while being an order of magnitude faster at compression, though less efficient depending on the level and algorithm. Also, man gzip says "The default compression level is -6", so there is room for improvement.
  • Lester Cheung, over 4 years
    What StefanQ is saying is that you need to choose your compressor depending on where your bottleneck is. Also, remember you can save the output to a different physical storage device or even a remote machine!
  • mgutt, over 2 years
    Multiple critical bugs and last support was 2016: sourceforge.net/p/dump/bugs
  • Lester Cheung, over 2 years
    Wrong project - the correct project should be: sourceforge.net/projects/e2fsprogs - I'm pretty sure all dump and restore tools for all the filesystems listed above are still actively maintained...
  • mgutt, over 2 years
    No, it's correct. dump is a separate project from e2fsprogs, as mentioned on their website: e2fsprogs.sourceforge.net/ext2.html
  • Lester Cheung, over 2 years
    You are absolutely correct - I stand corrected. But seriously, appending files to archives/filesystem snapshots is the way to go in 2021 ;-)