Fastest way combine many files into one (tar czf is too slow)
Solution 1
You should check whether most of your time is being spent on CPU or on I/O. Either way, there are ways to improve it:
A: don't compress
You didn't mention compression in your list of requirements, so try dropping the "z" from your argument list: tar cf
. This might speed things up a bit.
There are other techniques to speed up the process, like using "-N" to skip files you have already backed up before.
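Both ideas can be sketched as follows; the paths and the date are hypothetical stand-ins (a temporary directory takes the place of the real backup directory so the sketch is safe to run):

```shell
# Stand-in source directory with one sample file
src=$(mktemp -d)
echo "data" > "$src/a.txt"

# A: plain, uncompressed archive -- simply drop the "z" flag
tar cf /tmp/plain.tar -C "$src" .

# Incremental variant: GNU tar's -N/--newer skips files older than the given date
tar cf /tmp/newer.tar -N "2000-01-01" -C "$src" .
```

With `-N`, files whose modification time predates the date are skipped, so successive runs only pick up what changed.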
B: backup the whole partition with dd
Alternatively, if you're backing up an entire partition, take a copy of the whole disk image instead. This saves processing and a lot of disk head seek time. tar
and any other program working at a higher level has the overhead of reading and processing directory entries and inodes to find where the file content is, and of doing more disk head seeks, reading each file from a different place on the disk.
To backup the underlying data much faster, use:
dd bs=16M if=/dev/sda1 of=/another/filesystem
(This assumes you're not using RAID, which may change things a bit)
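The same dd invocation can be tried safely with a file-backed image standing in for /dev/sda1 (the image name and sizes below are made up for illustration):

```shell
# Create a small sparse file to play the role of the partition
truncate -s 4M /tmp/fake_part.img

# Raw block copy, same flags as the real command; stderr silenced for brevity
dd if=/tmp/fake_part.img of=/tmp/part_backup.img bs=16M 2>/dev/null
```

On a real partition, unmount it (or mount it read-only) first so the image is consistent.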
Solution 2
To repeat what others have said: we need to know more about the files that are being backed up. I'll go with some assumptions here.
Append to the tar file
If files are only being added to the directories (that is, no file is being deleted), make sure you are appending to the existing tar file rather than re-creating it every time. You can do this by specifying the existing archive filename in your tar
command instead of a new one (or deleting the old one).
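With GNU tar, appending is done with the "r" (append) mode against the existing archive; a minimal sketch with throwaway paths:

```shell
# Stand-in directory with an initial file
d=$(mktemp -d)
echo one > "$d/one.txt"
tar cf /tmp/grow.tar -C "$d" one.txt   # create the archive once

# Later, a new file appears; append it instead of re-creating the archive
echo two > "$d/two.txt"
tar rf /tmp/grow.tar -C "$d" two.txt   # "r" appends to the existing archive
```

Note that append mode only works on uncompressed archives, which fits the "don't compress" advice above.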
Write to a different disk
Reading from the same disk you are writing to may be killing performance. Try writing to a different disk to spread the I/O load. If the archive file needs to be on the same disk as the original files, move it afterwards.
Don't compress
Just repeating what @Yves said. If your backup files are already compressed, there's not much need to compress again. You'll just be wasting CPU cycles.
Solution 3
Using tar with lz4 compression, as in
tar cvf - myFolder | lz4 > myFolder.tar.lz4
gives you the best of both worlds (rather good compression AND speed). Expect a compression ratio of about 3 even if your data contains binary files.
Further reading: comparison of compression algorithms; How to tar with lz4
Solution 4
I'm surprised that no one mentioned dump and restore. It will be a lot faster than dd if you have free space in the filesystem.
Note that depending on the filesystem in question you may need different tools:
- ext2/3/4 - dump and restore (package dump in RH/Debian)
- XFS - xfsdump and xfsrestore (package xfsdump in RH/Debian)
- ZFS - zfs send and zfs recv
- BTRFS - btrfs send and btrfs receive
Note that most of these programs lack built-in compression (all except dump) - pipe to stdout and use pigz as needed. ;-)
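The pipe-to-stdout pattern is the same for all of these tools; plain gzip stands in for pigz below (pigz is a drop-in parallel replacement) so the sketch runs anywhere, and the paths are illustrative:

```shell
# Stand-in directory with a sample file
d=$(mktemp -d)
echo payload > "$d/f.txt"

# Dump/archive to stdout, compress on the way to disk
# (swap "gzip" for "pigz" to use all cores)
tar cf - -C "$d" . | gzip > /tmp/piped.tar.gz
```

The same shape applies to xfsdump, zfs send, and btrfs send: write the stream to stdout and let the compressor run in parallel.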
Najib-botak Chin
Updated on September 18, 2022
Comments
-
Najib-botak Chin over 1 year
Currently I'm running tar czf to combine backup files. The files are in a specific directory, but the number of files is growing, and using tar czf takes too much time (more than 20 minutes and counting). I need to combine the files more quickly and in a scalable fashion. I've found genisoimage, readom and mkisofs, but I don't know which is fastest and what the limitations of each are.
-
Gilles 'SO- stop being evil' over 12 years
I doubt that tar introduces any significant overhead; reading the files is the expensive operation here. You should either modify the way your files are stored, or use a radically different approach (copy the filesystem as a whole). We can't help you much without knowing how your files are organized.
-
Rufo El Magufo over 12 years
Mount your FS with the "noatime" option; it may speed up the I/O operations.
-
J. M. Becker over 10 years
+1 for noatime, it really does make a significant difference. Especially for regular hard drives, and also just for reducing extraneous writes.
-
Aleksandr Dubinsky about 2 years
tar is indeed slow because it doesn't make full use of async I/O.
-
Rufo El Magufo over 12 years
Don't compress, or use pigz if the system has more than one processor.
-
LiveWireBT over 6 years
LZ4/zstd and similarly fast compression algorithms may still be worth checking: they can speed up the process simply by writing less data (if the data is compressible at all), while being an order of magnitude faster at compression, though less efficient depending on the level and algorithm. Also, man gzip says "The default compression level is -6", so there is room for improvement.
-
Lester Cheung over 4 years
What StefanQ is saying is that you need to choose your compressor depending on where your bottleneck is. Also, remember you can save the output to a different physical storage device or even a remote machine!
-
mgutt over 2 years
Multiple critical bugs, and the last support was in 2016: sourceforge.net/p/dump/bugs
-
Lester Cheung over 2 years
Wrong project - the correct project should be sourceforge.net/projects/e2fsprogs - I'm pretty sure the dump and restore tools for all the filesystems listed above are still actively maintained...
-
mgutt over 2 years
No, it's correct. dump is a separate project of e2fsprogs, as mentioned on their website: e2fsprogs.sourceforge.net/ext2.html
-
Lester Cheung over 2 years
You are absolutely correct - I stand corrected. But seriously, appending files to archives/filesystem snapshots is the way to go in 2021 ;-)