Utilizing multi core for tar+gzip/bzip compression/decompression

Solution 1

You can use pigz instead of gzip, which does gzip compression on multiple cores. Instead of using the -z option, you would pipe it through pigz:

tar cf - paths-to-archive | pigz > archive.tar.gz

By default, pigz uses the number of available cores, or eight if it could not query that. You can ask for more with -p n, e.g. -p 32. pigz has the same options as gzip, so you can request better compression with -9. E.g.

tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz
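
The same pipe works in reverse for extraction. The following is only a minimal sketch of a round trip: the temporary directory and file names are hypothetical, and the fallback to gzip when pigz is not installed is an assumption added so the example runs anywhere.

```shell
#!/bin/sh
# Hypothetical scratch area; every name below is made up for illustration.
WORK=$(mktemp -d)
mkdir -p "$WORK/data" "$WORK/restore"
echo "hello" > "$WORK/data/file.txt"

# Assumption: use pigz when it is installed, otherwise plain gzip (same format).
if command -v pigz >/dev/null 2>&1; then GZ=pigz; else GZ=gzip; fi

# Compress: tar writes the archive to stdout (-f -), the compressor reads it.
tar -C "$WORK" -cf - data | $GZ > "$WORK/archive.tar.gz"

# Decompress: -d decompresses, -c writes to stdout, tar reads from stdin.
$GZ -dc "$WORK/archive.tar.gz" | tar -C "$WORK/restore" -xf -

RESTORED=$(cat "$WORK/restore/data/file.txt")
echo "$RESTORED"
```

Expect a much smaller speed-up on extraction than on compression, since the deflate stream itself has to be decoded serially.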

Solution 2

You can also use the tar flag "--use-compress-program=" to tell tar what compression program to use.

For example use:

tar -c --use-compress-program=pigz -f tar.file dir_to_zip 
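
If you also need to pass options to the compressor, a reasonably recent GNU tar accepts a quoted command line here (as far as I know this arrived in GNU tar 1.27, so treat that as an assumption). A small sketch with hypothetical paths, falling back to gzip when pigz is missing:

```shell
#!/bin/sh
# Hypothetical scratch directory and file names, for illustration only.
WORK=$(mktemp -d)
mkdir -p "$WORK/dir_to_zip"
echo "sample" > "$WORK/dir_to_zip/notes.txt"

# Assumption: pigz may not be installed, so fall back to gzip.
if command -v pigz >/dev/null 2>&1; then PROG="pigz -9 -p 4"; else PROG="gzip -9"; fi

# Program name and its options go together inside one quoted argument.
tar -C "$WORK" --use-compress-program="$PROG" -cf "$WORK/out.tar.gz" dir_to_zip

# GNU tar detects the compression by itself when reading a file.
LISTING=$(tar -tf "$WORK/out.tar.gz")
echo "$LISTING"
```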

Solution 3

Common approach

tar has an option for this:

-I, --use-compress-program PROG
      filter through PROG (must accept -d)

You can use a multithreaded version of the archiver or compressor utility.

The most popular multithreaded compressors are pigz (instead of gzip) and pbzip2 (instead of bzip2). For instance:

$ tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 paths_to_archive
$ tar --use-compress-program=pigz -cf OUTPUT_FILE.tar.gz paths_to_archive

The compressor must accept -d. If your replacement utility doesn't have this option and/or you need to specify additional parameters, use pipes (add parameters if necessary):

$ tar cf - paths_to_archive | pbzip2 > OUTPUT_FILE.tar.bz2
$ tar cf - paths_to_archive | pigz > OUTPUT_FILE.tar.gz

The input and output of the single-threaded and multithreaded versions are compatible. You can compress with the multithreaded version and decompress with the single-threaded version, and vice versa.
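
One quick way to convince yourself of this is to compress with the parallel tool and decompress with the classic one. This is just a sketch: the file names are hypothetical, and if pigz is not installed it falls back to gzip, which makes the check trivial.

```shell
#!/bin/sh
# Hypothetical scratch directory and input file.
WORK=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 20000; i++) print "line", i }' > "$WORK/numbers.txt"

# Assumption: pigz may be missing; gzip stands in for it in that case.
if command -v pigz >/dev/null 2>&1; then MT=pigz; else MT=gzip; fi

# Compress with the multithreaded tool, decompress with single-threaded gzip.
$MT -c "$WORK/numbers.txt" > "$WORK/numbers.txt.gz"
gzip -dc "$WORK/numbers.txt.gz" > "$WORK/roundtrip.txt"

# cmp -s is silent and only sets the exit status.
if cmp -s "$WORK/numbers.txt" "$WORK/roundtrip.txt"; then
  RESULT=identical
else
  RESULT=different
fi
echo "$RESULT"
```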

p7zip

To compress with p7zip, you need a small shell script like the following:

#!/bin/sh
case $1 in
  -d) 7za -txz -si -so e;;
   *) 7za -txz -si -so a .;;
esac 2>/dev/null

Save it as 7zhelper.sh, make it executable, and put it somewhere on your PATH. Here is an example of usage:

$ tar -I 7zhelper.sh -cf OUTPUT_FILE.tar.7z paths_to_archive
$ tar -I 7zhelper.sh -xf OUTPUT_FILE.tar.7z

xz

Regarding multithreaded XZ support: if you are running version 5.2.0 or above of XZ Utils, you can use multiple cores for compression by setting -T or --threads to an appropriate value via the environment variable XZ_DEFAULTS (e.g. XZ_DEFAULTS="-T 0").

This is a fragment of the man page for version 5.1.0alpha:

Multithreaded compression and decompression are not implemented yet, so this option has no effect for now.

However, this will not work for decompression of files that weren't also compressed with threading enabled. From the man page for version 5.2.2:

Threaded decompression hasn't been implemented yet. It will only work on files that contain multiple blocks with size information in block headers. All files compressed in multi-threaded mode meet this condition, but files compressed in single-threaded mode don't even if --block-size=size is used.
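
Putting the XZ_DEFAULTS approach together with tar's -J option might look like the sketch below. The directory and file names are hypothetical, and the check for an installed xz is an assumption added so the script degrades gracefully.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample data.
WORK=$(mktemp -d)
mkdir -p "$WORK/data"
awk 'BEGIN { for (i = 1; i <= 20000; i++) print "line", i }' > "$WORK/data/numbers.txt"

if command -v xz >/dev/null 2>&1; then
  # xz reads XZ_DEFAULTS from its environment, so tar -J picks it up too.
  # "-T 0" asks xz to use as many threads as there are cores.
  XZ_DEFAULTS="-T 0" tar -C "$WORK" -cJf "$WORK/data.tar.xz" data
  XZ_LISTING=$(tar -tJf "$WORK/data.tar.xz")
else
  XZ_LISTING="xz not installed"
fi
echo "$XZ_LISTING"
```

Setting the variable only for this one command keeps it from affecting other xz invocations; use export instead if you want it for the whole session.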

Recompiling with replacement

If you build tar from source, you can pass these options to ./configure:

--with-gzip=pigz
--with-bzip2=lbzip2
--with-lzip=plzip

After recompiling tar with these options you can check the output of tar's help:

$ tar --help | grep "lbzip2\|plzip\|pigz"
  -j, --bzip2                filter the archive through lbzip2
      --lzip                 filter the archive through plzip
  -z, --gzip, --gunzip, --ungzip   filter the archive through pigz

Solution 4

You can use the shortcut -I for tar's --use-compress-program switch, and invoke pbzip2 for bzip2 compression on multiple cores:

tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 DIRECTORY_TO_COMPRESS/

Solution 5

If you want to have more flexibility with filenames and compression options, you can use:

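find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec \
tar -P --transform='s@/my/path/@@g' -cf - {} + | \
pigz -9 -p 4 > myarchive.tar.gz

Step 1: find

find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec

This command looks for the files you want to archive, in this case the *.sql and *.log files under /my/path/ (including subdirectories). Add as many -o -name "pattern" as you want inside the escaped parentheses. The parentheses matter: without them, -o binds more loosely than the implicit "and", so -type f would apply only to the first pattern and -exec only to the last one.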

-exec will execute the next command using the results of find: tar

Step 2: tar

tar -P --transform='s@/my/path/@@g' -cf - {} +

--transform applies a sed-style replacement expression to member names. Here it strips the leading path from the files in the archive, so the tarball's root becomes the current directory when extracting. Note that you can't use the -C option to change directory, as you'd lose the benefit of find: all files in the directory would be included.

-P tells tar to keep absolute paths, so it doesn't trigger the warning "Removing leading `/' from member names". The leading '/' will be removed by --transform anyway.

-cf - tells tar to create an archive and write it to standard output, so it can be piped to pigz; the tarball name is given later by the redirection

{} + passes all the files that find found previously

Step 3: pigz

pigz -9 -p 4

Use as many parameters as you want. In this case -9 is the compression level and -p 4 is the number of cores dedicated to compression. If you run this on a heavily loaded web server, you probably don't want to use all available cores.

Step 4: archive name

> myarchive.tar.gz

Finally, the shell redirection writes the compressed stream to the archive file.
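
To try the four steps end to end without touching real data, something like the sketch below could be used. It assumes GNU find and GNU tar; the directory tree and file names are hypothetical, the gzip fallback is an addition for machines without pigz, and the name tests are grouped with \( \) so that -type f and -exec apply to both patterns.

```shell
#!/bin/sh
# Hypothetical demo tree; all names below are made up for illustration.
WORK=$(mktemp -d)
mkdir -p "$WORK/src/sub"
echo "select 1;" > "$WORK/src/a.sql"
echo "log line"  > "$WORK/src/sub/b.log"
echo "ignored"   > "$WORK/src/c.txt"

# Assumption: use pigz when installed, otherwise gzip (same output format).
if command -v pigz >/dev/null 2>&1; then GZ="pigz -9 -p 4"; else GZ="gzip -9"; fi

# find selects the files, tar streams them to stdout with the leading
# directory stripped by --transform, and the compressor writes the archive.
find "$WORK/src/" -type f \( -name "*.sql" -o -name "*.log" \) -exec \
  tar -P --transform="s@^$WORK/src/@@" -cf - {} + | $GZ > "$WORK/demo.tar.gz"

# List the members to check what ended up in the archive.
LISTING=$(tar -tzf "$WORK/demo.tar.gz")
echo "$LISTING"
```

The listing should contain a.sql and sub/b.log with no leading directory, and c.txt should be absent.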

Authored by user1118764

Updated on May 11, 2020

Comments

  • user1118764
    user1118764 about 4 years

    I normally compress using tar zcvf and decompress using tar zxvf (using gzip due to habit).

    I've recently gotten a quad core CPU with hyperthreading, so I have 8 logical cores, and I notice that many of the cores are unused during compression/decompression.

    Is there any way I can utilize the unused cores to make it faster?

    • Warren Severin
      Warren Severin over 6 years
      The solution proposed by Xiong Chiamiov above works beautifully. I had just backed up my laptop with .tar.bz2 and it took 132 minutes using only one cpu thread. Then I compiled and installed tar from source: gnu.org/software/tar I included the options mentioned in the configure step: ./configure --with-gzip=pigz --with-bzip2=lbzip2 --with-lzip=plzip I ran the backup again and it took only 32 minutes. That's better than 4X improvement! I watched the system monitor and it kept all 4 cpus (8 threads) flatlined at 100% the whole time. THAT is the best solution.
  • user788171
    user788171 about 11 years
    How do you use pigz to decompress in the same fashion? Or does it only work for compression?
  • Mark Adler
    Mark Adler about 11 years
    pigz does use multiple cores for decompression, but only with limited improvement over a single core. The deflate format does not lend itself to parallel decompression. The decompression portion must be done serially. The other cores for pigz decompression are used for reading, writing, and calculating the CRC. When compressing on the other hand, pigz gets close to a factor of n improvement with n cores.
  • Randall Hunt
    Randall Hunt over 10 years
    This is an awesome little nugget of knowledge and deserves more upvotes. I had no idea this option even existed and I've read the man page a few times over the years.
  • Garrett
    Garrett about 10 years
    The hyphen here is stdout (see this page).
  • slhsen
    slhsen almost 10 years
    So as far as I understand files generated by pigz are compatible with gzip right? Can I decompress a file with gzip which had been created with pigz?
  • Mark Adler
    Mark Adler almost 10 years
    Yes. 100% compatible in both directions.
  • CharlesL
    CharlesL about 9 years
    pigz can use multiple cores for compression, but the tar operation is still using only one core. Is there a parallel tar?
  • Mark Adler
    Mark Adler about 9 years
    There is effectively no CPU time spent tarring, so it wouldn't help much. The tar format is just a copy of the input file with header blocks in between files.
  • Admin
    Admin about 9 years
    This is indeed the best answer. I'll definitely rebuild my tar!
  • Admin
    Admin about 9 years
    I just found pbzip2 and mpibzip2. mpibzip2 looks very promising for clusters or if you have a laptop and a multicore desktop computer for instance.
  • oᴉɹǝɥɔ
    oᴉɹǝɥɔ almost 9 years
    This is a great and elaborate answer. It may be good to mention that multithreaded compression (e.g. with pigz) is only enabled when it reads from the file. Processing STDIN may in fact be slower.
  • bovender
    bovender over 8 years
    @ValerioSchiavoni: Not here, I get full load on all 4 cores (Ubuntu 15.04 'Vivid').
  • kasur
    kasur over 8 years
    I have submitted an edit to the answer to indicate that the default number of compression threads equals the number of online processors, as per the official docs, rather than the 8 cores originally specified in the answer. Thanks.
  • Mark Adler
    Mark Adler over 8 years
    The edit seems to have been rejected by someone else, but I will make a similar edit.
  • Lester Cheung
    Lester Cheung over 8 years
    just drop the -f option of tar if you want stdout. ;-)
  • jmiserez
    jmiserez about 8 years
    Beware that redirecting (>) will simply overwrite existing files unless you have set -o noclobber.
  • selurvedu
    selurvedu almost 8 years
    Plus 1 for the xz option. It's the simplest, yet effective approach.
  • Offenso
    Offenso over 7 years
    I prefer tar - dir_to_zip | pv | pigz > tar.file pv helps me estimate, you can skip it. But still it easier to write and remember.
  • einpoklum
    einpoklum over 7 years
    A nice TL;DR for @MaximSuslov's answer.
  • William T Froggard
    William T Froggard about 6 years
    Also worth noting that since pigz probably is going to be network-bound in most situations unless you make it work hard, increasing the block size can dramatically improve performance. By increasing its block size to 524288 (512MB), I'm seeing numbers as high as 80MB/s over 802.11ac wifi. I believe the transfer is still network-bound, so you may see better results over gigabit ethernet. I sometimes see insane 400MB/s spikes, but those are scary and odd, so I'm not sure what to make of them.
  • Mark Adler
    Mark Adler about 6 years
    @WilliamTFroggard The spikes may be due to the burstiness of the deflate algorithm. Uncompressed data is collected until a deflate block can be produced, at which time the block is rapidly generated and emitted.
  • scai
    scai over 5 years
    export XZ_DEFAULTS="-T 0" before calling tar with option -J for xz compression works like a charm.
  • Andre Figueiredo
    Andre Figueiredo over 5 years
    Wouldn't be more performatic to use -l instead of STDIN/STDOUT?
  • Mark Adler
    Mark Adler over 5 years
    I wouldn't know, since "performatic" is not a word.
  • Marc.2377
    Marc.2377 over 4 years
    @NathanS.Watson-Haigh Yes do you. Just enclose the program name and arguments in quotes. man tar says so, as does this.
  • jadelord
    jadelord over 4 years
    In 2020, zstd is the fastest tool to do this. Noticeable speedup while compressing and decompressing. Use tar --use-compress-program=zstdmt -cf OUTPUT_FILE.tar.zst to do so with multi-threading.
  • Arash
    Arash about 4 years
    This returns tar: home/cc/ziptest: Cannot stat: No such file or directory tar: Exiting with failure status due to previous errors
  • Arik
    Arik almost 4 years
    This is actually faster than tar -c --use-compress-program=pigz
  • ruario
    ruario over 3 years
    This answer looks like it was largely lifted directly from my LQ post. A link back might have been nice.