Multithreaded xz, with gzip, pv, and pipes - is this the most efficient I can get?

With the -T0 option you tell xz two things at once: use as many threads as there are cores, and switch to multithreaded mode in the first place. Multithreaded mode also means that xz buffers input data into blocks in memory and then compresses those blocks in parallel.
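The block splitting can also be influenced directly. A sketch using xz's --block-size option (the 8MiB value and the file name somefile are placeholders for illustration):

```shell
# each worker thread compresses one block independently;
# --block-size sets how much input goes into each block
xz -c -T0 --block-size=8MiB somefile > somefile.xz
```

Smaller blocks let threads start earlier on a slow input stream, at a small cost in compression ratio, since blocks are compressed independently.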

After including pigz in my tests, I analyzed the performance step by step; I have a 100M file f100.

$  time xz -c  f100 >/dev/null

real  0m2.658s
user  0m2.573s
sys   0m0.083s
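For reproduction: the answer does not say how f100 was created, but a compressible ~100 MB test file can be generated like this (the repetitive content is my assumption):

```shell
# build a 100 MiB file of repetitive, well-compressible text
yes "some compressible line of text" | head -c 100M > f100
```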

99% of the time is spent compressing on one core. With all four cores activated with -T4 (or -T0):

$  time xz -c -T4 f100 >/dev/null

real  0m0.825s
user  0m2.714s
sys   0m0.284s

Overall result: over 3x faster, almost linear per core. The "user" value has to be divided by 4, because of the way it is reported: it sums CPU time across all cores. "sys" now shows some overhead; real is roughly 1/4 of user plus sys.
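That accounting can be checked with the numbers above: dividing user by 4 and adding sys should land near the wall-clock time.

```shell
# user/4 + sys from the -T4 run above: 2.714/4 + 0.284
awk 'BEGIN { printf "%.2f\n", 2.714/4 + 0.284 }'   # prints 0.96, in the ballpark of the 0.825s real
```

The remaining gap (0.96 vs. 0.825) is CPU time actually spent in parallel rather than idle overhead.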

$  time gzip     -dc f100.gz >/dev/null
$  time pigz -p4 -dc f100.gz >/dev/null

This is 0.5 vs. 0.2 seconds; when I put it all together:

$  time pigz -dc -p4 f100.gz | xz -c -T4 >out.xz

real  0m0.902s
user  0m3.237s
sys   0m0.363s

...the pipeline takes 0.9 s, down from the 0.8 + 0.2 = 1.0 s the two steps would take one after the other.

With multiple files, but not too many, you can get the highest overall parallelism with four shell background processes. Here I use four 25M files instead of one 100M file:

for f in f25-?.gz; do time pigz -p4 -dc "$f" | xz -c -T0 >"$f".xz & done

This seems even slightly faster, at 0.7 s. And it even works without any multithreading, even for xz:

for f in f25-?.gz; do time gzip -dc "$f" | xz -c >"$f".xz & done

Just by setting up four simple quarter pipelines with &, you get 0.8 s, the same as for one 100M file with xz -T4.
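Put together, the multi-file variant looks like this as a complete sketch (the split and gzip setup steps are my assumption, since the text starts from existing f25-?.gz files; `wait` blocks until all four background pipelines are done):

```shell
# hypothetical setup: split a 100M file into four 25M pieces and gzip them
split -b 25M f100 f25-
for f in f25-a?; do gzip "$f"; done

# four decompress|recompress pipelines running in parallel
for f in f25-*.gz; do
  gzip -dc "$f" | xz -c > "$f".xz &
done
wait   # do not continue until every background job has finished
```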

In my scenario, activating multithreading in xz is about as important as parallelizing the whole pipeline; if you can combine this with pigz and/or multiple files, you can even be a bit faster than a quarter of the sum of the single steps.

Author: tu-Reinstate Monica-dor duh

Updated on September 18, 2022

Comments

  • tu-Reinstate Monica-dor duh, about 1 year

    I'm excited to learn that xz now supports multithreading:

    xz --threads=0
    

    But now I want to utilise this as much as possible. For example, to recompress gzips as xz:

    gzip -d -k -c myfile.gz | pv | xz -z --threads=0 - > myfile.xz
    

    This results in my processor being more highly used (~260% CPU to xz, yay!).

    However:

    • I realise that gzip is not (yet) multithreading,
    • I think that either pv or the pipes may be restricting the number of (IO?) threads.

    Is this true and, if so, is there a way to make this more efficient (other than to remove pv)?

    • Admin, about 4 years
      Can you give more details about the scenario? In my answer I point out that the number of .gz files matters a lot.
    • tu-Reinstate Monica-dor duh, about 4 years
      In my case it's actually one large disk image being recompressed. I came across this because of gzip's 32-bit size reference limit: I wanted my compressed files to show the right uncompressed size. There was a significant improvement in recompression, though, by about 40% compared to without threads, mostly during the nulled part of the uncompressed image, where it reached 100MB/s (over USB 3.0) according to pv. I put this down to lower IO wait times on the gzip end due to more "waiting" threads on the xz end, but I wondered whether the pipe and pv were a bottleneck.
    • Admin, about 4 years
      And did you time the difference at all between -T0 and without? According to my trials it does not make a difference; you lose as much as you win.
    • tu-Reinstate Monica-dor duh, about 4 years
      Yes, that's how I got the ~40%, but I didn't keep a copy, sorry.
    • Admin, about 4 years
      I got over 300%. So this shows that the input (the pipe) is the big bottleneck, not xz. Your 40% is in between my 300% and the 0.1% you get when you slow down the dd pipe with count=10 (only 10 bytes per read).
    • Admin, about 4 years
      I did my tests with a 100MB file on a ramdisk. Your Q is very interesting, so it needs precision, and some testing (timing).
    • Alessio, about 4 years
      Have you tried pigz -d instead of gzip -d? That might improve performance a little. pigz -d can't decompress in parallel; however, it can run four threads at a time (one each for reading, writing, checksum calculation, and decompression). See man pigz for details. If it's not packaged for your distribution, you can find pigz at zlib.net/pigz; on Debian etc., sudo apt-get install pigz.
    • Admin, about 4 years
      @cas: "pigz...which can speed up decompression under some circumstances." This must mean multiple files on a good system. "Specially prepared deflate streams" seem to be the workaround.
    • Admin, about 4 years
      @cas I installed pigz and tested - this really is a faster decompression, even though the algorithm itself runs on one core, as you explain. Overall, decompression is only about 1/4 of the work, so pigz here is only secondary. Thanks to your hint I made some tests, see my answer.