Optimal way to combine tar.gz files quickly


TLDR: you can usually just concatenate them

The file format used by gzip is designed so that concatenating two or more compressed files and decompressing the result gives you the same data as concatenating the uncompressed versions; see https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files and https://stackoverflow.com/questions/16715484/can-multiple-gz-files-be-combined-such-that-they-extract-into-a-single-file
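A minimal sketch of that gzip property, assuming standard gzip/gunzip, cat and cmp are available. The scratch directory and the file names (a.txt, b.txt, both.gz) are made up for illustration:

```shell
# Work in a scratch directory so nothing outside it is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'first\n'  > a.txt
printf 'second\n' > b.txt

# Compress each file separately; -c writes to stdout and keeps the input file.
gzip -c a.txt > a.gz
gzip -c b.txt > b.gz

# Plain byte-level concatenation of the two compressed files.
cat a.gz b.gz > both.gz

# Decompress the combined file and compare it with the concatenated plain files.
gunzip -c both.gz > both.txt
cat a.txt b.txt > expected.txt
cmp both.txt expected.txt && echo "identical"
```

No recompression happens here: cat only copies bytes, which is why this is fast even for large inputs.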

Somewhat similarly, the tar format was originally designed so that you could just add entries to the end of an archive. This was effectively required because '(t)ape (ar)chive' was designed for, and actually used with, magnetic tape for backup and interchange, and the magnetic tape drives of the 1950s-1980s (roughly) could not safely 'rewrite' (update) existing data, only add to the end. (Those drives could separate logical files on a tape using a 'tape mark', but Unix systems didn't support metadata, aka labels, on magtape, and managing large numbers of tape files by physical numeric position only was a PITA, so the tar approach of adding to an existing archive was much preferred.)

In recent years adding to an archive this way has become much less common, and GNU tar now doesn't support it by default: it stops reading at the end-of-archive marker of the first piece. You have to specify -i (or long form --ignore-zeros) and then it works fine:

$ printf 'ONEONEONE%90d\n' {0..99999} >file1
$ printf 'TWOTWOTWO%90d\n' {0..199999} >file2
$ ll
total 29300
-rw-r--r--. 1 dthomps users 10000000 Sep  9 02:14 file1
-rw-r--r--. 1 dthomps users 20000000 Sep  9 02:15 file2
$ tar -czf tar1.tgz file1
$ tar -czf tar2.tgz file2
    # or tar -cf - file1 |gzip >tar1.tgz and similarly for 2, see below
$ cat tar2.tgz tar1.tgz >combined.tgz
$ tar -tvzif combined.tgz
-rw-r--r-- dthomps/users 20000000 2016-09-09 02:15 file2
-rw-r--r-- dthomps/users 10000000 2016-09-09 02:14 file1
  # or gunzip <combined.tgz |tar -tvif - see below
$
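To see what -i changes, here is a small sketch that counts the entries listed with and without the flag. It assumes GNU tar (other implementations may behave differently), uses tiny stand-in files rather than the large ones in the session above, and the variable names are only illustrative:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Tiny stand-ins for file1 and file2.
printf 'one\n' > file1
printf 'two\n' > file2

tar -czf tar1.tgz file1
tar -czf tar2.tgz file2
cat tar2.tgz tar1.tgz > combined.tgz

# Count listed entries; tr strips the padding some wc versions add.
without_flag=$(tar -tzf combined.tgz | wc -l | tr -d ' ')
with_flag=$(tar -tz --ignore-zeros -f combined.tgz | wc -l | tr -d ' ')

echo "entries listed without --ignore-zeros: $without_flag"
echo "entries listed with --ignore-zeros:    $with_flag"
```

With GNU tar, the count without the flag should cover only the first piece, because tar stops at that piece's end-of-archive marker; with the flag both entries should appear.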

Older tars may support concatenating archives by default (no -i); if I have time to spin up some of my old test systems later I'll update. However they usually don't support integrated -z compression like gtar, so you need to use the tar cf - | gzip > and gunzip < | tar -xf - forms.
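The pipe forms can be sketched as follows. This is only an illustration under assumptions: it still passes --ignore-zeros on the reading side, which GNU tar accepts but an older tar may neither need nor understand, and the file contents are placeholders:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\n' > file1
printf 'two\n' > file2

# Writing: tar emits the archive on stdout (-f -) and gzip compresses it.
tar -cf - file1 | gzip > tar1.tgz
tar -cf - file2 | gzip > tar2.tgz
cat tar2.tgz tar1.tgz > combined.tgz

# Reading: gunzip decompresses to stdout and tar reads the archive from stdin.
listing=$(gunzip < combined.tgz | tar -t --ignore-zeros -f -)
printf '%s\n' "$listing"
```

Swapping -t for -x in the last pipeline extracts instead of listing.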

If you use relative paths for files in the archive, as is common and preferred today, then when you extract from the concatenated result all the entries (or all the selected ones) are extracted relative to the same new directory. So make sure you create each archive 'piece' with relative paths that work together as desired. If you want a file in an appended piece to replace one in the main piece, use the same relative path/name; if you want to create distinct files, use distinct relative paths/names.
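A sketch of how the relative paths interact, assuming GNU tar with its default behaviour of overwriting existing files on extraction. The directory and file names (main, extra, shared.txt, and so on) are hypothetical:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir main extra out

printf 'from main\n'  > main/shared.txt
printf 'only main\n'  > main/main-only.txt
printf 'from extra\n' > extra/shared.txt
printf 'only extra\n' > extra/extra-only.txt

# -C changes directory before adding files, so the stored names are
# relative (shared.txt, not main/shared.txt).
tar -czf main.tgz  -C main  shared.txt main-only.txt
tar -czf extra.tgz -C extra shared.txt extra-only.txt

# The appended piece goes last, so its entries are extracted last.
cat main.tgz extra.tgz > combined.tgz

# Extract everything into one directory.
tar -xz --ignore-zeros -f combined.tgz -C out

ls out
cat out/shared.txt
```

Because both pieces store shared.txt under the same relative name, the copy from the later piece is the one left on disk, while main-only.txt and extra-only.txt sit side by side.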


Author: Andrew

Updated on September 18, 2022

Comments

  • Andrew, over 1 year ago

    I am looking for a way to combine multiple tar.gz files quickly.

    The use case is a client clicks on a download button and proceeds to have a tar.gz file delivered to them. There is a configuration option to add additional information to the outbound download in the GUI of our application. If this option is selected I am going to need to combine additional tar.gz files on to the outbound download.

    I am working with a lot of data here. The additional tar.gz files are over a GB when uncompressed. In addition, the default tar.gz file that is always delivered can be over 10 GB when uncompressed and can contain over 100 files in it. Due to the large sizes of the data I am working with, it is stored in a compressed format (tar.gz) on disk.

    I am looking to implement this mechanism either in Bash Script or in Java.

  • Evan Hu, about 2 years ago
    You are really a genius. You just gave me a great way to reduce our build time dramatically. Thank you very much. I tested it on CentOS with kernel version 3.10.0-1160.59.1.el7.x86_64; the -i option still exists in the tar command, and it is this option that did the magic.