Why would I tar a single file?

Solution 1

Advantages of using .tar.gz instead of .gz are that

  • tar stores more meta-data (UNIX permissions etc.) than gzip.
  • the setup can more easily be expanded to store multiple files
  • .tar.gz files are very common, whereas files that are only gzipped may puzzle some users (cf. MelBurslan's comment).
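
As a rough illustration of the metadata point, here is a minimal sketch. The file name db.dump and the mode 640 are just illustrative choices, and GNU or BSD tar is assumed. It archives a file, lists what tar recorded, and checks that the permissions come back on extraction:

```shell
# Work in a throwaway directory so nothing real is touched.
work=$(mktemp -d) && cd "$work"

seq 1 1000 > db.dump          # stand-in for the real dump
chmod 640 db.dump             # illustrative mode, chosen for this sketch

tar -cf db.tar db.dump
tar -tvf db.tar               # shows mode, owner/group, size, mtime, name

# Extract elsewhere; -p asks tar to restore the recorded permissions.
mkdir restored
tar -xpf db.tar -C restored
restored_mode=$(ls -l restored/db.dump | cut -c1-10)
echo "mode after extraction: $restored_mode"
```

A gzip stream that travels through a pipe or a redirect carries none of this, so the receiving side ends up with whatever its umask dictates.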

The overhead of using tar is also very small.
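
If you want to check the overhead yourself, something like the following should do. It is only a sketch: db.dump is a generated stand-in file, and the exact numbers will vary with the data and the tar implementation.

```shell
work=$(mktemp -d) && cd "$work"

seq 1 100000 > db.dump                 # roughly 590 kB of sample data

gzip -c db.dump > db.dump.gz           # gzip only
tar -czf db.dump.tar.gz db.dump        # tar, then gzip

gz_size=$(wc -c < db.dump.gz)
tgz_size=$(wc -c < db.dump.tar.gz)
overhead=$(( tgz_size - gz_size ))
echo "gzip only: $gz_size bytes, tar+gzip: $tgz_size bytes, difference: $overhead bytes"
```

The tar header and the zero-filled padding blocks compress very well, which is why the difference stays small.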

If it is not really needed, I still do not recommend tarring a single file. There are many useful tools that can access compressed single files directly (such as zcat and zgrep, with equivalents for bzip2 and xz).

Solution 2

You are actually asking only half of the question. The other half is, "Why would I compress a tar file with gzip?" And the answer is not just that gzip makes the file smaller (in most cases):

tar:

  • stores filename and other metadata: mode, owner ID, group ID, filesize, modification time
  • stores a checksum (for the header only)

gzip:

  • can store the original filename, but that is optional
  • has a CRC-32 checksum over the original data
  • it compresses the file

With only tar you cannot be sure your data was not corrupted. With only gzip you cannot restore the user/group ID, and possibly not the modification time or the original filename (gzip can record both, but only restores them when asked, e.g. with -N).
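
The checksum difference can be demonstrated with a small experiment. This is a sketch under a few assumptions: flip_byte is a hypothetical helper written for this example (not a standard tool), db.dump is generated sample data, and the offsets rely on the standard layouts (a 512-byte tar header before the file data, and the CRC-32 in the last 8 bytes of a gzip file).

```shell
work=$(mktemp -d) && cd "$work"
seq 1 1000 > db.dump

# Hypothetical helper: invert the single byte at offset $2 of file $1, in place.
flip_byte() {
    old=$(od -An -tu1 -j "$2" -N1 "$1" | tr -d ' \n')
    printf "\\$(printf '%03o' $(( old ^ 255 )))" |
        dd of="$1" bs=1 seek="$2" conv=notrunc 2>/dev/null
}

tar -cf db.tar db.dump
gzip -c db.dump > db.dump.gz

# Damage the first byte of file data inside the tar (right after the header).
flip_byte db.tar 512
# Damage the first byte of the CRC-32 stored at the end of the gzip file.
flip_byte db.dump.gz $(( $(wc -c < db.dump.gz) - 8 ))

tar_status=0;  tar -tf db.tar > /dev/null 2>&1 || tar_status=$?
gzip_status=0; gzip -t db.dump.gz 2> /dev/null || gzip_status=$?

echo "tar listing exit status:  $tar_status"    # tar does not notice payload damage
echo "gzip -t exit status:      $gzip_status"   # gzip reports a CRC mismatch
```

tar only verifies its headers, so the damaged payload goes unnoticed, while gzip -t fails on the mismatch.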

The combination is more powerful than either command/format on its own, because they complement each other's features.

Solution 3

There is a quite big advantage to using only-gzipped text files - the contents can be directly accessed with command-line tools like less, zgrep, zcat.
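
For example (a minimal sketch with a made-up notes.txt; zgrep and zcat ship with gzip on most systems):

```shell
work=$(mktemp -d) && cd "$work"

printf 'alpha\nbeta\ngamma\n' > notes.txt
gzip notes.txt                        # replaces notes.txt with notes.txt.gz

# Search the compressed file without unpacking it first.
matches=$(zgrep -c 'et' notes.txt.gz)

# Stream the contents; reading from stdin also works with BSD zcat.
first_line=$(zcat < notes.txt.gz | head -n 1)

echo "lines matching 'et': $matches"
echo "first line: $first_line"
```

The same commands do not work on a .tar.gz, because the decompressed stream starts with the tar header rather than with the text.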

Solution 4

I would say it's likely that people just don't realise they can use gzip/bzip2/xz without tar, possibly because they come from a DOS/Windows background where it is normal for compression and archiving to be integrated in a single format (ZIP, RAR, etc.).

While there may be slight advantages to using tar in some situations due to the storage of metadata or the ability to add extra files, there are also disadvantages. With a plain gzip/bzip2/xz file you can decompress it and pipe the decompressed data straight to another tool (such as your database) without ever having to store the decompressed data as a file on disk. With a tarball this is harder.
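
A sketch of the difference, with wc -l standing in for the database client (the file names are illustrative). The tar variant uses the -O option, which GNU tar and bsdtar both provide, to write the member to stdout:

```shell
work=$(mktemp -d) && cd "$work"
seq 1 1000 > db.dump

gzip -c db.dump > db.dump.gz
tar -czf db.dump.tar.gz db.dump
rm db.dump                             # from here on, only compressed copies exist

# Plain gzip: decompress straight into the consumer.
lines_from_gz=$(gzip -dc db.dump.gz | wc -l | tr -d ' ')

# Tarball: you additionally need -O and the name of the member inside the archive.
lines_from_tgz=$(tar -xzOf db.dump.tar.gz db.dump | wc -l | tr -d ' ')

echo "from .gz: $lines_from_gz lines, from .tar.gz: $lines_from_tgz lines"
```

Neither pipeline writes the decompressed dump to disk; the tarball version just takes one more option and the member name.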

Solution 5

There is an important difference that could make using tar important under some circumstances: Besides the "metadata" that @jofel mentioned in his answer, tar records the filename in the archive. When you extract it, you get the original filename regardless of what the archive is called.

In your case the tar archive and the file it contains have the related names db.dump.tar.gz and db.dump, but suppose you rename the tar file to 20-Apr-16.dump.tgz, or whatever. Untar this with tar xvfz, and you get db.dump. For comparison, gunzip 20-Apr-16.dump.gz and you've got 20-Apr-16.dump. (Edit: as pointed out in the comments, gzip also records the filename, but it is not normally used when decompressing.) A tar archive can also contain a relative pathname that puts the extracted file in a subdirectory.
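
The renaming scenario can be reproduced like this. It is a sketch that assumes GNU gzip for the -N option; the directory names from_tar, from_gzip and from_gzip_N are made up for the example.

```shell
work=$(mktemp -d) && cd "$work"
seq 1 1000 > db.dump

tar -czf db.dump.tar.gz db.dump
gzip db.dump                           # replaces db.dump with db.dump.gz

# Rename both archives, as a recipient or a backup script might.
mv db.dump.tar.gz 20-Apr-16.dump.tgz
mv db.dump.gz     20-Apr-16.dump.gz

mkdir from_tar from_gzip from_gzip_N
tar -xzf 20-Apr-16.dump.tgz -C from_tar

cp 20-Apr-16.dump.gz from_gzip/
gzip -d from_gzip/20-Apr-16.dump.gz        # default: name derived from the archive

cp 20-Apr-16.dump.gz from_gzip_N/
gzip -dN from_gzip_N/20-Apr-16.dump.gz     # -N: use the name stored in the header

name_from_tar=$(ls from_tar)
name_from_gzip=$(ls from_gzip)
name_from_gzip_N=$(ls from_gzip_N)
echo "tar: $name_from_tar, gzip: $name_from_gzip, gzip -N: $name_from_gzip_N"
```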

Your use case will dictate whether this kind of filename persistence is needed, or even wanted, or is actually undesirable. But certainly, regardless of compression, a tar archive travels differently from a regular file.

gardenhead
Author by

gardenhead

Updated on September 18, 2022

Comments

  • gardenhead
    gardenhead over 1 year

    At my company, we download a local development database snapshot as a db.dump.tar.gz file. The compression makes sense, but the tarball only contains a single file (db.dump).

    Is there any point to archiving a single file, or is .tar.gz just such a common idiom? Why not just .gz?

    • plugwash
      plugwash about 8 years
      All tarring a single file will do is add a few metadata blocks to the start and end of the file. The actual file data passes through tar to the compressor untouched. So for a large file the size difference between plain compression and tarring will be negligible.
    • Pharap
      Pharap about 8 years
      In the past when trying various compression methods I found .tar.gz to be superior to most other common methods. I recall it was superior to just .tar but cannot remember if it was better than just .gz. Ironically, Windows' .cab format was the best of the methods I tried, which was very unexpected.
    • gardenhead
      gardenhead about 8 years
      @Pharap tar is not a compression algorithm, it's an archiving format
    • Pharap
      Pharap about 8 years
      @gardenhead Well that would explain why it didn't work very well.
  • gardenhead
    gardenhead about 8 years
    I didn't consider the meta-data aspect. Very good point
  • bgStack15
    bgStack15 about 8 years
    If I see a .gz, my first instinct is to tar -zxf foo.gz. Remembering that gzip is even a command takes a few more seconds.
  • Brandon
    Brandon about 8 years
    @bgStack15 FWIW you don't need the z (or the - for that matter), most modern tars will automatically detect the file needs to be decompressed.
  • hyde
    hyde about 8 years
    With GNU tar, it takes just the -O switch to output to stdout, so I wouldn't say it is much harder!
  • underscore_d
    underscore_d about 8 years
    interesting point, but the question is about a database snapshot, unlikely to be a text file, and not only-gzipped.
  • underscore_d
    underscore_d about 8 years
    The first paragraph seems plausible enough for files using the tgz extension. However, the OP's case uses tar.gz - and if these hypothetical ex-Win/DOS users are anything like I was, the first thing they say when looking at such a file is: 'Why does it have 2 extensions?'. Then they google it and quickly get the answer, which specifically explains that tar and compression are distinct. ;-)
  • psusi
    psusi about 8 years
    gzip also records the original filename.
  • Miles
    Miles about 8 years
    Yup. The name is optional in the gzip header—obviously there won't be one if you compressed the streaming output of a command—and most tools won't restore it by default (for instance, you have to use gzip --name explicitly when decompressing), but you don't have to use tar to get filename persistence.
  • gardenhead
    gardenhead about 8 years
    Thanks for clarifying that! When I was reading the tar wikipedia page, I misunderstood the description to mean that the checksum was for the whole file.
  • alexis
    alexis about 8 years
    Thanks for pointing this out, I hadn't known that. Still, since that's not the default behavior, the point stands: Distributing a file in tar format preserves the original filename (and possibly the relative path), without intervention of the recipient. Distributing a (g)zipped file doesn't.
  • Ross Ridge
    Ross Ridge about 8 years
    By default gzip will store the original file name and time stamp. You can use the -N option when decompressing to restore them.
  • YoloTats.com
    YoloTats.com about 8 years
    @RossRidge thanks, I removed the text about the original file name again.
  • Dewi Morgan
    Dewi Morgan about 8 years
    This feels to me like the correct answer. I'd also add a few more reasons, which you might want to edit in if you agree. 1) There's no additional cost to the admin for .tgz over .tar or .gz alone: they're all just one command. 2) Admins back up, copy, relocate and move a LOT of files, for a lot of different reasons; DB backups are just one of these. They can use the same workflow, tools and commands whether backing up one file or many, so why special-case the gzip command syntax for the case where there is one file?
  • CodesInChaos
    CodesInChaos about 8 years
    What is the advantage of these tools over simply piping the output of a decompressor into the plain tools?