What is the advantage of using 'tar' today?
Solution 1
Part 1: Performance
Here is a comparison of two separate workflows and what they do.
You have a file on disk, blah.tar.gz, which is, say, 1 GB of gzip-compressed data that, when uncompressed, occupies 2 GB (a compression ratio of 50%).
The way that you would create this, if you were to do archiving and compression separately, would be:
tar cf blah.tar files ...
This would result in blah.tar, which is a mere aggregation of the files ... in uncompressed form.
Then you would do:
gzip blah.tar
This would read the contents of blah.tar from disk, compress them through the gzip compression algorithm, write the result to blah.tar.gz, then unlink (delete) the file blah.tar.
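The two steps above can also be collapsed into one: modern tar implementations run the compressor for you through a pipe, so no intermediate blah.tar ever touches the disk. A minimal sketch, using a throwaway directory so it is safe to run:

```shell
set -e
work=$(mktemp -d)
mkdir "$work/files"
echo "hello" > "$work/files/a.txt"
# Single step: the z flag pipes the tar stream through gzip internally,
# so no intermediate blah.tar is ever written to disk.
tar czf "$work/blah.tar.gz" -C "$work" files
# Round-trip check: extract into a fresh directory
mkdir "$work/out"
tar xzf "$work/blah.tar.gz" -C "$work/out"
```
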
Now, let's decompress!
Way 1
You have blah.tar.gz, one way or another.
You decide to run:
gunzip blah.tar.gz
This will:
- READ the 1 GB compressed data contents of blah.tar.gz.
- PROCESS the compressed data through the gzip decompressor in memory.
- As the memory buffer fills up with "a block" worth of data, WRITE the uncompressed data into the file blah.tar on disk, and repeat until all the compressed data is read.
- Unlink (delete) the file blah.tar.gz.
Now, you have blah.tar on disk, which is uncompressed but contains one or more files within it, with very low data-structure overhead (a 512-byte header per file, plus padding of the file data to 512-byte boundaries).
You run:
tar xvf blah.tar
This will:
- READ the 2 GB of uncompressed data contents of blah.tar and the tar file format's data structures, including information about file permissions, file names, directories, etc.
- WRITE the 2 GB of data plus the metadata to disk. This involves translating the data structure / metadata information into creating new files and directories on disk as appropriate, or rewriting existing files and directories with new data contents.
The total data we READ from disk in this process was 1 GB (for gunzip) + 2 GB (for tar) = 3 GB.
The total data we WROTE to disk in this process was 2 GB (for gunzip) + 2 GB (for tar) + a few bytes for metadata = about 4 GB.
Way 2
You have blah.tar.gz, one way or another.
You decide to run:
tar xvzf blah.tar.gz
This will:
- READ the 1 GB compressed data contents of blah.tar.gz, a block at a time, into memory.
- PROCESS the compressed data through the gzip decompressor in memory.
- As the memory buffer fills up, pipe that data, in memory, through to the tar file format parser, which reads the information about metadata, etc. and the uncompressed file data.
- As the memory buffer fills up in the tar file parser, WRITE the uncompressed data to disk, creating files and directories and filling them with the uncompressed contents.
The total data we READ from disk in this process was 1 GB of compressed data, period.
The total data we WROTE to disk in this process was 2 GB of uncompressed data + a few bytes for metadata = about 2 GB.
Notice that the amount of disk I/O in Way 2 is identical to the disk I/O performed by, say, the Zip or 7-Zip programs, adjusting for any differences in compression ratio.
And if compression ratio is your concern, use the xz compressor to encapsulate tar, and you have an LZMA2-compressed tar archive, which is just as efficient as the most advanced algorithm available to 7-Zip :-)
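As a sketch of the xz route (assuming GNU tar, whose J flag selects xz, and an xz binary on PATH; other tars can pipe through xz explicitly):

```shell
set -e
work=$(mktemp -d)
mkdir "$work/src"
echo "data" > "$work/src/f.txt"
# GNU tar: J filters the archive stream through xz (LZMA2)
tar cJf "$work/blah.tar.xz" -C "$work" src
# Portable equivalent: write the tar stream to stdout, compress it yourself
tar cf - -C "$work" src | xz > "$work/blah2.tar.xz"
# Extraction mirrors creation
mkdir "$work/out"
tar xJf "$work/blah.tar.xz" -C "$work/out"
```
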
Part 2: Features
tar stores Unix permissions within its file metadata, and is very well known and tested for successfully packing up a directory with all kinds of different permissions, symbolic links, etc. There are more than a few instances where one might need to glob a bunch of files into a single file or stream, but not necessarily compress it (although compression is useful and often used).
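A quick sketch of the permission and symlink handling (standard tar flags; -p asks tar to restore the recorded modes exactly on extraction):

```shell
set -e
work=$(mktemp -d)
mkdir "$work/tree"
echo "secret" > "$work/tree/key"
chmod 600 "$work/tree/key"      # restrictive permissions, recorded in the archive
ln -s key "$work/tree/link"     # a symbolic link, stored as a link (the default)
tar cf "$work/tree.tar" -C "$work" tree
mkdir "$work/out"
# -p: restore the recorded permission bits exactly, ignoring umask
tar xpf "$work/tree.tar" -C "$work/out"
```
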
Part 3: Compatibility
Many tools are distributed in source or binary form as .tar.gz or .tar.bz2, because it is a "lowest common denominator" file format: much like most Windows users have access to .zip or .rar decompressors, most Linux installations, even the most basic, will have access to at least tar and gunzip, no matter how old or pared down. Even Android firmwares have access to these tools.
New projects targeting audiences running modern distributions may very well distribute in a more modern format, such as .tar.xz (using the Xz (LZMA) compression format, which compresses better than gzip or bzip2), or .7z, which is similar to the ZIP or RAR file formats in that it both compresses and specifies a layout for encapsulating multiple files into a single file.
You don't see .7z used more often for the same reason that music isn't sold from online download stores in brand new formats like Opus, or video in WebM. Compatibility with people running ancient or very basic systems.
Solution 2
This has been answered on Stack Overflow.
bzip and gzip work on single files, not groups of files. Plain old zip (and pkzip) operate on groups of files and have the concept of the archive built-in.
The *nix philosophy is one of small tools that do specific jobs very well and can be chained together. That's why there's two tools here that have specific tasks, and they're designed to fit well together. It also means you can use tar to group files and then you have a choice of compression tool (bzip, gzip, etc).
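That chaining looks like this in practice: tar only does the grouping, and the compressor is whatever you pipe the stream through (this sketch assumes gzip and bzip2 are on PATH):

```shell
set -e
work=$(mktemp -d)
mkdir "$work/src"
echo "hi" > "$work/src/x"
# tar groups the files; the pipe decides the compression
tar cf - -C "$work" src | gzip  > "$work/src.tar.gz"
tar cf - -C "$work" src | bzip2 > "$work/src.tar.bz2"
# The same chain in reverse for extraction
mkdir "$work/out"
bzip2 -dc "$work/src.tar.bz2" | tar xf - -C "$work/out"
```
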
Many tools are distributed in source or binary form as .tar.gz or .tar.bz2, because it is a "lowest common denominator" file format: much like most Windows users have access to .zip or .rar decompressors, most Linux installations, even the most basic, will have access to at least tar and gunzip, no matter how old or pared down. Even Android firmwares have access to these tools.
New projects targeting audiences running modern distributions may very well distribute in a more modern format, such as .tar.xz
(using the Xz (LZMA) compression format, which compresses better than gzip or bzip2), or .7z, which is similar to the ZIP or RAR file formats in that it both compresses and specifies a layout for encapsulating multiple files into a single file.
You don't see .7z used more often for the same reason that music isn't sold from online download stores in brand new formats like Opus, or video in WebM. Compatibility with people running ancient or very basic systems is important.
Solution 3
Tar has a rich set of operations and modifiers that know all about Unix file systems. It knows about Unix permissions, about the different times associated with files, about hard links, about soft links (and about the possibility that symbolic links could introduce cycles in the filesystem graph), and it allows you to specify several different ways of managing all this data.
Do you want the extracted data to preserve file access times? Tar can do that. To preserve permissions? Tar can do that.
Do you want to preserve symbolic links as symbolic links? Tar does that by default. Want to copy the target instead? Tar can do that.
Do you want to be sure hardlinked data is only stored once (that is, to do the right thing)? Tar does that.
Do you want to handle sparse files well? Tar can do that.
Do you want uncompressed data (why?)? Tar can do that. To compress with gzip? Tar can do that. With bzip2? Tar can do that. With arbitrary external compression programs? Tar can do that.
Do you want to write or recover to/from a raw device? Tar's format handles that fine.
Do you want to add files to an existing archive? Tar can do that. To diff two archives to see what changed? Tar can do that. To update only those parts of the archive that have changed? Tar can do that.
Do you want to be sure you don't archive across more than one filesystem? Tar can do that.
Do you want to grab only files that are newer than your last backup? Tar can do that.
Do you want to preserve user and group names or numbers? Tar can do either one.
Do you need to preserve device nodes (like the files in /dev) so that after extraction, the system will run correctly? Tar can do that.
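A few of the behaviours above as concrete invocations. This is a sketch, not an exhaustive list, and the option names are GNU tar's:

```shell
set -e
work=$(mktemp -d)
mkdir "$work/src"
echo "one" > "$work/src/a"
tar cpf "$work/backup.tar" -C "$work" src         # -p: record permissions
echo "two" > "$work/src/b"
tar rf  "$work/backup.tar" -C "$work" src/b       # -r: append to an existing archive
tar tvf "$work/backup.tar" > "$work/listing.txt"  # -t: list contents without extracting
tar cf  "$work/same-fs.tar" --one-file-system -C "$work" src  # don't cross filesystems
```

Other flags mentioned above include -S (sparse files), -d (diff against the filesystem), -u (update changed members), --newer (incremental grabs) and --numeric-owner (numeric uid/gid instead of names).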
Tar has been evolving to handle lots and lots of use cases for decades and really does know a lot about the things people want to do with Unix filesystems.
Solution 4
You confuse the two distinct processes of archiving and compression.
Reasons for using an archiver
One reason to use archiving without compression is, for instance, if a bunch of files is copied from one host to another. A command like the following
tar cf - some_directory | ssh host "(cd ~/somewhere && tar xf -)"
can speed things up considerably. If I know that the files cannot be compressed, or if SSH is set up with compression, it can save considerable CPU time. Sure, one can use a more modern compressing tool with an archiving function and turn off the compression. The advantage of tar is that I can expect it to be available on every system.
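The same pattern works locally, which makes it easy to try: one tar streams the tree to stdout and a second tar recreates it elsewhere, with no archive file ever written. A minimal sketch of the pipeline above, minus the ssh hop:

```shell
set -e
src=$(mktemp -d)
dst=$(mktemp -d)
echo "payload" > "$src/file.txt"
# Stream the tree out of one directory and into another; the archive
# exists only inside the pipe, never on disk.
tar cf - -C "$src" . | (cd "$dst" && tar xf -)
```
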
Reasons for using an archiver with gzip compression
One reason that I use tar with gzip is: speed! If I want to transfer a few GiB of text files from one place to another, I don't care about squeezing out the last bytes, since the compression is only used for transit, not for long-term storage. In those cases I use gzip, which doesn't max out the CPU (in contrast to 7-Zip, for instance), which means that I'm I/O bound again and not CPU bound. And again: gzip can be considered available everywhere.
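If speed matters even more than gzip's default ratio, the compression level can be dialled down in the pipeline. A sketch with gzip -1 (fastest level), on some cheaply generated repetitive text:

```shell
set -e
work=$(mktemp -d)
mkdir "$work/txt"
yes "some repetitive log line" | head -n 10000 > "$work/txt/log"
# gzip -1 trades a little compression ratio for speed; fine when the
# compression is only for transit, not long-term storage
tar cf - -C "$work" txt | gzip -1 > "$work/txt.tgz"
mkdir "$work/out"
gzip -dc "$work/txt.tgz" | tar xf - -C "$work/out"
```
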
Reasons for using tar in favour of scp, rsync, etc.
It beats scp if you have a lot of small files to copy (for example, a mail directory with hundreds of thousands of files). rsync, awesome as it is, might not be available everywhere. Further, rsync only really pays off if part of the files, or an older version, is already present on the destination. For the initial copy, tar is the fastest, with compression or without, depending on the actual data.
Solution 5
Adding to the other good answers here, I prefer the combination tar + gzip|bzip2|xz mainly because these compressed files are like streams, and you can pipe them easily.
Say I need to uncompress a file available on the internet. With either zip or rar formats, I have to download it first and then uncompress it. With tar.{gz,bz2,xz} I can download and uncompress in the same step, without needing to have the compressed archive physically on disk:
curl -s http://example.com/some_compressed_file.tar.gz | tar zx
This will leave just the uncompressed files on my disk, and will speed up the whole process, because I am not wasting time first downloading the entire file and only uncompressing it after the download finishes. Instead, I am uncompressing it while it is downloading. You cannot do this with zip or rar files.
MarcusJ
Updated on September 18, 2022
Comments
-
MarcusJ over 1 year
I know that tar was made for tape archives back in the day, but today we have archive file formats that both aggregate files and perform compression within the same logical file format. Questions:
1. Is there a performance penalty during the aggregation/compression/decompression stages for using tar encapsulated in gzip or bzip2, when compared to using a file format that does aggregation and compression in the same data structure? Assume the runtime of the compressor being compared is identical (e.g. gzip and Deflate are similar).
2. Are there features of the tar file format that other file formats, such as .7z and .zip, do not have?
3. Since tar is such an old file format, and newer file formats exist today, why is tar (whether encapsulated in gzip, bzip2 or even the new xz) still so widely used today on GNU/Linux, Android, BSD, and other such UNIX operating systems, for file transfers, program source and binary downloads, and sometimes even as a package manager format?
-
Griffin about 11 years
It's a very good question. I too highly dislike the whole operation of installing software with either odd names or that I can't simply apt-get. The only reason I can see it getting downvoted is that this is more of a question for Unix/Linux. However, SU should accept this.
-
user1686 about 11 years
@Griffin: The question is not about installing software from tarballs. It is about using the Tar format (e.g. over Zip or RAR)
-
allquixotic about 11 years
I disagree that it "wastes time". If you mean performance, there is no actual performance penalty for tar, as the format is very efficient. If you mean it wastes your time, I don't see how tar xvzf is harder than 7z -x ...
-
MarcusJ about 11 years
Allquixotic, I mean that you have to extract the archive twice: the first time to extract the tar, and the second to extract from the tar.
-
psusi about 11 years
He seems to be lamenting the fact that tar does not store a catalog at the start, so GUI compression tools that want to list the contents prior to extracting have to decompress the whole tar just to list the contents, then decompress it again when extracting.
-
Kruug about 11 years
@MarcusJ Usually, the tar.xx formats have a one-line solution. If you have tar.gz, for example, you could use tar -xzf <file>.tar.gz and it will decompress and extract all at once.
-
MarcusJ about 11 years
psusi, no no no, I'm talking about the fact that tar needs a separate compressor and decompressor, so basically when you open a tar.gz, you need to extract BOTH the gz file to get the tar, then have to extract the tar file, instead of merely decompressing something like a 7z in one step. It takes more CPU power to do it like this, and seems redundant.
-
psusi about 11 years
@MarcusJ, both steps have to be done either way, so it takes no more CPU power.
-
MarcusJ about 11 years
Not to say you're wrong or anything, but how would a 7z require both steps? It would merely load the file, then decompress whatever was selected to be decompressed. :/
-
mike3996 about 11 years
@MarcusJ: you think 7z somehow magically knows where each file starts in an archive? Besides, the usual compression algorithms (gzip, bzip2) work by streaming the content: no need to complete the first stage 100% before the next.
-
psusi about 11 years
Which step do you think it doesn't have to do? It has to parse the file format, and it has to decompress the content. The difference is really just in the order the two are done. tar decompresses the content first, then parses the archive. 7zip parses the archive, then decompresses the file content (the metadata is uncompressed).
-
allquixotic about 11 years
Also @MarcusJ you seem to be confusing two different things: when you do tar xvzf, the uncompressed data is not written to hard disk in .tar format! You're right that if you ran gunzip blah.tar.gz and then tar xf blah.tar, it would write the data to disk twice (once as a .tar and again as files in the filesystem), but nobody actually does it that way. tar xzf uses a UNIX pipe (basically a memory copy) to transfer the uncompressed data from gzip (or whatever compressor) to tar, so the data is not written to disk in .tar format.
-
PPC about 11 years
One thing I know is that tar (especially compressed) behaves awfully when it comes to data corruption. The small redundancy / recovery data added by modern formats is worth gold.
-
user239558 about 11 years
tar is superior for streaming. Unlike zip, you don't have to wait for the central directory. For archiving, this can also be a disadvantage (slower to list contents). tar xvzf will also automatically use two processes/cores, so it's not inefficient to split the two processes.
-
André Paramés about 11 years
@PPC: that's what PAR files are for. Tar is a Unix utility; as such, error correction is best left to dedicated tools.
-
Thomas Andrews about 11 years
Hmm, tar keeps soft links. I can recall back in the day doing "tar cf - | ( cd /somewhere/else ; tar xf -)" rather a lot, because "cp" didn't have a flag to respect soft links. I don't know if it does today; if I encountered the problem, I'd probably just use tar this way again.
-
Keith Thompson about 11 years
@Kruug: GNU tar automatically applies the z (or j, or J) flag: tar xf foo.tar.gz. It does this based on the actual content of the file, not its name, so it still works even if a gzipped tar file is named foo.tar.
-
o0'. about 10 years
@psusi however, if you want to extract just a single file, AFAIK tar has to decompress the whole archive first, while another format could decompress only the target file instead.
-
Griffin about 11 years
"It's free software" - so are a lot of them. "It's good at what it does" - hardly, compared to other stuff. "It's well documented and has many features" - the features are hardly used, and it's detestably easy to use. "It supports several compression algorithms" - not as many as some others.
-
SnakeDoc about 11 years
The Unix Gods created it - therefore we must use it!
-
LawrenceC about 11 years
Tar also stores UNIX permissions natively, and is very well known and tested. There's more than a few instances where one might need to glob a bunch of files into a single file or stream, but not necessarily compress it.
-
allquixotic about 11 years
Hi @Kruug, I edited your post just to give a practical perspective on why people still choose to use these formats when they have a choice to use something else. I didn't change the text you already had. This is just to ensure that what appears to be the canonical answer to this question will have the full picture. Feel free to edit my edit if you want :)
-
SnakeDoc about 11 years
@allquixotic inception anyone? Edit the edit of an edit so you can edit an edit...
-
Kruug about 11 years
@allquixotic I feel a bit bad, getting all of these upvotes when at least 50% of the answer was yours.
-
psusi about 11 years
I don't know about rar (it's a terrible program that only seems to have become popular with pirates because of its ability to split into multiple smaller files), but you can stream zip just fine. The man page even mentions it. It also has the advantage of being able to extract or update files from the middle of a large archive efficiently, though tar tends to get slightly better compression. Compression vs. random access is a tradeoff.
-
MarcusJ about 11 years
But if you're going to archive, why not compress as well? Okay, yeah, it can save time for files that aren't easily compressed, but then archivers should probably know that music files, for example, aren't very compressible, except for the headers.
-
Carlos Campderrós about 11 years
@psusi incorrect. You can do hacks like this, but what it does is download the whole file into memory and then unzip it, instead of unzipping while downloading. And funzip just extracts the first file in the zipfile, not all of them.
-
Ярослав Рахматуллин about 11 years
This answer is definitely a case of "I'm sometimes blown away by undeserved upvotes". It does not address the core issue of the question, which is listing the contents of a compressed tar, and it's not even an original answer!
-
psusi about 11 years
Ahh, while you can pipe the output of zip, it appears that unzip is buggy and can't read from stdin. This is a defect in the program, though, not a limitation of the file format.
-
allquixotic about 11 years
For performance reasons it is often easier to use uncompressed file aggregation when sending data over very high bandwidth network links that exceed the speed at which the compressor can compress data. This is achievable for example with Gigabit Ethernet; only a few well-designed compression algorithms, which also have very poor compression ratio, can compress data that fast even on a large desktop CPU. On an embedded device you have even less CPU time to work with.
-
terdon about 11 years
@MarcusJ there are also all sorts of "uncompressible" binary file formats; running them through a compressor is a waste of time/CPU. tar, however, will archive them, making their transfer easier and faster. As you said, compressors can know about some of them (mp3, for example) and guess some others from the magic number, but not all.
-
allquixotic about 11 years
@Kruug Don't feel bad now; I posted my own answer ;-D
-
Lucas Holt about 11 years
Luckily tar is not limited to just GNU versions. While GNU tar is certainly a good piece of software, libarchive + related front ends are much faster and easy to embed in other software projects. You can make an argument for tar without turning it into a licensing fight.
-
Ярослав Рахматуллин about 11 years
@Lucas Holt Very true, I mention it in parentheses only because it's the only version I'm familiar with.
-
titaniumdecoy about 11 years
WebM might not be the best example since it is technically inferior to the more popular H.264 codec.
-
Andre Holzner about 11 years
Not only does this speed things up, it also allows preserving file ownership, timestamps and attributes (if the user privileges allow it)
-
Dietrich Epp about 11 years
It seems easier to use the pipe | ssh host tar x -C '~/somewhere'
-
Marco about 11 years
@DietrichEpp That doesn't work on Solaris.
-
user239558 about 11 years
@AndreHolzner Right. I often do tar cf - . | (cd ~/somewhere; tar xvf -). It is really useful not to have to wait until the central index is written (as, for example, in a zip file).
-
Roy Tinker about 11 years
@ЯрославРахматуллин: This answer provides the rationale for using tar from a Unix/Linux user's perspective, which readers are finding helpful. It deserves my upvote.
Stu about 11 yearsNo offense, but when on Earth is this an issue nowadays?
-
Chris Stratton about 11 yearsActually, most stock Android firmwares have an unzip and use renamed and optimized zip files as their application delivery format, and they may have a gzip, but they do not have a tar. Alternate installations often have a more complete unix toolset.
-
Mark Adler about 11 years
zip can store and restore Unix permissions. The zip and unzip utilities from Info-ZIP normally distributed with Unix systems do this.
-
Mark Adler about 11 years
zip does not compress the file in 32K chunks. You are confusing the sliding window size of 32K with how the compression is done.
-
Mark Adler about 11 years
So much misinformation in one answer.
-
wim about 11 years
-1 for great justice. This should have been a comment.
-
Konrad Rudolph about 11 years
Why would you use this rather than scp, rsync, SFTP or any of the other file transfer protocols, though?
-
michael about 11 years
I don't buy the legacy/lowest-common-denominator argument; I remember on new systems (Sun) frequently having to download gzip/gunzip (from sunfreeware) just to install other tar.gz-packaged software (plus GNU tar, since Sun's tar sucked). For legacy/lowest common denominator, you had tar.Z (compress/uncompress). The progression of utilities has been a constant stream (no pun intended) of change and improvement: Z => zip => gz => bz2 => 7z => xz (or whatever order you prefer). As for tar's role, some utils un/compress only, and still require tar to bundle up file hierarchies.
-
michael about 11 years
Having used other tars, GNU tar is the only one I would trust to work consistently and correctly. Especially on Solaris, but I'm also a bit cautious with native (proprietary) tars on HP-UX/AIX and z/OS.
-
Carlos Campderrós about 11 years
@Stu just to clarify, it's not an issue; it's just optimizing your time (I don't care about space, if that's what you thought)
-
michael about 11 years
I use tar on the other end (the sending side, rather than the receiving side), since GNU tar has really flexible options for including/excluding files, over, say, scp -r; e.g., tar -czh --exclude=.svn --exclude=.git --exclude=*~ --exclude=*.bak -f - some_dir | ssh user@rmt_host "cat > ~/some_dir.tgz" (avoids creating a local tar.gz before sending, too)
-
Carlos Campderrós about 11 years
Both sides work: you can tar on one side and untar on the other, too: tar zc /some/folder | ssh user@host "cd /other/folder && tar zx"
-
xorsyst about 11 years
That may be implementation-specific then; it certainly isn't supported by the original pkzip.
-
Massey101 about 11 years
Downvote. Sarcasm is inappropriate on Stack Exchange. People do actually trust these answers.
-
Ярослав Рахматуллин about 11 years
I'm not sarcastic. I like RMS and the way he carries forth his beliefs.
-
Mark Adler about 11 years
Yes, the software has to be written to support it. The zip format supports it completely, with data descriptors that can follow the compressed data with the lengths and CRC.
-
Ilmari Karonen about 11 years
You don't need GNU tar to use an arbitrary compressor: just tell tar to write the archive to stdout with f - and pipe it to the compressor.
-
JFW about 11 years
Kudos for a great answer with all the content separated under three distinct headers.
-
F. Erken about 11 years
@psusi As I remember from the old times of using pkzip to store files on multiple floppies, zip stores the catalog at the end of the archive. It always requests the last floppy to start extraction or show the catalog. See en.wikipedia.org/wiki/File:ZIP-64_Internal_Layout.svg
-
psusi about 11 years
@mmv-ru, oh yeah, it is backwards, I forgot about that.
-
psusi about 11 years
@michael_n, the progression of compression tools has continued, yet we still use tar as the container format. The question made it clear it was talking about that, not the compression.
-
psusi about 11 years
@MarkAdler, it appears zip has been extended to store the file mode, but not the owner. 7zip still warns that it does not handle Unix permissions. Zip (and cab) does compress 32k blocks at a time, else it could not efficiently extract a file from the middle of a large archive, which is the problem tar has. 7z, rar, and dar have an option to use the blocking method (like zip) or "solid" mode (like tar), as they call it. Re: -9, it seems I was thinking of bzip2 and lzma, and gzip uses a more simplistic system, but it does not use a fixed 32k dictionary, though the window limits it to near there.
-
psusi about 11 years
@MarkAdler, what software? Info-ZIP doesn't support unzipping from a pipe.
-
timonsku about 11 years
I highly disagree that xz achieves better compression than a .7z archive. The 7-zip file format supports a wide variety of compression algorithms, including LZMA(2), which is its "home" compression algorithm and was developed by the 7-zip developer. From the xz wiki article: "xz is essentially a stripped down version of the 7-Zip program, which uses its own file format rather than the .7z format used by 7-Zip which lacks support for Unix-like file system metadata."
-
allquixotic about 11 years
XZ uses LZMA2 as its compression algorithm. The only difference is that 7-zip has a different metadata format. The mathematics used to compress the files is exactly the same as LZMA2. Certain input data can yield better compression ratios if you use PPMD compression in 7-zip, but the runtime and memory costs of PPMD far exceed any other compression algorithm in existence, both for compression and decompression.
-
allquixotic about 11 years
LZMA on the other hand decompresses very fast (almost as fast as zip, and much much faster than it compresses). PPMD, while it may save a few kilobytes on several dozen megabytes of data, will take gigabytes of memory to decompress, and will decompress just as slowly as it compresses (slooooooooooooooooooooooooooow). So, throwing out ppmd as being impractical, Xz and 7-Zip are identical in compression capability, varying insignificantly based on the way they store file structure and metadata.
-
Mark Adler about 11 years
-
Mark Adler about 11 years
The zip format can store both the uid and gid.
-
Mark Adler about 11 years
Also Info-ZIP's zip supports compression to a stream.
-
michael about 11 years
@psusi yeah, I know / understand / agree / etc. And now (GNU) tar compresses, too, in a variety of formats (gz/bz/xz/yada-yada-yada-z): time rolls on, lines blur, things change, and Sun's tar still doesn't handle long file/path names. (...arguably for "posix compliance", but no need to delve into pedantry (my fault) and lose the larger point (whatever it was, I forget))
-
kriss about 11 years
@Konrad: you can perform that kind of transfer with tar using very simple network tools like netcat. scp, rsync, sftp and such imply running much more complex client and server software.
-
Christian about 11 years
You practically never see uncompressed tar files, and there's a reason for that: tar uses very large chunks, meaning that you get a lot of padding at the end of files. To get rid of all these zeros, it almost always pays to just use gzip without giving it a second thought.
-
slhck about 11 years
@titaniumdecoy Have you noticed that it was allquixotic who originally wrote that part and edited it into Kruug's answer?
-
psusi about 11 years
@MarkAdler, I once worked on the cab extractor for ReactOS; trust me, it compresses 32k at a time, either combining smaller files or splitting larger ones as needed.
-
Mark Adler about 11 years
When working on the CAB format, it might have been a good idea to spend some time studying the cab format specification. The 32K CFDATA blocks are not random access entry points. The random access points are at the start of CAB "folders", which consist of a series of CFDATA blocks. From the specification: "By default, all files are added to a single folder (compression history) in the cabinet." So a non-default option would be needed for a CAB file to have any random access midpoints at all.
-
Mark Adler about 11 years
Your edited answer has improved, but is still chock full of misinformation. zip does not compress in 32K chunks, and does not provide access to parts of files without having to decompress the entire file. "It also prevents the compressor from building up a very large dictionary before it is restarted." is nonsensical. There is no building up of anything. The deflate dictionary is simply the 32K bytes that precede the next byte to be compressed. Once you get past the first 32K, the dictionary is always the same size, there is no "building up", and the compression speed does not change.
-
Mark Adler about 11 years
An amusing exception is that the gzip source code is available as a naked tar, for obvious reasons.
-
titaniumdecoy about 11 years
Thanks for pointing that out, I didn't notice. However, it seems a bit silly to me to have an identical block of text in two different answers on this page.
-
ctype.h about 11 years
CW stands for Community Wiki. See also: What are "Community Wiki" posts?
-
ctype.h about 11 years
I guess it is CW because the question has more than 15 answers. When you posted this answer, because it is the 15th, the question and all of the answers were marked CW.
-
psusi about 11 years
Because the data stream is broken into a series of CFDATA blocks that are limited in size, that does, in fact, provide for random access, since you can seek to any CFDATA block and start decompression there. The folder mechanism is a seemingly useless abstraction. As I said, the deflate dictionary is not strictly limited to 32k, though in practice it tends not to grow much larger due to the 32k distance limit, but Info-ZIP allows for bzip2, which has no such limit. Whatever the limits of the compression algorithm, restarting it does reduce compression ratios.
-
Mark Adler about 11 years
No, you cannot start decompressing at any CFDATA block. Read the specification, which is very clear on this point. Within a folder, each CFDATA block can and does use the previous CFDATA blocks as history for compression. The folder is the only abstraction in the specification that defines where you can start decompressing, so it is not only useful, but essential for the random extraction application you are calling attention to in your answer.
-
Mark Adler about 11 years
The deflate dictionary is strictly limited to 32K. It does not "grow" once you're at least 32K into the stream. From there on it is always exactly 32K. bzip2 certainly does have a limit of 900K of history, which is not a sliding dictionary but rather a block on which the BWT transform is applied. Each block is compressed independently, and cannot make use of the information in previous blocks.
-
Mark Adler about 11 years
Since there seems to be no limit to the amount of misinformation you can fabricate, this is no longer productive. I am done commenting on this answer and related comments. Thank you and good night.
- MarcusJ, about 11 years ago: Really good comment, I hadn't even thought of that, and that's a REALLY good point to make.
- allquixotic, about 11 years ago: I fail to see how this answer says something that none of the other answers do, other than directly quoting the questions (which I wrote, BTW, because the original revision of the question was horrible enough to be closed as NARQ). Nice try, though.
- Mark Adler, about 11 years ago: Um, ok. Whatever you'd like to think is fine. Neither your answer nor any other answer seems to address whether there is a performance penalty. Your answer does not address the noticeable compression difference, though others do. Since yours does not actually address performance (your performance section is actually about workflow, not performance), no single answer covers everything in one place. It is interesting that you wrote the performance-penalty question, but you did not answer it! Go figure.
- Mark Adler, about 11 years ago: By the way, your workflow discussion is about something no one ever does, which is to write a tar file to disk and then compress it. `tar` is always used either by calling the compression program directly or by piping directly into a compression program.
- Warren P, about 11 years ago: Sarcasm and sincerity travel poorly across plain text; in person we usually infer them from tone of voice. Guessing whether someone is serious on the internet is a bit difficult.
- Ярослав Рахматуллин, about 11 years ago: @WarrenP thanks for the comment. I'll try to maintain a neutral tone in the future.
- Lajos Veres, about 10 years ago: I don't have enough rep to add an answer, so I'll write here: AFAIK `tar`'s fault tolerance is much higher than that of other similar tools. If you have to save something from a not-so-reliable medium (for example, a network filesystem), `tar` is probably the best tool for saving as much data as possible. rsync and other tools failed when the first error happened, but with `tar` we were able to get past single errors. (It was a not-so-critical daily backup.)
- Steve, over 9 years ago: "Do you want uncompressed data (why?)?" I use `tar` very often to copy a filesystem tree from one place to another and preserve permissions, etc., and compression in this case just takes extra CPU cycles. E.g. `tar cf - * | tar xf - -C /somewhere`.
- Aaron, over 9 years ago: Additionally, you would want a .tar file when the destination filesystem performs de-duplication. Creating compressed archives on a filesystem that performs de-duplication will substantially lower the dedupe ratio. Example: we once deleted a $10,000.00 tar.gz file; meaning, it was taking up $10k worth of storage space because someone used compression.
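Steve's tar-to-tar pipe above is the classic "copy a tree, keep the metadata" idiom and is worth spelling out. A minimal sketch, with made-up `/tmp` demo paths: `-` as the archive name means stdin/stdout, and `-p` on extraction asks `tar` to restore the recorded permissions.

```shell
# Hypothetical demo paths; any source/destination directories work the same way.
rm -rf /tmp/tar_demo_src /tmp/tar_demo_dst
mkdir -p /tmp/tar_demo_src /tmp/tar_demo_dst
echo hello > /tmp/tar_demo_src/file.txt
chmod 640 /tmp/tar_demo_src/file.txt

# Pack to stdout, unpack from stdin: no intermediate archive, no compression,
# and permission metadata travels with the stream.
( cd /tmp/tar_demo_src && tar cf - . ) | ( cd /tmp/tar_demo_dst && tar xpf - )
```

No archive file ever touches the disk; the aggregation format is used purely as an in-flight container, which is exactly the "uncompressed data (why?)" use case Steve describes.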
- underscore_d, over 8 years ago: ^ I think "higher fault tolerance" there actually means 'does not notice errors and will blindly stumble on, whether you want it to or not'!
- agc, almost 8 years ago: Re "So obsessed"... imagine you're stranded in a warzone with a single hardened laptop, and the undersized 20 GB hard drive is nearly full, maybe a gig left. Hearing gunfire from far off, you'd really like to browse a 100 MB PDF manual that shows how to repair the jeep, but the file is inside a 2 GB .tgz file. And the laptop runs a closed-source, strange, proprietary OS, and you don't have root access to delete system files, not that it'd be obvious how to delete 4 GB+ without breaking the dearchiver or the PDF viewer. If you could just extract that 100 MB file...
- gaborous, almost 8 years ago: I disagree that dar doesn't provide enough of a benefit to justify the change: it is way more robust and way less susceptible to corruption (i.e., `tar` produces a solid archive whereas dar does not, so dar allows partial file extraction even from a corrupted archive, whereas you lose all your files in a corrupted tar). In addition, most modern archiving features are natively supported, such as encryption. So certainly the benefits are huge and certainly justify the change; the reason it has not been more widely adopted must therefore be found elsewhere (lack of easy GUI tools? Inertia?).
- gaborous, almost 8 years ago: So I stand with @MarkAdler; this answer is based on incorrect premises: tar does not allow partial file extraction. In fact it's the opposite: if you tar your files before feeding them to zip/Deflate, you lose the ability to partially extract files without uncompressing the archive, because tar can only make solid archives.
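The solid-archive distinction being argued here can be demonstrated with Python's standard library (a sketch with made-up member names): a zip's central directory lets a reader jump straight to one member, each member being compressed independently, while a `tar.gz` is one gzip stream that must be decompressed sequentially until the wanted member is found.

```python
import io
import tarfile
import zipfile

data = {"a.txt": b"A" * 1000, "b.txt": b"B" * 1000}

# Build a zip: each member is compressed on its own.
zbuf = io.BytesIO()
with zipfile.ZipFile(zbuf, "w", zipfile.ZIP_DEFLATED) as z:
    for name, payload in data.items():
        z.writestr(name, payload)

# Build a tar.gz: one gzip stream over the whole aggregate (a "solid" archive).
tbuf = io.BytesIO()
with tarfile.open(fileobj=tbuf, mode="w:gz") as t:
    for name, payload in data.items():
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        t.addfile(info, io.BytesIO(payload))

# zip: the central directory points directly at b.txt.
zbuf.seek(0)
with zipfile.ZipFile(zbuf) as z:
    zip_member = z.read("b.txt")

# tar.gz: tarfile has to decompress and scan the stream to locate b.txt.
tbuf.seek(0)
with tarfile.open(fileobj=tbuf, mode="r:gz") as t:
    tar_member = t.extractfile("b.txt").read()

print(zip_member == tar_member)  # True: same bytes, very different access patterns
```

Both calls return the same content; the difference is that corruption or truncation early in the `tar.gz` stream takes every later member with it, whereas independently compressed zip members can still be recovered individually.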
- gaborous, almost 8 years ago: This answer is the only one that makes sense. Thank you for posting it.
- gaborous, almost 8 years ago: This answers why `tar` fits in the archiving ecosystem (i.e., to aggregate files together, providing a performance boost and some other benefits like permission saving), but it does not address why modern alternatives such as `dar` aren't used in its place. In other words, this answer justifies the usage of file aggregators, but not of the `tar` software in itself.
- phuclv, almost 6 years ago: @Steve CPU cycles may be cheaper than disk IO for algorithms like LZ4 or LZO. That's why they're used in zram and in transparently compressing file systems like NTFS, ZFS, Btrfs... so sometimes it's actually faster to compress, since the amount of disk IO is greatly reduced.
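phuclv's point is easy to sanity-check: when the data is compressible, a cheap compressor drastically shrinks the number of bytes that ever have to hit the disk. A minimal sketch using zlib at level 1 as a stand-in for LZ4/LZO (which are not in the Python standard library; the sample data is made up):

```python
import zlib

# ~132 KiB of highly repetitive data, e.g. a log file
payload = b"some log line that repeats a lot\n" * 4096

# Level 1 = cheapest CPU setting; it still collapses the repetition
fast = zlib.compress(payload, 1)

print(len(payload), len(fast))
# Writing the compressed form means far fewer bytes of disk IO, which on
# fast compressors can outweigh the CPU cost of compressing at all.
```

This is the trade-off zram and transparently compressing filesystems bet on: for IO-bound workloads, spending a little CPU to write a tenth of the bytes is a net win.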