What is the advantage of using 'tar' today?
Solution 1
Part 1: Performance
Here is a comparison of two separate workflows and what they do.
You have a file on disk, blah.tar.gz, which is, say, 1 GB of gzip-compressed data that, when uncompressed, occupies 2 GB (a compression ratio of 50%).
The way that you would create this, if you were to do archiving and compression separately, would be:
tar cf blah.tar files ...
This would result in blah.tar, which is a mere aggregation of the files ... in uncompressed form.
Then you would do:
gzip blah.tar
This would read the contents of blah.tar from disk, compress them through the gzip compression algorithm, write the result to blah.tar.gz, then unlink (delete) the file blah.tar.
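The two steps above can also be collapsed into one: modern tar implementations run the compressor for you through a pipe, so no intermediate blah.tar ever touches the disk. A minimal sketch, using a throwaway directory so it is safe to run:

```shell
set -e
work=$(mktemp -d)
mkdir "$work/files"
echo "hello" > "$work/files/a.txt"
# Single step: the z flag pipes the tar stream through gzip internally,
# so no intermediate blah.tar is ever written to disk.
tar czf "$work/blah.tar.gz" -C "$work" files
# Round-trip check: extract into a fresh directory
mkdir "$work/out"
tar xzf "$work/blah.tar.gz" -C "$work/out"
```
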
Now, let's decompress!
Way 1
You have blah.tar.gz, one way or another.
You decide to run:
gunzip blah.tar.gz
This will:
- READ the 1 GB compressed data contents of blah.tar.gz.
- PROCESS the compressed data through the gzip decompressor in memory.
- As the memory buffer fills up with "a block" worth of data, WRITE the uncompressed data into the file blah.tar on disk, and repeat until all the compressed data is read.
- Unlink (delete) the file blah.tar.gz.
Now, you have blah.tar on disk, which is uncompressed but contains one or more files within it, with very low data-structure overhead (a 512-byte header per file, plus padding of the file data to 512-byte boundaries).
You run:
tar xvf blah.tar
This will:
- READ the 2 GB of uncompressed data contents of blah.tar and the tar file format's data structures, including information about file permissions, file names, directories, etc.
- WRITE the 2 GB of data plus the metadata to disk. This involves translating the data structure / metadata information into creating new files and directories on disk as appropriate, or rewriting existing files and directories with new data contents.
The total data we READ from disk in this process was 1 GB (for gunzip) + 2 GB (for tar) = 3 GB.
The total data we WROTE to disk in this process was 2 GB (for gunzip) + 2 GB (for tar) + a few bytes for metadata = about 4 GB.
Way 2
You have blah.tar.gz, one way or another.
You decide to run:
tar xvzf blah.tar.gz
This will:
- READ the 1 GB compressed data contents of blah.tar.gz, a block at a time, into memory.
- PROCESS the compressed data through the gzip decompressor in memory.
- As the memory buffer fills up, pipe that data, in memory, through to the tar file format parser, which reads the information about metadata, etc. and the uncompressed file data.
- As the memory buffer fills up in the tar file parser, WRITE the uncompressed data to disk, creating files and directories and filling them with the uncompressed contents.
The total data we READ from disk in this process was 1 GB of compressed data, period.
The total data we WROTE to disk in this process was 2 GB of uncompressed data + a few bytes for metadata = about 2 GB.
Notice that the amount of disk I/O in Way 2 is identical to the disk I/O performed by, say, the Zip or 7-Zip programs, adjusting for any differences in compression ratio.
And if compression ratio is your concern, use the xz compressor to encapsulate tar, and you have an LZMA2-compressed tar archive, which is just as efficient as the most advanced algorithm available to 7-Zip :-)
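As a sketch of the xz route (assuming GNU tar, whose J flag selects xz, and an xz binary on PATH; other tars can pipe through xz explicitly):

```shell
set -e
work=$(mktemp -d)
mkdir "$work/src"
echo "data" > "$work/src/f.txt"
# GNU tar: J filters the archive stream through xz (LZMA2)
tar cJf "$work/blah.tar.xz" -C "$work" src
# Portable equivalent: write the tar stream to stdout, compress it yourself
tar cf - -C "$work" src | xz > "$work/blah2.tar.xz"
# Extraction mirrors creation
mkdir "$work/out"
tar xJf "$work/blah.tar.xz" -C "$work/out"
```
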
Part 2: Features
tar stores Unix permissions within its file metadata, and is very well known and tested for successfully packing up a directory with all kinds of different permissions, symbolic links, etc. There are more than a few instances where one might need to glob a bunch of files into a single file or stream, but not necessarily compress it (although compression is useful and often used).
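A quick sketch of the permission and symlink handling (standard tar flags; -p asks tar to restore the recorded modes exactly on extraction):

```shell
set -e
work=$(mktemp -d)
mkdir "$work/tree"
echo "secret" > "$work/tree/key"
chmod 600 "$work/tree/key"      # restrictive permissions, recorded in the archive
ln -s key "$work/tree/link"     # a symbolic link, stored as a link (the default)
tar cf "$work/tree.tar" -C "$work" tree
mkdir "$work/out"
# -p: restore the recorded permission bits exactly, ignoring umask
tar xpf "$work/tree.tar" -C "$work/out"
```
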
Part 3: Compatibility
Many tools are distributed in source or binary form as .tar.gz or .tar.bz2, because it is a "lowest common denominator" file format: much like most Windows users have access to .zip or .rar decompressors, most Linux installations, even the most basic, will have access to at least tar and gunzip, no matter how old or pared down. Even Android firmwares have access to these tools.
New projects targeting audiences running modern distributions may very well distribute in a more modern format, such as .tar.xz (using the Xz (LZMA) compression format, which compresses better than gzip or bzip2), or .7z, which is similar to the ZIP or RAR file formats in that it both compresses and specifies a layout for encapsulating multiple files into a single file.
You don't see .7z used more often for the same reason that music isn't sold from online download stores in brand new formats like Opus, or video in WebM. Compatibility with people running ancient or very basic systems.
Solution 2
This has been answered on Stack Overflow.
bzip and gzip work on single files, not groups of files. Plain old zip (and pkzip) operate on groups of files and have the concept of the archive built-in.
The *nix philosophy is one of small tools that do specific jobs very well and can be chained together. That's why there's two tools here that have specific tasks, and they're designed to fit well together. It also means you can use tar to group files and then you have a choice of compression tool (bzip, gzip, etc).
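That chaining looks like this in practice: tar only does the grouping, and the compressor is whatever you pipe the stream through (this sketch assumes gzip and bzip2 are on PATH):

```shell
set -e
work=$(mktemp -d)
mkdir "$work/src"
echo "hi" > "$work/src/x"
# tar groups the files; the pipe decides the compression
tar cf - -C "$work" src | gzip  > "$work/src.tar.gz"
tar cf - -C "$work" src | bzip2 > "$work/src.tar.bz2"
# The same chain in reverse for extraction
mkdir "$work/out"
bzip2 -dc "$work/src.tar.bz2" | tar xf - -C "$work/out"
```
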
Many tools are distributed in source or binary form as .tar.gz or .tar.bz2, because it is a "lowest common denominator" file format: much like most Windows users have access to .zip or .rar decompressors, most Linux installations, even the most basic, will have access to at least tar and gunzip, no matter how old or pared down. Even Android firmwares have access to these tools.
New projects targeting audiences running modern distributions may very well distribute in a more modern format, such as .tar.xz
(using the Xz (LZMA) compression format, which compresses better than gzip or bzip2), or .7z, which is similar to the ZIP or RAR file formats in that it both compresses and specifies a layout for encapsulating multiple files into a single file.
You don't see .7z used more often for the same reason that music isn't sold from online download stores in brand new formats like Opus, or video in WebM. Compatibility with people running ancient or very basic systems is important.
Solution 3
Tar has a rich set of operations and modifiers that know all about Unix file systems. It knows about Unix permissions, about the different times associated with files, about hard links, about soft links (and about the possibility that symbolic links could introduce cycles in the filesystem graph), and it allows you to specify several different ways of managing all this data.
Do you want the extracted data to preserve file access times? Tar can do that. To preserve permissions? Tar can do that.
Do you want to preserve symbolic links as symbolic links? Tar does that by default. Want to copy the target instead? Tar can do that.
Do you want to be sure hardlinked data is only stored once (that is, to do the right thing)? Tar does that.
Do you want to handle sparse files well? Tar can do that.
Do you want uncompressed data (why?)? Tar can do that. To compress with gzip? Tar can do that. With bzip2? Tar can do that. With arbitrary external compression programs? Tar can do that.
Do you want to write or recover to/from a raw device? Tar's format handles that fine.
Do you want to add files to an existing archive? Tar can do that. To diff two archives to see what changed? Tar can do that. To update only those parts of the archive that have changed? Tar can do that.
Do you want to be sure you don't archive across more than one filesystem? Tar can do that.
Do you want to grab only files that are newer than your last backup? Tar can do that.
Do you want to preserve user and group names or numbers? Tar can do either one.
Do you need to preserve device nodes (like the files in /dev) so that after extraction, the system will run correctly? Tar can do that.
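A few of the behaviours above as concrete invocations. This is a sketch, not an exhaustive list, and the option names are GNU tar's:

```shell
set -e
work=$(mktemp -d)
mkdir "$work/src"
echo "one" > "$work/src/a"
tar cpf "$work/backup.tar" -C "$work" src         # -p: record permissions
echo "two" > "$work/src/b"
tar rf  "$work/backup.tar" -C "$work" src/b       # -r: append to an existing archive
tar tvf "$work/backup.tar" > "$work/listing.txt"  # -t: list contents without extracting
tar cf  "$work/same-fs.tar" --one-file-system -C "$work" src  # don't cross filesystems
```

Other flags mentioned above include -S (sparse files), -d (diff against the filesystem), -u (update changed members), --newer (incremental grabs) and --numeric-owner (numeric uid/gid instead of names).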
Tar has been evolving to handle lots and lots of use cases for decades and really does know a lot about the things people want to do with Unix filesystems.
Solution 4
You confuse the two distinct processes of archiving and compression.
Reasons for using an archiver
One reason to use archiving without compression is, for instance, if a bunch of files is copied from one host to another. A command like the following
tar cf - some_directory | ssh host "(cd ~/somewhere && tar xf -)"
can speed things up considerably. If I know that the files cannot be compressed, or if SSH is set up with compression, it can save considerable CPU time. Sure, one can use a more modern compressing tool with an archiving function and turn off the compression. The advantage of tar is that I can expect it to be available on every system.
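The same pattern works locally, which makes it easy to try: one tar streams the tree to stdout and a second tar recreates it elsewhere, with no archive file ever written. A minimal sketch of the pipeline above, minus the ssh hop:

```shell
set -e
src=$(mktemp -d)
dst=$(mktemp -d)
echo "payload" > "$src/file.txt"
# Stream the tree out of one directory and into another; the archive
# exists only inside the pipe, never on disk.
tar cf - -C "$src" . | (cd "$dst" && tar xf -)
```
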
Reasons for using an archiver with gzip compression
One reason that I use tar with gzip is: speed! If I want to transfer a few GiB of text files from one place to another, I don't care about squeezing out the last bytes, since the compression is only used for transit, not for long-term storage. In those cases I use gzip, which doesn't max out the CPU (in contrast to 7-Zip, for instance), which means that I'm I/O bound again and not CPU bound. And again: gzip can be considered available everywhere.
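If speed matters even more than gzip's default ratio, the compression level can be dialled down in the pipeline. A sketch with gzip -1 (fastest level), on some cheaply generated repetitive text:

```shell
set -e
work=$(mktemp -d)
mkdir "$work/txt"
yes "some repetitive log line" | head -n 10000 > "$work/txt/log"
# gzip -1 trades a little compression ratio for speed; fine when the
# compression is only for transit, not long-term storage
tar cf - -C "$work" txt | gzip -1 > "$work/txt.tgz"
mkdir "$work/out"
gzip -dc "$work/txt.tgz" | tar xf - -C "$work/out"
```
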
Reasons for using tar in favour of scp, rsync, etc.
It beats scp if you have a lot of small files to copy (for example, a mail directory with hundreds of thousands of files). rsync, awesome as it is, might not be available everywhere. Further, rsync only really pays off if part of the files, or an older version, is already present on the destination. For the initial copy, tar is the fastest, with compression or without, depending on the actual data.
Solution 5
Adding to the other good answers here, I prefer the combination tar + gzip|bzip2|xz mainly because these compressed files are like streams, and you can pipe them easily.
Say I need to uncompress a file available on the internet. With either zip or rar formats, I have to download it first and then uncompress it. With tar.{gz,bz2,xz} I can download and uncompress in the same step, without needing to have the compressed archive physically on disk:
curl -s http://example.com/some_compressed_file.tar.gz | tar zx
This will leave just the uncompressed files on my disk, and will speed up the whole process, because I am not wasting time first downloading the entire file and only uncompressing it after the download finishes. Instead, I am uncompressing it while it is downloading. You cannot do this with zip or rar files.
MarcusJ
Updated on September 18, 2022
Comments
-
MarcusJ over 1 year
I know that tar was made for tape archives back in the day, but today we have archive file formats that both aggregate files and perform compression within the same logical file format. Questions:
1. Is there a performance penalty during the aggregation/compression/decompression stages for using tar encapsulated in gzip or bzip2, when compared to using a file format that does aggregation and compression in the same data structure? Assume the runtime of the compressor being compared is identical (e.g. gzip and Deflate are similar).
2. Are there features of the tar file format that other file formats, such as .7z and .zip, do not have?
3. Since tar is such an old file format, and newer file formats exist today, why is tar (whether encapsulated in gzip, bzip2 or even the new xz) still so widely used today on GNU/Linux, Android, BSD, and other such UNIX operating systems, for file transfers, program source and binary downloads, and sometimes even as a package manager format?
-
Griffin about 11 years
It's a very good question. I too highly dislike the whole operation of installing software with either odd names or that I can't simply apt-get. The only reason I can see it getting downvoted is that this is more of a question for Unix/Linux. However, SU should accept this.
-
user1686 about 11 years
@Griffin: The question is not about installing software from tarballs. It is about using the Tar format (e.g. over Zip or RAR)
-
allquixotic about 11 years
I disagree that it "wastes time". If you mean performance, there is no actual performance penalty for tar, as the format is very efficient. If you mean it wastes your time, I don't see how tar xvzf is harder than 7z -x ...
-
MarcusJ about 11 years
Allquixotic, I mean that you have to extract the archive twice: the first time to extract the tar, and the second to extract from the tar.
-
psusi about 11 years
He seems to be lamenting the fact that tar does not store a catalog at the start, so GUI compression tools that want to list the contents prior to extracting have to decompress the whole tar just to list the contents, then decompress it again when extracting.
-
Kruug about 11 years
@MarcusJ Usually, the tar.xx formats have a one-line solution. If you have tar.gz, for example, you could use tar -xzf <file>.tar.gz and it will decompress and extract all at once.
-
MarcusJ about 11 years
psusi, no no no, I'm talking about the fact that tar needs a separate compressor and decompressor, so basically when you open a tar.gz, you need to extract BOTH the gz file to get the tar, then have to extract the tar file, instead of merely decompressing something like a 7z in one step. It takes more CPU power to do it like this, and seems redundant.
-
psusi about 11 years
@MarcusJ, both steps have to be done either way, so it takes no more CPU power.
-
MarcusJ about 11 years
Not to say you're wrong or anything, but how would a 7z require both steps? It would merely load the file, then decompress whatever was selected to be decompressed. :/
-
mike3996 about 11 years
@MarcusJ: you think 7z somehow magically knows where each file starts in an archive? Besides, the usual compression algorithms (gzip, bzip2) work by streaming the content: no need to complete the first stage 100% before the next.
-
psusi about 11 years
Which step do you think it doesn't have to do? It has to parse the file format, and it has to decompress the content. The difference is really just in the order the two are done. tar decompresses the content first, then parses the archive. 7zip parses the archive, then decompresses the file content (the metadata is uncompressed).
-
allquixotic about 11 years
Also @MarcusJ you seem to be confusing two different things: when you do tar xvzf, the uncompressed data is not written to hard disk in .tar format! You're right that if you ran gunzip blah.tar.gz and then tar xf blah.tar, it would write the data to disk twice (once as a .tar and again as files in the filesystem), but nobody actually does it that way. tar xzf uses a UNIX pipe (basically a memory copy) to transfer the uncompressed data from gzip (or whatever compressor) to tar, so the data is not written to disk in .tar format.
-
PPC about 11 years
One thing I know is that tar (especially compressed) behaves awfully when it comes to data corruption. The small redundancy / recovery data added by modern formats is worth gold.
-
user239558 about 11 years
tar is superior for streaming. Unlike zip, you don't have to wait for the central directory. For archiving, this can also be a disadvantage (slower to list contents). tar xvzf will also automatically use two processes/cores, so it's not inefficient to split the two processes.
-
André Paramés about 11 years
@PPC: that's what PAR files are for. Tar is a Unix utility; as such, error correction is best left to dedicated tools.
-
Thomas Andrews about 11 years
Hmm, tar keeps soft links. I can recall back in the day doing "tar cf - | ( cd /somewhere/else ; tar xf -)" rather a lot, because "cp" didn't have a flag to respect soft links. I don't know if it does today; if I encountered the problem, I'd probably just use tar this way again.
-
Keith Thompson about 11 years
@Kruug: GNU tar automatically applies the z (or j, or J) flag: tar xf foo.tar.gz. It does this based on the actual content of the file, not its name, so it still works even if a gzipped tar file is named foo.tar.
-
o0'. about 10 years
@psusi however, if you want to extract just a single file, AFAIK tar has to decompress the whole archive first, while another format could decompress only the target file instead.
-
Griffin about 11 years
"It's free software" - so are a lot of them. "It's good at what it does" - hardly, compared to other stuff. "It's well documented and has many features" - the features are hardly used, and it's detestably easy to use. "It supports several compression algorithms" - not as many as some others.
-
SnakeDoc about 11 years
The Unix Gods created it - therefore we must use it!
-
LawrenceC about 11 years
Tar also stores UNIX permissions natively, and is very well known and tested. There's more than a few instances where one might need to glob a bunch of files into a single file or stream, but not necessarily compress it.
-
allquixotic about 11 years
Hi @Kruug, I edited your post just to give a practical perspective on why people still choose to use these formats when they have a choice to use something else. I didn't change the text you already had. This is just to ensure that what appears to be the canonical answer to this question will have the full picture. Feel free to edit my edit if you want :)
-
SnakeDoc about 11 years
@allquixotic inception anyone? Edit the edit of an edit so you can edit an edit...
-
Kruug about 11 years
@allquixotic I feel a bit bad, getting all of these upvotes when at least 50% of the answer was yours.
-
psusi about 11 years
I don't know about rar (it's a terrible program that only seems to have become popular with pirates because of its ability to split into multiple smaller files), but you can stream zip just fine. The man page even mentions it. It also has the advantage of being able to extract or update files from the middle of a large archive efficiently, though tar tends to get slightly better compression. Compression vs. random access is a tradeoff.
-
MarcusJ about 11 years
But if you're going to archive, why not compress as well? Okay, yeah, it can save time for files that aren't easily compressed, but then archivers should probably know that music files, for example, aren't very compressible, except for the headers.
-
Carlos Campderrós about 11 years
@psusi incorrect. You can do hacks like this, but what it does is download the whole file into memory and then unzip it, instead of unzipping while downloading. And funzip just extracts the first file in the zipfile, not all of them.
-
Ярослав Рахматуллин about 11 years
This answer is definitely a case of "I'm sometimes blown away by undeserved upvotes". It does not address the core issue of the question, which is listing the contents of a compressed tar, and it's not even an original answer!
-
psusi about 11 years
Ahh, while you can pipe the output of zip, it appears that unzip is buggy and can't read from stdin. This is a defect in the program, though, not a limitation of the file format.
-
allquixotic about 11 years
For performance reasons it is often easier to use uncompressed file aggregation when sending data over very high bandwidth network links that exceed the speed at which the compressor can compress data. This is achievable for example with Gigabit Ethernet; only a few well-designed compression algorithms, which also have very poor compression ratio, can compress data that fast even on a large desktop CPU. On an embedded device you have even less CPU time to work with.
-
terdon about 11 years
@MarcusJ there are also all sorts of "uncompressible" binary file formats; running them through a compressor is a waste of time/CPU. tar, however, will archive them, making their transfer easier and faster. As you said, compressors can know about some of them (mp3, for example) and guess some others from the magic number, but not all.
-
allquixotic about 11 years
@Kruug Don't feel bad now; I posted my own answer ;-D
-
Lucas Holt about 11 years
Luckily tar is not limited to just GNU versions. While GNU tar is certainly a good piece of software, libarchive + related front ends are much faster and easy to embed in other software projects. You can make an argument for tar without turning it into a licensing fight.
-
Ярослав Рахматуллин about 11 years
@Lucas Holt Very true, I mention it in parentheses only because it's the only version I'm familiar with.
-
titaniumdecoy about 11 years
WebM might not be the best example since it is technically inferior to the more popular H.264 codec.
-
Andre Holzner about 11 years
Not only does this speed things up, it also allows preserving file ownership, timestamps and attributes (if the user privileges allow it)
-
Dietrich Epp about 11 years
It seems easier to use the pipe | ssh host tar x -C '~/somewhere'
-
Marco about 11 years
@DietrichEpp That doesn't work on Solaris.
-
user239558 about 11 years
@AndreHolzner Right. I often do tar cf - . | (cd ~/somewhere; tar xvf -). It is really useful not to have to wait until the central index is written (as, for example, in a zip file).
-
Roy Tinker about 11 years
@ЯрославРахматуллин: This answer provides the rationale for using tar from a Unix/Linux user's perspective, which readers are finding helpful. It deserves my upvote.
Stu about 11 yearsNo offense, but when on Earth is this an issue nowadays?
-
Chris Stratton about 11 yearsActually, most stock Android firmwares have an unzip and use renamed and optimized zip files as their application delivery format, and they may have a gzip, but they do not have a tar. Alternate installations often have a more complete unix toolset.
-
Mark Adler about 11 years
zip can store and restore Unix permissions. The zip and unzip utilities from Info-ZIP normally distributed with Unix systems do this.
-
Mark Adler about 11 years
zip does not compress the file in 32K chunks. You are confusing the sliding window size of 32K with how the compression is done.
-
Mark Adler about 11 years
So much misinformation in one answer.
-
wim about 11 years
-1 for great justice. This should have been a comment.
-
Konrad Rudolph about 11 years
Why would you use this rather than scp, rsync, SFTP or any of the other file transfer protocols, though?
-
michael about 11 years
I don't buy the legacy/lowest-common-denominator argument; I remember on new systems (Sun) frequently having to download gzip/gunzip (from sunfreeware) just to install other tar.gz-packaged software (plus GNU tar, since Sun's tar sucked). For legacy/lowest common denominator, you had tar.Z (compress/uncompress). The progression of utilities has been a constant stream (no pun intended) of change and improvement: Z => zip => gz => bz2 => 7z => xz (or whatever order you prefer). As for tar's role, some utils un/compress only, and still require tar to bundle up file hierarchies.
-
michael about 11 years
Having used other tars, GNU tar is the only one I would trust to work consistently and correctly. Especially on Solaris, but I'm also a bit cautious with native (proprietary) tars on HP-UX/AIX and z/OS.
-
Carlos Campderrós about 11 years
@Stu just to clarify, it's not an issue; it's just optimizing your time (I don't care about space, if that's what you thought)
-
michael about 11 years
I use tar on the other end (the sending side, rather than the receiving side), since GNU tar has really flexible options for including/excluding files, over, say, scp -r; e.g., tar -czh --exclude=.svn --exclude=.git --exclude=*~ --exclude=*.bak -f - some_dir | ssh user@rmt_host "cat > ~/some_dir.tgz" (avoids creating a local tar.gz before sending, too)
-
Carlos Campderrós about 11 years
Both sides work: you can tar on one side and untar on the other, too: tar zc /some/folder | ssh user@host "cd /other/folder && tar zx"
-
xorsyst about 11 years
That may be implementation-specific then; it certainly isn't supported by the original pkzip.
-
Massey101 about 11 years
Downvote. Sarcasm is inappropriate on Stack Exchange. People do actually trust these answers.
-
Ярослав Рахматуллин about 11 years
I'm not sarcastic. I like RMS and the way he carries forth his beliefs.
-
Mark Adler about 11 years
Yes, the software has to be written to support it. The zip format supports it completely, with data descriptors that can follow the compressed data with the lengths and CRC.
-
Ilmari Karonen about 11 years
You don't need GNU tar to use an arbitrary compressor: just tell tar to write the archive to stdout with f - and pipe it to the compressor.
-
JFW about 11 years
Kudos for a great answer with all the content separated under three distinct headers.
-
F. Erken about 11 years
@psusi As I remember from the old times of using pkzip to store files on multiple floppies, zip stores the catalog at the end of the archive. It always requests the last floppy to start extraction or show the catalog. See en.wikipedia.org/wiki/File:ZIP-64_Internal_Layout.svg
-
psusi about 11 years
@mmv-ru, oh yeah, it is backwards, I forgot about that.
-
psusi about 11 years
@michael_n, the progression of compression tools has continued, yet we still use tar as the container format. The question made it clear it was talking about that, not the compression.
-
psusi about 11 years
@MarkAdler, it appears zip has been extended to store the file mode, but not the owner. 7zip still warns that it does not handle Unix permissions. Zip (and cab) does compress 32k blocks at a time, else it could not efficiently extract a file from the middle of a large archive, which is the problem tar has. 7z, rar, and dar have an option to use the blocking method (like zip) or "solid" mode (like tar), as they call it. Re: -9, it seems I was thinking of bzip2 and lzma, and gzip uses a more simplistic system, but it does not use a fixed 32k dictionary, though the window limits it to near there.
-
psusi about 11 years
@MarkAdler, what software? Info-ZIP doesn't support unzipping from a pipe.
-
timonsku about 11 years
I highly disagree that xz achieves better compression than a .7z archive. The 7-zip file format supports a wide variety of compression algorithms, including LZMA(2), which is its "home" compression algorithm and was developed by the 7-zip developer. From the xz wiki article: "xz is essentially a stripped down version of the 7-Zip program, which uses its own file format rather than the .7z format used by 7-Zip which lacks support for Unix-like file system metadata."
-
allquixotic about 11 years
XZ uses LZMA2 as its compression algorithm. The only difference is that 7-zip has a different metadata format. The mathematics used to compress the files is exactly the same as LZMA2. Certain input data can yield better compression ratios if you use PPMD compression in 7-zip, but the runtime and memory costs of PPMD far exceed any other compression algorithm in existence, both for compression and decompression.
-
allquixotic about 11 years
LZMA on the other hand decompresses very fast (almost as fast as zip, and much much faster than it compresses). PPMD, while it may save a few kilobytes on several dozen megabytes of data, will take gigabytes of memory to decompress, and will decompress just as slowly as it compresses (slooooooooooooooooooooooooooow). So, throwing out ppmd as being impractical, Xz and 7-Zip are identical in compression capability, varying insignificantly based on the way they store file structure and metadata.
-
Mark Adler about 11 years
-
Mark Adler about 11 years
The zip format can store both the uid and gid.
-
Mark Adler about 11 years
Also Info-ZIP's zip supports compression to a stream.
-
michael about 11 years
@psusi yeah, I know / understand / agree / etc. And now (GNU) tar compresses, too, in a variety of formats (gz/bz/xz/yada-yada-yada-z): time rolls on, lines blur, things change, and Sun's tar still doesn't handle long file/path names. (...arguably for "posix compliance", but no need to delve into pedantry (my fault) and lose the larger point (whatever it was, I forget))
-
kriss about 11 years
@Konrad: you can perform that kind of transfer with tar using very simple network tools like netcat. scp, rsync, sftp and such imply running much more complex client and server software.
-
Christian about 11 years
You practically never see uncompressed tar files, and there's a reason for that: tar uses very large chunks, meaning that you get a lot of padding at the end of files. To get rid of all these zeros, it almost always pays to just use gzip without giving it a second thought.
-
slhck about 11 years
@titaniumdecoy Have you noticed that it was allquixotic who originally wrote that part and edited it into Kruug's answer?
-
psusi about 11 years
@MarkAdler, I once worked on the cab extractor for ReactOS; trust me, it compresses 32k at a time, either combining smaller files or splitting larger ones as needed.
-
Mark Adler about 11 years
When working on the CAB format, it might have been a good idea to spend some time studying the cab format specification. The 32K CFDATA blocks are not random access entry points. The random access points are at the start of CAB "folders", which consist of a series of CFDATA blocks. From the specification: "By default, all files are added to a single folder (compression history) in the cabinet." So a non-default option would be needed for a CAB file to have any random access midpoints at all.
-
Mark Adler about 11 years
Your edited answer has improved, but is still chock full of misinformation. zip does not compress in 32K chunks, and does not provide access to parts of files without having to decompress the entire file. "It also prevents the compressor from building up a very large dictionary before it is restarted." is nonsensical. There is no building up of anything. The deflate dictionary is simply the 32K bytes that precede the next byte to be compressed. Once you get past the first 32K, the dictionary is always the same size, there is no "building up", and the compression speed does not change.
-
Mark Adler about 11 years
An amusing exception is that the gzip source code is available as a naked tar, for obvious reasons.
-
titaniumdecoy about 11 years
Thanks for pointing that out, I didn't notice. However, it seems a bit silly to me to have an identical block of text in two different answers on this page.
-
ctype.h about 11 years
CW stands for Community Wiki. See also: What are "Community Wiki" posts?
-
ctype.h about 11 years
I guess it is CW because the question has more than 15 answers. When you posted this answer, because it is the 15th, the question and all of the answers were marked CW.
-
psusi about 11 years
Because the data stream is broken into a series of CFDATA blocks that are limited in size, that does, in fact, provide for random access, since you can seek to any CFDATA block and start decompression there. The folder mechanism is a seemingly useless abstraction. As I said, the deflate dictionary is not strictly limited to 32k, though in practice it tends not to grow much larger due to the 32k distance limit, but Info-ZIP allows for bzip2, which has no such limit. Whatever the limits of the compression algorithm, restarting it does reduce compression ratios.
-
Mark Adler about 11 years
No, you cannot start decompressing at any CFDATA block. Read the specification, which is very clear on this point. Within a folder, each CFDATA block can and does use the previous CFDATA blocks as history for compression. The folder is the only abstraction in the specification that defines where you can start decompressing, so it is not only useful, but essential for the random extraction application you are calling attention to in your answer.
-
Mark Adler about 11 years
The deflate dictionary is strictly limited to 32K. It does not "grow" once you're at least 32K into the stream. From there on it is always exactly 32K. bzip2 certainly does have a limit of 900K of history, which is not a sliding dictionary but rather a block on which the BWT transform is applied. Each block is compressed independently, and cannot make use of the information in previous blocks.
-
Mark Adler about 11 years
Since there seems to be no limit to the amount of misinformation you can fabricate, this is no longer productive. I am done commenting on this answer and related comments. Thank you and good night.
- MarcusJ, about 11 years ago: Really good comment, I hadn't even thought of that, and that's a REALLY good point to make.
- allquixotic, about 11 years ago: I fail to see how this answer says something that none of the other answers do, other than directly quoting the questions (which I wrote, BTW, because the original revision of the question was horrible enough to be closed as NARQ). Nice try, though.
- Mark Adler, about 11 years ago: Um, ok. Whatever you'd like to think is fine. Neither your answer nor any other answer seems to address whether there is a performance penalty. Your answer does not address the noticeable compression difference, though others do. Since yours does not actually address performance (your performance section is actually about workflow, not performance), no single answer covers everything in one place. It is interesting that you wrote the performance-penalty question, but you did not answer it! Go figure.
- Mark Adler, about 11 years ago: By the way, your workflow discussion is about something no one ever does, which is to write a tar file to disk and then compress it. `tar` is always used either by calling the compression program directly or by piping directly into a compression program.
- Warren P, about 11 years ago: Sarcasm and sincerity travel poorly across plain text; in person we usually infer them from tone of voice. Guessing whether someone is serious on the internet is a bit difficult.
- Ярослав Рахматуллин, about 11 years ago: @WarrenP thanks for the comment. I'll try to maintain a neutral tone in the future.
- Lajos Veres, about 10 years ago: I don't have enough rep to add an answer, so I'll write here: AFAIK `tar`'s fault tolerance is much higher than that of other similar tools. If you have to save something from a not-so-reliable medium (for example, a network filesystem), `tar` is probably the best tool for saving as much data as possible. rsync and other tools failed when the first error happened, but with `tar` we were able to get past single errors. (It was a not-so-critical daily backup.)
- Steve, over 9 years ago: "Do you want uncompressed data (why?)?" I use `tar` very often to copy a filesystem tree from one place to another and preserve permissions, etc., and compression in this case just takes extra CPU cycles. E.g. `tar cf - * | tar xf - -C /somewhere`.
- Aaron, over 9 years ago: Additionally, you would want a .tar file when the destination filesystem performs de-duplication. Creating compressed archives on a filesystem that performs de-duplication will substantially lower the dedupe ratio. Example: we once deleted a $10,000.00 tar.gz file; meaning, it was taking up $10k worth of storage space because someone used compression.
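Steve's tar-to-tar pipe above is the classic "copy a tree, keep the metadata" idiom and is worth spelling out. A minimal sketch, with made-up `/tmp` demo paths: `-` as the archive name means stdin/stdout, and `-p` on extraction asks `tar` to restore the recorded permissions.

```shell
# Hypothetical demo paths; any source/destination directories work the same way.
rm -rf /tmp/tar_demo_src /tmp/tar_demo_dst
mkdir -p /tmp/tar_demo_src /tmp/tar_demo_dst
echo hello > /tmp/tar_demo_src/file.txt
chmod 640 /tmp/tar_demo_src/file.txt

# Pack to stdout, unpack from stdin: no intermediate archive, no compression,
# and permission metadata travels with the stream.
( cd /tmp/tar_demo_src && tar cf - . ) | ( cd /tmp/tar_demo_dst && tar xpf - )
```

No archive file ever touches the disk; the aggregation format is used purely as an in-flight container, which is exactly the "uncompressed data (why?)" use case Steve describes.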
- underscore_d, over 8 years ago: ^ I think "higher fault tolerance" there actually means 'does not notice errors and will blindly stumble on, whether you want it to or not'!
- agc, almost 8 years ago: Re "So obsessed"... imagine you're stranded in a warzone with a single hardened laptop, and the undersized 20 GB hard drive is nearly full, maybe a gig left. Hearing gunfire from far off, you'd really like to browse a 100 MB PDF manual that shows how to repair the jeep, but the file is inside a 2 GB .tgz file. And the laptop runs a closed-source, strange, proprietary OS, and you don't have root access to delete system files, not that it'd be obvious how to delete 4 GB+ without breaking the dearchiver or the PDF viewer. If you could just extract that 100 MB file...
- gaborous, almost 8 years ago: I disagree that dar doesn't provide enough of a benefit to justify the change: it is way more robust and way less susceptible to corruption (i.e., `tar` produces a solid archive whereas dar does not, so dar allows partial file extraction even from a corrupted archive, whereas you lose all your files in a corrupted tar). In addition, most modern archiving features are natively supported, such as encryption. So certainly the benefits are huge and certainly justify the change; the reason it has not been more widely adopted must therefore be found elsewhere (lack of easy GUI tools? Inertia?).
- gaborous, almost 8 years ago: So I stand with @MarkAdler; this answer is based on incorrect premises: tar does not allow partial file extraction. In fact it's the opposite: if you tar your files before feeding them to zip/Deflate, you lose the ability to partially extract files without uncompressing the archive, because tar can only make solid archives.
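The solid-archive distinction being argued here can be demonstrated with Python's standard library (a sketch with made-up member names): a zip's central directory lets a reader jump straight to one member, each member being compressed independently, while a `tar.gz` is one gzip stream that must be decompressed sequentially until the wanted member is found.

```python
import io
import tarfile
import zipfile

data = {"a.txt": b"A" * 1000, "b.txt": b"B" * 1000}

# Build a zip: each member is compressed on its own.
zbuf = io.BytesIO()
with zipfile.ZipFile(zbuf, "w", zipfile.ZIP_DEFLATED) as z:
    for name, payload in data.items():
        z.writestr(name, payload)

# Build a tar.gz: one gzip stream over the whole aggregate (a "solid" archive).
tbuf = io.BytesIO()
with tarfile.open(fileobj=tbuf, mode="w:gz") as t:
    for name, payload in data.items():
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        t.addfile(info, io.BytesIO(payload))

# zip: the central directory points directly at b.txt.
zbuf.seek(0)
with zipfile.ZipFile(zbuf) as z:
    zip_member = z.read("b.txt")

# tar.gz: tarfile has to decompress and scan the stream to locate b.txt.
tbuf.seek(0)
with tarfile.open(fileobj=tbuf, mode="r:gz") as t:
    tar_member = t.extractfile("b.txt").read()

print(zip_member == tar_member)  # True: same bytes, very different access patterns
```

Both calls return the same content; the difference is that corruption or truncation early in the `tar.gz` stream takes every later member with it, whereas independently compressed zip members can still be recovered individually.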
- gaborous, almost 8 years ago: This answer is the only one that makes sense. Thank you for posting it.
- gaborous, almost 8 years ago: This answers why `tar` fits in the archiving ecosystem (i.e., to aggregate files together, providing a performance boost and some other benefits like permission saving), but it does not address why modern alternatives such as `dar` aren't used in its place. In other words, this answer justifies the usage of file aggregators, but not of the `tar` software in itself.
- phuclv, almost 6 years ago: @Steve CPU cycles may be cheaper than disk IO for algorithms like LZ4 or LZO. That's why they're used in zram and in transparently compressing file systems like NTFS, ZFS, Btrfs... so sometimes it's actually faster to compress, since the amount of disk IO is greatly reduced.
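phuclv's point is easy to sanity-check: when the data is compressible, a cheap compressor drastically shrinks the number of bytes that ever have to hit the disk. A minimal sketch using zlib at level 1 as a stand-in for LZ4/LZO (which are not in the Python standard library; the sample data is made up):

```python
import zlib

# ~132 KiB of highly repetitive data, e.g. a log file
payload = b"some log line that repeats a lot\n" * 4096

# Level 1 = cheapest CPU setting; it still collapses the repetition
fast = zlib.compress(payload, 1)

print(len(payload), len(fast))
# Writing the compressed form means far fewer bytes of disk IO, which on
# fast compressors can outweigh the CPU cost of compressing at all.
```

This is the trade-off zram and transparently compressing filesystems bet on: for IO-bound workloads, spending a little CPU to write a tenth of the bytes is a net win.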