How can I evaluate the best choice of archive format for compressing files?

Solution 1

There is a large variety of compression formats and methods available. Some don't compress at all and are designed only to store a number of files in one archive, while newer experimental compressors (PAQ-based) are designed to compress as aggressively as possible, regardless of how long the operation takes.

You need to evaluate the features you require from your compression method choice, and also consider the context in which it will be used.

Different features and considerations include:

  • Compression ability - Does it shrink the file significantly enough?
  • Ease-of-use - If the file is going to another user, will the archive be easy to extract or will it require more software to be installed?
  • Password protection and/or encryption - Are these security measures required?
  • Multiple volumes support - If the target medium requires the file to be split into appropriate chunks (for example, 650 MB for a CD), does the format support this elegantly?
  • Repairing and recovery - If the file becomes partially corrupt, does it offer a recovery record to aid restoration of data?
  • Unicode support - Does the archiver support international file names or just standard ASCII?
  • System Requirements - Modern compressors such as 7-Zip do offer the ability to increase compression efficiency by using a larger dictionary (a dictionary is a reference of commonly repeated data in a compressed file), but this in turn increases memory consumption at both compression and decompression time.
  • Self-extraction support - Can the archive be rolled into an executable file that is easy to use for whoever receives it? (Also bear in mind that a self-extractor targets a single platform. Generally speaking, a Windows self-extractor will not work on Linux by default unless it is run through a compatibility layer like Wine.)
  • File system attributes - Does the compressor store relevant file system metadata and permissions that may be worth preserving at point of extraction?
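The dictionary trade-off mentioned under System Requirements can be tried out with Python's standard `lzma` module, which uses the same LZMA family of algorithms as 7-Zip. This is only a rough sketch: the helper name `xz_compress` and the test data (one random block stored twice) are made up for illustration, and the dictionary sizes are arbitrary. The idea is that the second copy of the block can only be matched against the first if the dictionary is large enough to reach back that far.

```python
import lzma
import os


def xz_compress(data: bytes, dict_size: int) -> bytes:
    """Compress with LZMA2 using an explicit dictionary size in bytes."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)


# Hypothetical test data: 1 MiB of random bytes, stored twice in a row.
block = os.urandom(1024 * 1024)
data = block + block

small = xz_compress(data, dict_size=64 * 1024)        # 64 KiB dictionary
large = xz_compress(data, dict_size=4 * 1024 * 1024)  # 4 MiB dictionary

print(f"64 KiB dictionary: {len(small)} bytes")
print(f"4 MiB dictionary:  {len(large)} bytes")
```

The larger dictionary should give the smaller archive here, at the price of more memory while compressing and decompressing.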

Generally speaking, ZIP is the most ubiquitous format, but it has weaknesses. The original format cannot handle files or archives over 4 GB; the ZIP64 extension lifts that limit, but not every tool supports it. Security support is generally regarded as poor: the standard password protection can be broken with a known-plaintext attack, and stronger encryption is generally implemented as an unofficial derivative of the format by commercial ZIP software vendors.

Apart from that, most other popular formats will have some form of support on all operating systems by installing more software.

My personal choice is 7-Zip, as it has great and flexible compression, despite having a peculiar user interface on Windows. There are decompressors for Linux and Mac OS X (although not GUI-based as standard).

Solution 2

One thing that comes to mind is a (two-year-old) blog post from Jeff Atwood: File Compression in the Multi-Core Era. In that article he finds that bzip2 outperforms 7-Zip when running on more than two cores.
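To see why a block-based compressor lends itself to multiple cores, here is a minimal sketch using Python's standard `bz2` module. It is not the method from the blog post, just an illustration of the idea: the input is cut into independent chunks, each chunk is compressed as its own bzip2 stream, and the streams are concatenated. The function name `compress_in_chunks`, the chunk size and the worker count are assumptions for the example.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 900 * 1024  # assumed chunk size, roughly bzip2's largest block


def compress_in_chunks(data: bytes, workers: int = 4) -> bytes:
    """Compress independent chunks in parallel and concatenate the streams."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk becomes a complete bzip2 stream of its own.
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


data = b"some repetitive log line that compresses well\n" * 100_000
packed = compress_in_chunks(data)

# bz2.decompress accepts several concatenated streams.
restored = bz2.decompress(packed)
print(len(data), len(packed), restored == data)
```

Because no chunk can refer to data in another chunk, redundancy that spans chunk boundaries goes unused, which is the price paid for the parallelism.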

Solution 3

As others have mentioned, the choice of a particular compression format is heavily dependent on the use and the intended audience.

  • .tar.gz and .tar.bz2 archives are ideal for use on Linux systems (and by extension for sharing files with Linux users) because the tar, gzip and bzip2 tools are largely ubiquitous on the platform, and because the .tar format has full support for Unix permissions and other platform-specific properties. The choice between gzip and bzip2 to compress the tar archive is mainly a decision about speed versus compression ratio, with bzip2 delivering smaller files but at a much slower compression speed. The disadvantages of these formats include less compatibility with Windows and the (potential) need to decompress the entire archive to extract a single file.

  • ZIP archives can be extracted on most platforms using native tools, so it is an ideal choice for sending an archive to a non-technical user who would be uncomfortable with installing third-party archive software such as 7-Zip. The compression level isn't as good as more advanced algorithms and it doesn't support Unix permissions, but it is an excellent format if you wanted to send an archive of holiday photos to your grandmother, for example. ZIP also provides some basic password protection, and can quickly extract a file from anywhere in the archive.

  • 7-Zip is good if you want the best possible compression ratios. Like ZIP, it doesn't support Unix file permissions or ownership, and is also not installed by default on most platforms which makes it slightly more work to use, but it may be worth it on Windows if the compression ratio gains are important. In an all-Linux environment it would be better to use the 'xz' or 'lzma' compression tools along with tar, which operate in exactly the same way as 'gzip' and 'bzip2' but use the more advanced LZMA algorithm like 7-Zip.
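Two of the points above, tar keeping Unix permissions and ZIP allowing a single file to be pulled out directly, can be illustrated with Python's standard `tarfile` and `zipfile` modules. This is only a sketch on in-memory archives; the file names and contents are invented for the example.

```python
import io
import tarfile
import zipfile

payload = b"#!/bin/sh\necho hello\n"

# tar.gz: each TarInfo entry carries Unix metadata such as permission bits.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="hello.sh")
    info.size = len(payload)
    info.mode = 0o755  # mark the script as executable
    tar.addfile(info, io.BytesIO(payload))

tar_buf.seek(0)
with tarfile.open(fileobj=tar_buf, mode="r:gz") as tar:
    restored_mode = tar.getmember("hello.sh").mode

# ZIP: the archive's index lets a reader fetch one member by name.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("photos/a.txt", b"first")
    zf.writestr("photos/b.txt", b"second")

with zipfile.ZipFile(zip_buf) as zf:
    single = zf.read("photos/b.txt")

print(oct(restored_mode), single)
```

With a compressed tar, by contrast, the reader generally has to decompress the stream up to the wanted member, which is the extraction cost mentioned in the first bullet.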

Solution 4

To your first question, 7-Zip is an archiver that can use many algorithms to compress and decompress data.

To your second question, just make sure that the platform supports tools that support the given format. For example, I would avoid using RAR on a Mac. While it is possible to use, and there are free utilities that support it, they lack the much richer interface that Windows utilities that support RAR have (in my experience).

Solution 5

Just as an example, I use the mentioned formats in these cases:

  • Text files (logs especially): bz2
  • Collection of files to be distributed (e.g. source code): gz (tar.gz really).
  • Assorted files: 7zip. I can compress almost anything in a very efficient way. Cross-platform, open-source, stable, lightweight, file (header and data) encryption,... Can you ask for anything else? :)
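Rules of thumb like these are easy to check against your own data. The following rough sketch uses Python's standard `zlib`, `bz2` and `lzma` modules as stand-ins for the DEFLATE (ZIP/gzip), bzip2 and LZMA (xz/7z) families; the function name `compare_codecs` and the log-like sample are invented for the example, and real archivers add their own container overhead and tuning.

```python
import bz2
import lzma
import time
import zlib


def compare_codecs(data: bytes) -> dict:
    """Return {codec name: (compressed size in bytes, seconds taken)}."""
    codecs = {
        "zlib": lambda d: zlib.compress(d, 9),  # DEFLATE, as in ZIP and gzip
        "bz2": lambda d: bz2.compress(d, 9),    # as in .bz2
        "lzma": lambda d: lzma.compress(d),     # as in xz and 7z
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        size = len(compress(data))
        results[name] = (size, time.perf_counter() - start)
    return results


# Hypothetical sample: repetitive, log-like text.
sample = b"2022-09-18 12:00:00 INFO request handled in 12 ms\n" * 20_000

results = compare_codecs(sample)
for name, (size, seconds) in results.items():
    print(f"{name}: {size} bytes ({size / len(sample):.2%}), {seconds:.3f} s")
```

Swap in a representative file of your own for `sample`; the ranking can change a lot between text, source code and already-compressed media.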

I avoid RAR altogether, and whenever I receive a RAR file from someone I know, I tell him/her to stop using that format since it is proprietary, and that he/she is probably using unlicensed software (most people download WinRAR's trial version and keep using it forever).

PS: I run Ubuntu (primarily) and Windows (both dual boot and VirtualBox).

Author: user541686

Updated on September 18, 2022

Comments

  • user541686
    user541686 over 1 year

    In general, I've observed the following:

    • Linux-y files or tools use bzip2 or gzip for distributing archives
    • Windows-y files or tools use ZIP for distributing archives
    • Many people use 7-Zip for creating and distributing their own archives

    Questions:

    • What are the advantages and disadvantages of these formats, all of which appear to be open formats? When/why should I choose one (say, 7-Zip) over another (say, ZIP)?
    • Why does the trend above appear to hold, even though all of these are portable formats? Are there any particular advantages to using a particular archive format on a particular platform?
    • 100rabh
      100rabh almost 13 years
    • user541686
      user541686 almost 13 years
      @Sathya, @Andreas: Thanks for the links, those are helpful and answer parts of my question. :)
    • UNK
      UNK almost 13 years
      Compression is a pretty complex field, and no one algorithm can produce optimal results for everything - furthermore, it's a problem you can throw resources at and get better results, but also one that can be done almost as well in much less time. Some algorithms focus on being fast and memory light, some focus on producing the smallest possible file regardless of how long it takes or whether you need 12GB RAM (not exaggerating) to do it, so on.
    • Yitzchak
      Yitzchak almost 13 years
      @Phoshi, this should be an answer.
    • UNK
      UNK almost 13 years
      @Yitz; I think @Ruairi's answer covers the specifics pretty well, and it doesn't really answer the question - just answered why the question could be asked at all.
    • cwd
      cwd almost 13 years
      Two notes / gotchas on Linux systems: remember that .tar by itself doesn't really compress, it just sticks all the files into one, which is why you usually see .tar.gz files. Also, gzip and gunzip behave differently from zip: zip leaves the original files behind after (de)compressing, whereas gzip sort of "converts" them. In a folder with only test.txt, "gzip test.txt" results in one file, "test.txt.gz", and "gunzip test.txt.gz" likewise leaves the folder with just one file, test.txt.
    • Yitzchak
      Yitzchak almost 13 years
      @phoshi, you're right.
  • hammar
    hammar almost 13 years
    If the archive is meant for distribution, it's also important to consider your target audience and use a format that's supported by default on their platform. Accessibility may be more important than the other considerations in this case.
  • CarlF
    CarlF almost 13 years
    Whereas I personally hate the graphical rar programs and always use the command line, even on Windows.
  • user541686
    user541686 almost 13 years
    +1 thanks for the information, though it would've been even better to mention which formats support those bullet points. :)
  • user541686
    user541686 almost 13 years
    +1 omg! I didn't know that. The compression ratio seems to not be worth it, though. :)
  • cregox
    cregox almost 13 years
    That post is more than 2 years old. Doesn't 7-zip work better with more than two cores now?
  • Ruairi Fullam
    Ruairi Fullam almost 13 years
    BZIP2 compresses more efficiently over multiple cores because it compresses into 100-900KB blocks, thus can spread blocks over separate cores, but the compression efficiency is lost as these blocks are considered to be distinct from each other.
  • Ruairi Fullam
    Ruairi Fullam almost 13 years
    I was tempted but there are a multitude of formats available, which would take a long time to list. Wikipedia does have a good feature matrix of compression formats which may help: en.wikipedia.org/wiki/Comparison_of_archive_formats
  • Ruairi Fullam
    Ruairi Fullam almost 13 years
    That point is certainly debatable. I've not encountered that particular problem, but I can see it occurring; I suppose it's all a question of the end goal of creating the archive and the expected longevity of the files' use. Certainly, if you have an old archive from the DOS era that's difficult to extract, you could use DOSBox, or even create a VM if needed.
  • gaborous
    gaborous over 7 years
    Zip is the most future proof solution and is advised by the UK's National Archive because it is non-solid and very stable compared to gzip, tar or 7-zip.