Best compression algorithm for XML?

40,226

Solution 1

There is a W3C standard named EXI (Efficient XML Interchange), not yet released at the time of writing.

It should become THE data format for compressing XML data in the future (it is claimed to be the last binary format that will ever be needed). Being optimized for XML, it compresses XML far more efficiently than any conventional compression algorithm.

With EXI, you can operate on compressed XML data on the fly (without the need to uncompress or re-compress it).

EXI = (XML + XMLSchema) as binary.

Here is an open-source implementation (I don't know whether it is stable yet):
EXIficient

Solution 2

Another alternative to "compress" XML would be FI (Fast Infoset).

XML stored as FI contains every tag and attribute name only once; all other occurrences reference the first one, thus saving space.
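The "store each name once, reference it afterwards" idea can be sketched in a few lines of Python. This is only a toy illustration of name indexing, not the actual Fast Infoset encoding; the function name `index_names` and the event tuples are made up for the example.

```python
import xml.etree.ElementTree as ET


def index_names(xml_text):
    """Replace element and attribute names with indexes into a name table.

    Toy illustration only (not the Fast Infoset wire format): a name is
    added to the table the first time it is seen, and every occurrence is
    emitted as an integer index. Tail text after a child element is
    ignored to keep the sketch short.
    """
    table = {}   # name -> index, in order of first appearance
    events = []  # ("start", name_index, attrs) / ("text", str) / ("end",)

    def ref(name):
        # setdefault stores the next free index only for unseen names
        return table.setdefault(name, len(table))

    def walk(elem):
        tag_index = ref(elem.tag)
        attrs = [(ref(key), value) for key, value in elem.attrib.items()]
        events.append(("start", tag_index, attrs))
        if elem.text and elem.text.strip():
            events.append(("text", elem.text.strip()))
        for child in elem:
            walk(child)
        events.append(("end",))

    walk(ET.fromstring(xml_text))
    names = sorted(table, key=table.get)
    return names, events


SAMPLE = """<verylongtagnumberone>
  <verylongtagnumbertwo id="a">text</verylongtagnumbertwo>
  <verylongtagnumbertwo id="b">more text</verylongtagnumbertwo>
</verylongtagnumberone>"""

names, events = index_names(SAMPLE)
print(names)   # each long name appears exactly once here
print(events)  # the document body only carries small integers
```

The long names end up in the table once, and the second `verylongtagnumbertwo` element costs only an integer reference.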

See:

Very good article on java.sun.com, and of course
the Wikipedia entry

The difference from EXI, from a compression point of view, is that Fast Infoset (being structured plaintext) is less efficient.

Another important difference: FI is a mature standard with many implementations.
One of them: Fast Infoset Project @ dev.java.net

Solution 3

Yes, *.zip is best in practice. The gory details are in this USENIX paper, which shows that "optimal" compressors are not worth the computational cost and that domain-specific compressors don't beat zip [on average].

Disclaimer: I wrote that paper, which has been cited 60+ times according to Google.
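If you want to check this against your own documents, a rough comparison using only the Python standard library could look like the sketch below. The sample XML is synthetic and far more repetitive than real documents, so the ratios it prints are not representative; `make_sample_xml` and `compare` are names invented for this example.

```python
import bz2
import lzma
import zlib


def make_sample_xml(n_records=500):
    """Build a synthetic, repetitive XML document (an assumption for the demo)."""
    rows = [
        f'  <verylongtagnumbertwo id="{i}" kind="example">text {i}</verylongtagnumbertwo>'
        for i in range(n_records)
    ]
    body = "\n".join(rows)
    return f"<verylongtagnumberone>\n{body}\n</verylongtagnumberone>".encode("utf-8")


def compare(data):
    """Return the compressed size in bytes for a few general-purpose codecs."""
    blobs = {
        "deflate (zlib, level 9)": zlib.compress(data, 9),
        "bzip2 (level 9)": bz2.compress(data, 9),
        "lzma/xz (default preset)": lzma.compress(data),
    }
    return {name: len(blob) for name, blob in blobs.items()}


data = make_sample_xml()
sizes = compare(data)
print(f"original: {len(data)} bytes")
for name, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{name}: {size} bytes ({size / len(data):.1%} of original)")
```

Swap `make_sample_xml()` for the bytes of a real document to get numbers that mean something for your format.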

Solution 4

It seems like you're more interested in compression than encryption. Is that the case? If so, this might prove an interesting read, even though it is not an exact solution.

Solution 5

By the way, the scenario is this: I am creating a standard for documents, like ODF or MS Office XML, that contain XML files, packaged in a .zip.

Then I'd suggest you use .zip compression, or your users will get confused.
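For what it's worth, packaging XML parts into a DEFLATE-compressed .zip takes only a few lines with Python's standard `zipfile` module. This is a minimal sketch; the part names `content.xml` and `meta.xml` and the helper `build_package` are hypothetical and not part of any existing format.

```python
import io
import zipfile


def build_package(parts):
    """Write XML parts into an in-memory .zip using DEFLATE compression."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, xml_text in parts.items():
            archive.writestr(name, xml_text)
    return buffer.getvalue()


# Hypothetical part names, chosen only for illustration.
PARTS = {
    "content.xml": "<verylongtagnumberone><verylongtagnumbertwo>text"
                   "</verylongtagnumbertwo></verylongtagnumberone>",
    "meta.xml": "<meta><title>Example</title></meta>",
}

package = build_package(PARTS)

# Any ordinary zip tool can open the result; here it is read back in Python.
with zipfile.ZipFile(io.BytesIO(package)) as archive:
    listing = archive.namelist()
    content = archive.read("content.xml").decode("utf-8")

print(listing)
print(content)
```

Because the container is a plain .zip, users can inspect or repair a document with the tools they already have.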

Author: Aethex

Updated on April 30, 2020

Comments

  • Aethex
    Aethex about 4 years

    I barely know a thing about compression, so bear with me (this is probably a stupid and painfully obvious question).

    So let's say I have an XML file with a few tags.

    <verylongtagnumberone>
      <verylongtagnumbertwo>
        text
      </verylongtagnumbertwo>
    </verylongtagnumberone>
    

    Now let's say I have a bunch of these very long tags with many attributes in multiple XML files. I need to compress them to the smallest size possible. The best way would be to use an XML-specific algorithm that assigns individual tags pseudonyms like vlt1 or vlt2. However, this wouldn't be as 'open' as what I'm going for, and I want to use a common algorithm like DEFLATE or LZ. It would also help if the archive were a .zip file.

    Since I'm dealing with plain text (no binary files like images), I'd like an algorithm that suits plain text. Which one produces the smallest file size (lossless algorithms are preferred)?

    By the way, the scenario is this: I am creating a standard for documents, like ODF or MS Office XML, that contain XML files, packaged in a .zip.

    EDIT: The 'encryption' thing was a typo; it should have been 'compression'.

  • J-16 SDiZ
    J-16 SDiZ almost 15 years
    Ugh.. XML was designed because "binary files are evil". And now we have this EXI stuff. This proves XML was just reinventing the wheel. Shouldn't we have used ASN.1?
  • ivan_ivanovich_ivanoff
    ivan_ivanovich_ivanoff almost 15 years
    Some substandard (or something) of ASN.1 was a candidate for EXI. Binary files are evil. EXI is not a binary file in the usual sense. You don't need to write your own implementation to read/write this binary file, nor do you have to define your own structure and type system. It's all done for you by XML+XmlSchema.
  • Steven Sudit
    Steven Sudit about 13 years
    We should probably mention that the reason EXI won out over FI is that, when there's a schema, it can contain tags and attributes ZERO times instead of once.
  • Steven Sudit
    Steven Sudit about 13 years
    Yes, plus zipping already-compressed XML isn't going to yield any further compression.
  • Brady Moritz
    Brady Moritz about 8 years
    JSON really is not any smaller than XML, though
  • unbob
    unbob over 4 years
    old link seems to be dead; new link, courtesy of archive.org and google: gnosis.cx/publish/programming/xml_matters_13.html