Compression formats with good support for random access within archives?


Solution 1

I don't know of any compressed file format which would support random access to a specific location in the uncompressed data (well, except for multimedia formats), but you can brew your own.

For example, bzip2-compressed files consist of independently compressed blocks (each less than 1 MB uncompressed) that are delimited by sequences of magic bytes, so you could parse the bzip2 file, find the block boundaries, and then decompress just the block you need. This would need some indexing to remember where the blocks start.
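
Here is a minimal Python sketch of that idea, assuming a single-stream bzip2 file (the file name is a placeholder): it scans for the 48-bit block-header magic at bit granularity, because block headers are not byte-aligned. Hits can be false positives, and actually decompressing a single block still needs extra bookkeeping (bit-shifting the block and prepending a stream header), so treat this only as a starting point.

    BLOCK_MAGIC = 0x314159265359  # bzip2 block-header magic (first digits of pi)

    def candidate_block_offsets(path):
        """Yield bit offsets of candidate bzip2 block headers.

        Block headers are not byte-aligned, so the scan keeps a rolling
        48-bit window and checks it after every bit. Matches may be false
        positives if the pattern happens to occur inside compressed data.
        """
        window, mask, bitpos = 0, (1 << 48) - 1, 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(1 << 16)
                if not chunk:
                    break
                for byte in chunk:
                    for shift in range(7, -1, -1):
                        window = ((window << 1) | ((byte >> shift) & 1)) & mask
                        bitpos += 1
                        if window == BLOCK_MAGIC:
                            yield bitpos - 48  # bit offset where the magic starts

    # Example (hypothetical file name):
    # index = list(candidate_block_offsets("big.bz2"))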

Still, I think the best solution would be to split your file into chunks of a size you choose and then compress each chunk with an archiver such as zip or rar, which supports random access to individual files in the archive.
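
As a sketch of that chunk-and-archive approach (file and member names are just placeholders), Python's standard zipfile module is enough: each fixed-size chunk becomes its own deflate-compressed member, and the zip central directory gives random access to any chunk without touching the others.

    import zipfile

    CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB of uncompressed data per member (arbitrary)

    def chunk_into_zip(src_path, zip_path):
        """Split src_path into fixed-size chunks, each stored as its own
        deflate-compressed member, so any chunk can be read independently."""
        with open(src_path, "rb") as src, \
             zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            index = 0
            while True:
                chunk = src.read(CHUNK_SIZE)
                if not chunk:
                    break
                zf.writestr("chunk%06d" % index, chunk)
                index += 1

    def read_chunk(zip_path, index):
        """Random access: decompress only the requested member."""
        with zipfile.ZipFile(zip_path) as zf:
            return zf.read("chunk%06d" % index)

    # Uncompressed offset X lives in member number X // CHUNK_SIZE.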

Solution 2

Take a look at dictzip. It is compatible with gzip and allows coarse random access.

An excerpt from its man page:

dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner which is completely compatible with the gzip file format. An extension to the gzip file format (Extra Field, described in 2.3.1.1 of RFC 1952) allows extra data to be stored in the header of a compressed file. Programs like gzip and zcat will ignore this extra data. However, [dictzcat --start] will make use of this data to perform pseudo-random access on the file.

The dictzip package is available in Ubuntu; its source code is in the dictd-*.tar.gz tarball. Its license is the GPL, so you are free to study it.
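
For illustration, here is a small Python sketch (not dictzip's own code) that reads the Extra Field the excerpt refers to. Per RFC 1952 the field is a sequence of subfields, each tagged with a two-byte ID; dictzip uses the ID 'RA' and, as its man page describes, stores the chunk length and the compressed size of every chunk there, which is what lets dictzcat map an uncompressed offset to a compressed one.

    import struct

    def read_gzip_extra_subfields(path):
        """Parse the optional Extra Field of a gzip header (RFC 1952) and
        return a list of (subfield_id, payload) pairs. dictzip stores its
        chunk table in the subfield with ID b'RA'."""
        with open(path, "rb") as f:
            header = f.read(10)
            if header[:2] != b"\x1f\x8b":
                raise ValueError("not a gzip file")
            flags = header[3]
            if not flags & 0x04:          # FEXTRA bit not set: no extra field
                return []
            xlen, = struct.unpack("<H", f.read(2))
            extra = f.read(xlen)
        subfields, pos = [], 0
        while pos + 4 <= len(extra):
            sid = extra[pos:pos + 2]
            slen, = struct.unpack("<H", extra[pos + 2:pos + 4])
            subfields.append((sid, extra[pos + 4:pos + 4 + slen]))
            pos += 4 + slen
        return subfields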

Update:

I improved dictzip to have no file size limit. My implementation is under the MIT license.

Solution 3

The .xz file format (which uses LZMA compression) seems to support this:

Random-access reading: The data can be split into independently compressed blocks. Every .xz file contains an index of the blocks, which makes limited random-access reading possible when the block size is small enough.

This should be sufficient for your purpose. A drawback is that the API of liblzma (for interacting with these containers) does not seem that well-documented, so it may take some effort figuring out how to randomly access blocks.
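
A quick way to experiment with this from the command line instead of liblzma (shown here via Python's subprocess; the file name is a placeholder and the flags assume a reasonably recent xz, 5.2 or later): force small independent blocks at compression time, then dump the block index that every .xz file carries.

    import subprocess

    # Compress into independently decompressible 1 MiB blocks; --keep preserves
    # the original file. Without --block-size, single-threaded xz writes one
    # large block, which defeats random access.
    subprocess.run(["xz", "--keep", "--block-size=1MiB", "data.bin"], check=True)

    # Double verbosity makes `xz --list` print per-block offsets and sizes,
    # i.e. the index a reader would use to seek.
    subprocess.run(["xz", "--list", "-vv", "data.bin.xz"], check=True)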

Solution 4

Solutions exist for providing random access to gzip and bzip2 archives; for gzip, see zran.c in the examples directory of the zlib source distribution, which builds an index of access points over a gzip stream, and comparable block-based indexers exist for bzip2.

(I'm looking for something for 7zip)

Solution 5

bgzip can compress files in a gzip variant which is indexable (and can be decompressed by gzip). This is used in some bioinformatics applications, together with the tabix indexer.

See explanations here: http://blastedbio.blogspot.fr/2011/11/bgzf-blocked-bigger-better-gzip.html, and here: http://www.htslib.org/doc/tabix.html.

I don't know to what extent it is adaptable to other applications.
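
If you work from Python, one ready-made reader/writer for this format is Biopython's Bio.bgzf module; the sketch below assumes Biopython is installed and uses a placeholder file name. BGZF positions are 64-bit "virtual offsets" that combine the compressed offset of a block with an offset inside its decompressed contents, so seeking decompresses only the one block (at most 64 KiB) you land in.

    from Bio import bgzf  # Biopython's BGZF (blocked gzip) implementation

    # Write some data; it is carved into independent gzip blocks of <= 64 KiB.
    writer = bgzf.BgzfWriter("data.bgz", "wb")
    writer.write(b"some sorted line\n" * 200000)
    writer.close()

    # tell() returns a virtual offset; seek() later jumps straight to that
    # block instead of decompressing everything before it.
    reader = bgzf.BgzfReader("data.bgz", "rb")
    reader.readline()
    voffset = reader.tell()
    reader.seek(voffset)
    print(reader.readline())
    reader.close()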


Comments

  • John Zwinck
    John Zwinck almost 2 years

    This is similar to a previous question, but the answers there don't satisfy my needs and my question is slightly different:

    I currently use gzip compression for some very large files which contain sorted data. When the files are not compressed, binary search is a handy and efficient way to support seeking to a location in the sorted data.

    But when the files are compressed, things get tricky. I recently found out about zlib's Z_FULL_FLUSH option, which can be used during compression to insert "sync points" in the compressed output (inflateSync() can then begin reading from various points in the file). This is OK, though files I already have would have to be recompressed to add this feature (and strangely gzip doesn't have an option for this, but I'm willing to write my own compression program if I must).

    It seems from one source that even Z_FULL_FLUSH is not a perfect solution...not only is it not supported by all gzip archives, but the very idea of detecting sync points in archives may produce false positives (either by coincidence with the magic number for sync points, or due to the fact that Z_SYNC_FLUSH also produces sync points but they are not usable for random access).

    Is there a better solution? I'd like to avoid having auxiliary files for indexing if possible, and explicit, default support for quasi-random access would be helpful (even if it's large-grained--like being able to start reading at each 10 MB interval). Is there another compression format with better support for random reads than gzip?

    Edit: As I mentioned, I wish to do binary search in the compressed data. I don't need to seek to a specific (uncompressed) position--only to seek with some coarse granularity within the compressed file. I just want support for something like "Decompress the data starting roughly 50% (25%, 12.5%, etc.) of the way into this compressed file."

  • Vilx-
    Vilx- over 15 years
    Or you could TAR them all and end up with a .GZ.TAR. :)
  • William Brendel
    William Brendel over 15 years
    That would definitely make things cleaner. I was just trying to go for simplicity here, but your suggestion is well taken :-)
  • William Brendel
    William Brendel over 15 years
    Well, yes and no. With fixed size chunks (10 MB in this case), you would not have to walk through a list of headers. This relies on the assumption that the tar will order the files alphabetically (which happens to be the case in GNU-land).
  • BetweenTwoTests
    BetweenTwoTests over 15 years
    Yes, but the files would not be compressed then (10 MB uncompressed for your indexing expression to work, 10 MB compressed for direct access in tar to work). It's hard to compress anything to a fixed size, although you could make that size sufficiently large and handle excess space with sparse files
  • Michael Burr
    Michael Burr over 15 years
    If you want the various chunk files to be packaged in a single archive, you could just use an archive format like .zip that supports a directory allowing random access to individual files. William Brendel's key idea is to chunk the original file and compress the chunks independently.
  • Michael Burr
    Michael Burr over 15 years
    This is a similar concept to what Windows NTFS compression does to allow random access to data - each cluster (or 2K chunk - or some unit) is independently compressed.
  • John Zwinck
    John Zwinck over 15 years
    I don't need to seek to a specific uncompressed position--only to seek somewhat randomly with some coarse granularity within the compressed file. I don't mind at all if all I can do is say "uncompess the data starting here, about 700MB into this file."
  • John Zwinck
    John Zwinck over 15 years
I don't want a lot of separate files (I have quite a lot as it is!). The tar idea is interesting, but I'm not sure I want to have a second container type in each file (I'd have to write some code which handles tar & untar... which is doable, though not really what I was aiming for).
  • William Brendel
    William Brendel over 15 years
    All those who pointed out that tarring the files would break random access: you are correct. I wasn't thinking :-) I'll remove that point from my answer. Thanks for the observations.
  • Jonathan Leffler
    Jonathan Leffler over 15 years
    @John Zwinck: Add your comment to your question as an update. Note that given the variable compression of data (some stuff I compress shrinks by 94% or so - usually, except when it only shrinks by about 50% or so), your estimate of where to start decompressing might be very hit and miss.
  • John Zwinck
    John Zwinck over 13 years
    I solved my problem through the use of gzip sync/flush points, which allow me to scan through the file (doing binary search) just fine. I had to write my own gzip-like program on top of libz, because the standard gzip for whatever reason doesn't include a facility to write sync points. Anyway, this works great in my case, because I do not care about being able to "read starting at byte 10000", only to "read starting about 50% of the way through the file." The dictzip approach does look very interesting, and solves a perhaps more general problem than mine.
  • John Zwinck
    John Zwinck over 12 years
    Wow, interesting thoughts on an old question of mine. Your first suggestion (squashfs) is not entirely what I would want, because it has implications for remote storage: using a compressed filesystem and compressed SSH connections, you would manage to decompress the data and recompress it to send it over the network. What would be amazing would be something like a compressed filesystem that you could share via NFS. Which I guess is what your clicfs suggestion might yield. Documentation on clicfs seems quite hard to come by (at least by my quick search), but it's promising. Thank you.
  • Jacek Kaniuk
    Jacek Kaniuk about 10 years
Do I understand correctly that you've used GPL-licensed source code as a base for your MIT-licensed program? That's a violation of the GPL license!
  • Ivo Danihelka
    Ivo Danihelka about 10 years
    @JacekKaniuk Sorry for confusing you. The dictzip file format specification was used and reimplemented from scratch in Python.
  • TJez
    TJez over 9 years
@JohnZwinck You mentioned "false positives" being a potential problem. Did you solve that? I assume your solution was in C? Sounds like I need exactly the same functionality; any tips or code would be appreciated.
  • John Zwinck
    John Zwinck over 9 years
    @TroyJ: if you control the writing of the files, false positives are not going to happen often, and when they do you may know it because decompression from those points will fail (and you can try again). If you do not control the writing, things are trickier: standard gzip-writing programs will emit lots of false positives and no true positives. You could retry N times before giving up; in my experience N will only need to be a small number (less than 10) for the system to be reasonably accurate.
  • tommy.carstensen
    tommy.carstensen over 9 years
    Where is the documentation? Can I use dictzip to read the last line of my file, if I know either the byte position of the start of the last line in the uncompressed file or the total line count? This is potentially very cool! Does it allow the compressed file to be tabix indexed as would have been possible with a bgzipped file?
  • Alex Reynolds
    Alex Reynolds over 9 years
    Updated link to Github site.
  • tommy.carstensen
    tommy.carstensen over 9 years
    "BEDOPS also introduces a novel and lossless compression format called Starch that reduces whole-genome BED datasets to ~5% of their original size (and BAM datasets to roughly 35% of their original size)" <-- This is amazing. You should advertise your tool.
  • Alex Reynolds
    Alex Reynolds over 9 years
    Samtools faidx doesn't compress near as well as Starch, and it requires keeping a second file with the genomic data, but it offers finer indexing and so is more popular. Starch works really well if you need to squeeze out space or you're doing whole-genome work and want to parallelize tasks by chromosome. I'm working on "Starch 2", which will offer base-level interval queries, but that may be a few months out.
  • tommy.carstensen
    tommy.carstensen over 9 years
    Compression of bam to 35% is even better than the cram format. I must read the paper when home. I can't believe this is not widely used.
  • Alex Reynolds
    Alex Reynolds over 9 years
Just a note: this is complicated by bzip2 block boundaries falling within a byte (they are bit-aligned rather than byte-aligned), so it is doable, but more bookkeeping is required.
  • hoxnox
    hoxnox almost 9 years
I wrote a stdio-like library and a multithreaded compression utility. Sources are available on GitHub: github.com/hoxnox/csio
  • Adam Katz
    Adam Katz over 8 years
    @hoxnox - does your csio utility allow random reading of pre-existing gzipped data, or just gzipped data created by csio?
  • Adam Katz
    Adam Katz over 8 years
@JohnZwinck - it sounds like you solved the problem. Could you share your code?
  • John Zwinck
    John Zwinck over 8 years
@AdamKatz: I cannot share the code, partly because it's tightly integrated with a proprietary data format, so nobody would have use for it directly. However, the idea is to write "full sync points" every so often when compressing (say once per MB), then make your reader scan for these points and verify that the messages make sense when you decompress. The difficulties are mostly (1) the standard gzip tool doesn't have an option to insert full sync points at all, (2) you need to write your own heuristic to verify valid messages when resuming. (A rough sketch of this idea follows after these comments.)
  • hoxnox
    hoxnox over 8 years
    @AdamKatz - gzipped data created by csio or dictzip
  • Stephane Chazelas
    Stephane Chazelas almost 8 years
Yes, that's used for instance by pixz for random access to members of tar archives, or by nbdkit for accessing xz-compressed files as NBD devices (to be able to mount compressed disk images, for instance). qcow2 (the native format for QEMU disk images) is another format that allows compression and random access.
  • malthe
    malthe about 5 years
    From the information in the original question, SquashFS is exactly what you're asking for. It would of course be ideal if you didn't have to decompress and recompress over a network, but if your SquashFS is set up with a fast decompression algorithm, then the total cost of the decompress + compress is presumably negligible.
  • jena
    jena about 2 years
FYI: tabix does indexing and access by biological coordinates (e.g. chromosome + nucleotide position) while grabix does indexing and access by file coordinates (e.g. line numbers). They both work great for tabular data; other data may be tricky.
  • jena
    jena about 2 years
BTW Heng Li has a nice backstory to bgzip and the BGZF format on his blog, and even a timeline of the ideas. It's kind of mind-blowing how it all happened within the span of two months.
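
For anyone wanting to try the sync-point approach John Zwinck describes in the comments above, here is a rough Python sketch (his actual solution is in C and not public, so this is only an illustration of the idea, with arbitrary constants): compress with raw deflate, emit a full flush every megabyte, and on the reading side scan for the byte-aligned 00 00 FF FF marker and restart inflation just after it, validating the output to weed out false positives.

    import zlib

    SYNC_MARKER = b"\x00\x00\xff\xff"   # empty stored block emitted by a full flush
    SYNC_EVERY = 1024 * 1024            # insert a sync point roughly every 1 MB

    def compress_with_sync_points(data):
        """Raw deflate (wbits=-15) with a full flush every SYNC_EVERY bytes.
        After a full flush the stream is byte-aligned and the compressor's
        history is reset, so decompression can restart at the next block."""
        comp = zlib.compressobj(6, zlib.DEFLATED, -15)
        out = []
        for i in range(0, len(data), SYNC_EVERY):
            out.append(comp.compress(data[i:i + SYNC_EVERY]))
            out.append(comp.flush(zlib.Z_FULL_FLUSH))
        out.append(comp.flush(zlib.Z_FINISH))
        return b"".join(out)

    def decompress_from(compressed, start_hint):
        """Scan forward from start_hint for a sync marker and inflate from
        just after it. A match can be a false positive, so a real reader
        should sanity-check the decompressed output and retry on failure."""
        pos = compressed.find(SYNC_MARKER, start_hint)
        if pos < 0:
            raise ValueError("no sync point found")
        decomp = zlib.decompressobj(-15)
        return decomp.decompress(compressed[pos + len(SYNC_MARKER):])

In practice you would wrap this in a gzip header/trailer or your own framing; the raw-deflate form is used here only to keep the example short.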