How to check if a file is gzip compressed?

Solution 1

Gzip files begin with a magic number. Read the first two bytes and check that they are 0x1f followed by 0x8b.
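A minimal sketch of that check in C, using only the standard library. The function names `has_gzip_magic` and `file_has_gzip_magic` are made up for illustration, and the bytes are compared one at a time so the result does not depend on the machine's byte order.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf starts with the gzip magic
 * bytes 0x1f 0x8b, otherwise 0. */
int has_gzip_magic(const unsigned char *buf, size_t len)
{
    /* Compare byte by byte rather than as one 16-bit value, so the
     * check behaves the same on little- and big-endian machines. */
    return len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b;
}

/* Hypothetical helper: returns 1 if the file starts with the gzip
 * magic, 0 if it does not (or is shorter than two bytes), and -1 if
 * the file cannot be opened. */
int file_has_gzip_magic(const char *path)
{
    unsigned char head[2];
    size_t n;
    FILE *fp = fopen(path, "rb");   /* binary mode matters on Windows */

    if (fp == NULL)
        return -1;
    n = fread(head, 1, sizeof head, fp);
    fclose(fp);
    return has_gzip_magic(head, n);
}
```

A caller could then choose the zlib path only when `file_has_gzip_magic(path) == 1` and fall back to plain `fread()` otherwise.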

Solution 2

Do you prefer false positives, false negatives, or no false results at all (there goes performance down the drain...)?

RFC 1952 (GZIP file format specification version 4.3) states that the first two bytes of each member, and therefore of the file, are '\x1F' and '\x8B'. Use that as a first check, bearing in mind that it can produce false positives.

Solution 3

What is the difference in performance between reading compressed and uncompressed files using gzread()?

Anyway, in order to detect if a file is gzipped, you can read the magic number at the beginning of the file, which is 1f 8b according to the link.

Solution 4

You can test for the signatures described in RFCs 1950 (zlib) and 1952 (gzip) to get an idea. For gzip files the second one is the relevant one, and its signature is definitive. Other formats can still produce false positives, so you should check as much of the header as you can for plausible values.
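As a rough illustration of checking more of the header, the sketch below tests the fixed 10-byte gzip member header from RFC 1952: the two ID bytes, the compression method (8 means deflate), and the reserved flag bits, which must be zero. The function name `gzip_header_plausible` and the choice of fields are assumptions made for this example. Passing the check still does not prove the data is valid gzip.

```c
#include <stddef.h>

/* Fixed part of a gzip member header (RFC 1952):
 * ID1 ID2 CM FLG MTIME(4 bytes) XFL OS */
#define GZIP_HEADER_LEN 10

/* Hypothetical helper: returns 1 if buf looks like the start of a
 * gzip member, otherwise 0. */
int gzip_header_plausible(const unsigned char *buf, size_t len)
{
    if (len < GZIP_HEADER_LEN)
        return 0;                       /* too short to hold a header */
    if (buf[0] != 0x1f || buf[1] != 0x8b)
        return 0;                       /* ID1, ID2: magic bytes */
    if (buf[2] != 8)
        return 0;                       /* CM: 8 = deflate */
    if (buf[3] & 0xe0)
        return 0;                       /* FLG: bits 5-7 are reserved, must be 0 */
    return 1;
}
```

Each extra field narrows the set of non-gzip files that slip through. Only actually decompressing the data rules out false positives entirely.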

For just zlib streams it's somewhat harder, because they are even more prone to false positives. But you would rarely encounter those in the wild on their own.

Author by Deepak Prakash

Updated on December 15, 2020

Comments

  • Deepak Prakash over 3 years

    I have a C / C++ program which needs to read in a file that may or may not be gzip compressed. I know we can use gzread() from zlib to read in both compressed and uncompressed files - however, I want to use the zlib functions ONLY if the file is gzip compressed (for performance reasons).

    So is there any way to programmatically detect or check if a certain file is gzipped from C / C++?

  • pmg almost 13 years
    Beware endianness and byte width. Compare individual values rather than a composite: (byte1 == 0x1f) && (byte2 == 0x8b) versus first2bytes == 0x1f8b.
  • Deepak Prakash almost 13 years
    Regarding performance: there is a huge difference, 1 min (fread) vs 20 mins (gzread) for uncompressed files. It might have to do with us using an older version of zlib, but right now I'm not in a position to use the latest version, so I have to do the conditional read to work around this.