In Python, how do I decode GZIP encoding?

Solution 1

I use zlib to decompress gzipped content from the web.

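import zlib
import urllib.request

# url is assumed to be defined elsewhere.
f = urllib.request.urlopen(url)
# 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer.
decompressed_data = zlib.decompress(f.read(), 16 + zlib.MAX_WBITS)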

Solution 2

Decompress your byte stream using the built-in gzip module.
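
A minimal sketch of what that can look like in Python 3, assuming the compressed bytes are already in memory. The sample payload is made up for illustration, and your_byte_stream simply stands in for whatever bytes you received:

```python
import gzip
import io

# Stand-in for bytes received from somewhere (hypothetical sample data).
your_byte_stream = gzip.compress(b"hello, gzip")

# One-shot decompression of an in-memory byte string.
text = gzip.decompress(your_byte_stream).decode("utf-8")

# Equivalent file-like approach, handy when you already have a file object.
with gzip.GzipFile(fileobj=io.BytesIO(your_byte_stream)) as gz:
    text_from_file = gz.read().decode("utf-8")

print(text)
```

Both routes should give the same text; the decode step assumes the payload is UTF-8.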

If you have any problems, show the exact minimal code that you used, the exact error message and traceback, and the result of print(repr(your_byte_stream[:100]))

Further information

1. For an explanation of the gzip/zlib/deflate confusion, read the "Other uses" section of this Wikipedia article.

2. It can be easier to use the zlib module than the gzip module if you have a string rather than a file. Unfortunately the Python docs were, at the time of writing, incomplete/wrong:

zlib.decompress(string[, wbits[, bufsize]])

...The absolute value of wbits is the base two logarithm of the size of the history buffer (the “window size”) used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. The default value is 15. When wbits is negative, the standard gzip header is suppressed; this is an undocumented feature of the zlib library, used for compatibility with unzip’s compression file format.

Firstly, 8 <= log2_window_size <= 15, with the meaning given above. Then what should be a separate arg is kludged on top:

arg == log2_window_size means assume string is in zlib format (RFC 1950; what the HTTP 1.1 RFC 2616 confusingly calls "deflate").

arg == -log2_window_size means assume string is in deflate format (RFC 1951; what people who didn't read the HTTP 1.1 RFC carefully actually implemented)

arg == 16 + log2_window_size means assume string is in gzip format (RFC 1952). So you can use 31 (16 + 15).

The above information is documented in the zlib C library manual ... Ctrl-F search for windowBits.
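
The three wbits conventions listed above can be checked with a small round trip. This is only a sketch: the payload and the helper name compress_with are made up for illustration, and it relies on zlib.compressobj accepting the same wbits values as zlib.decompress:

```python
import zlib

payload = b"example payload " * 4  # made-up sample data

def compress_with(wbits):
    # compressobj takes the same wbits convention as decompress.
    c = zlib.compressobj(wbits=wbits)
    return c.compress(payload) + c.flush()

zlib_data = compress_with(15)         # zlib format (RFC 1950)
raw_deflate = compress_with(-15)      # raw deflate (RFC 1951)
gzip_data = compress_with(16 + 15)    # gzip format (RFC 1952)

# Each stream is decompressed with the matching wbits value.
from_zlib = zlib.decompress(zlib_data, 15)
from_raw = zlib.decompress(raw_deflate, -15)
from_gzip = zlib.decompress(gzip_data, 16 + 15)

# A mismatch fails: the default wbits expects a zlib header, not gzip.
try:
    zlib.decompress(gzip_data)
    mismatch_failed = False
except zlib.error:
    mismatch_failed = True
```

The gzip stream begins with the magic bytes 1f 8b, which is why the default zlib header check rejects it.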

Solution 3

For Python 3

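Try this:

import gzip

# opener and request are assumed to be set up elsewhere; any response object with read() works.
fetch = opener.open(request)
data = gzip.decompress(fetch.read())  # compressed bytes -> decompressed bytes
data = str(data, 'utf-8')             # decode to text, assuming the body is UTF-8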
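
The snippet above reads the whole body into memory before decompressing. If the response is large, a streaming variant may be preferable. The following is a sketch under assumptions: iter_decoded_lines is a hypothetical helper name, io.BytesIO stands in for the response object, and the body is assumed to be UTF-8 text:

```python
import gzip
import io

def iter_decoded_lines(response, encoding="utf-8"):
    """Lazily yield text lines from a gzip-compressed file-like object."""
    with gzip.GzipFile(fileobj=response) as gz:
        for raw_line in gz:
            # Decode line by line; safe for UTF-8 and other ASCII-compatible encodings.
            yield raw_line.decode(encoding).rstrip("\r\n")

# io.BytesIO stands in for an HTTP response object here.
fake_response = io.BytesIO(gzip.compress("first\nsecond\n".encode("utf-8")))
lines = list(iter_decoded_lines(fake_response))
print(lines)
```

In Python 3 a real response object with a read() method should work in place of the BytesIO stand-in, but that part is not exercised here.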

Solution 4

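I use something like this (Python 2):

import urllib2
from cStringIO import StringIO
from gzip import GzipFile

def fetch(request):  # function name is illustrative
    f = urllib2.urlopen(request)
    data = f.read()
    try:
        # Try to gunzip the body; GzipFile raises IOError if it is not gzip data.
        data = GzipFile('', 'r', 0, StringIO(data)).read()
    except IOError:
        # Not gzipped: fall back to the raw bytes.
        pass
    return data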
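
For Python 3, the same try-then-fall-back idea might be sketched as follows. The helper name maybe_gunzip is made up for illustration, and the choice of exceptions to catch is an assumption about which failures you want to treat as "not gzipped":

```python
import gzip
import zlib

def maybe_gunzip(data: bytes) -> bytes:
    """Return gunzipped data, or the input unchanged if it is not valid gzip."""
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad gzip magic number;
        # EOFError and zlib.error cover truncated or corrupt streams.
        return data

compressed = gzip.compress(b"<html>hello</html>")
print(maybe_gunzip(compressed))       # decompressed body
print(maybe_gunzip(b"already plain")) # returned unchanged
```

Catching specific exceptions rather than using a bare except keeps unrelated errors, such as a KeyboardInterrupt, from being silently swallowed.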

Solution 5

If you use the Requests module, you don't need any other modules, because gzip- and deflate-encoded responses are automatically decoded for you.

Example:

>>> import requests
>>> custom_header = {'Accept-Encoding': 'gzip'}
>>> response = requests.get('https://api.github.com/events', headers=custom_header)
>>> response.headers
{'Content-Encoding': 'gzip',...}
>>> response.text
'[{"id":"9134429130","type":"IssuesEvent","actor":{"id":3287933,...

The .text property of the response returns the body decoded to text (str).

The .content property of the response returns the body as raw bytes. In both cases the gzip decoding has already been applied.

See the Binary Response Content section on docs.python-requests.org

Author: TIMEX

Updated on August 20, 2020

Comments

  • TIMEX
    TIMEX over 3 years

    I downloaded a webpage in my python script. In most cases, this works fine.

    However, this one had a response header indicating GZIP encoding, and when I tried to print the source code of this web page, it showed up as nothing but symbols in my PuTTY session.

    How do I decode this to regular text?

  • DataGreed
    DataGreed almost 12 years
    could you please comment what the second argument for decompress is for? And why is it so...well... strange?
  • YOU
    YOU almost 12 years
    @DataGreed, Second argument is for window size, please refer to John Machin's Answer below.
  • soulmachine
    soulmachine almost 9 years
    Awesome! You saved my life :D
  • jfs
    jfs almost 9 years
    you could use GzipFile to decompress on-the-fly. Note: earlier versions of GzipFile used seek(), but that is fixed now.
  • m3nda
    m3nda almost 8 years
    @SamP I think it's because it doesn't work as is and doesn't explain anything. If you try it, you'll get a few errors for sure. That's not a valid answer at all.
  • WeizhongTu
    WeizhongTu almost 8 years
    I just like requests package, which could handle gzip automatically
  • WeizhongTu
    WeizhongTu almost 8 years
    @SamP just use urllib3.response.GzipDecoder().decompress(gzip_data)
  • Ciprian Tomoiagă
    Ciprian Tomoiagă over 7 years
    @WeizhongTu can you make requests handle gzip when it is not specified in the headers ?
  • Luc
    Luc over 4 years
    Note that the "opener" thing is optional, to decode a file you can just: data = gzip.decompress(open("my.file", "rb").read()); mystring = str(data, "utf-8")