Download and decompress gzipped file in memory?

Solution 1

You need to seek back to the beginning of compressedFile after writing to it, but before passing it to gzip.GzipFile(). Otherwise the gzip module reads from the current position, which is the end of the buffer, and the stream appears empty. See below:

#! /usr/bin/env python
import urllib2
import StringIO
import gzip

baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
filename = "man-pages-3.34.tar.gz"
outFilePath = "man-pages-3.34.tar"

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
#
# Set the file's current position to the beginning
# of the file so that gzip.GzipFile can read
# its contents from the top.
#
compressedFile.seek(0)

decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')

# Write the decompressed archive in binary mode.
with open(outFilePath, 'wb') as outfile:
    outfile.write(decompressedFile.read())
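
As an aside, the reason the seek is required is that write() leaves the stream position at the end of the buffer, so a subsequent read() returns nothing. A minimal illustration (shown with Python 3's io.BytesIO; Python 2's StringIO.StringIO behaves the same way):

import io

buf = io.BytesIO()
buf.write(b"gzip data would go here")

# write() leaves the position at the end of the buffer,
# so reading from here yields nothing.
print(buf.read())   # b''

# Rewind to the start and the data is readable again.
buf.seek(0)
print(buf.read())   # b'gzip data would go here'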

Solution 2

For those using Python 3, the equivalent answer is:

import urllib.request
import io
import gzip

response = urllib.request.urlopen(FILE_URL)
compressed_file = io.BytesIO(response.read())
decompressed_file = gzip.GzipFile(fileobj=compressed_file)

with open(OUTFILE_PATH, 'wb') as outfile:
    outfile.write(decompressed_file.read())
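
FILE_URL and OUTFILE_PATH above are placeholders. As a rough sketch of how the same pattern might be packaged (the helper name here is made up, and the example call reuses the man-pages archive from Solution 1, whose filename may have changed on the server since):

import urllib.request
import io
import gzip

def download_and_decompress(file_url, outfile_path):
    # Fetch the gzipped payload and hold it in memory.
    response = urllib.request.urlopen(file_url)
    compressed_file = io.BytesIO(response.read())
    # GzipFile decompresses as it is read.
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
    with open(outfile_path, 'wb') as outfile:
        outfile.write(decompressed_file.read())

# Example call, reusing the archive from Solution 1.
download_and_decompress(
    "https://www.kernel.org/pub/linux/docs/man-pages/man-pages-3.34.tar.gz",
    "man-pages-3.34.tar",
)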

Solution 3

If you have Python 3.2 or above, life would be much easier:

#!/usr/bin/env python3
import gzip
import urllib.request

baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
filename = "man-pages-4.03.tar.gz"
outFilePath = filename[:-3]

response = urllib.request.urlopen(baseURL + filename)
with open(outFilePath, 'wb') as outfile:
    outfile.write(gzip.decompress(response.read()))

For those who are interested in history, see https://bugs.python.org/issue3488 and https://hg.python.org/cpython/rev/3fa0a9553402.
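
Note that gzip.decompress() holds both the compressed download and the decompressed result in memory at once. For large archives, a streaming variant along the lines of the copyfileobj suggestion in the comments below may be preferable; the following is a sketch of that idea, assuming Python 3, where GzipFile can read straight from the (non-seekable) HTTP response:

#!/usr/bin/env python3
import gzip
import shutil
import urllib.request

baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
filename = "man-pages-4.03.tar.gz"
outFilePath = filename[:-3]

response = urllib.request.urlopen(baseURL + filename)
# GzipFile decompresses from the response as it is read,
# and copyfileobj moves the output to disk in fixed-size chunks.
with gzip.GzipFile(fileobj=response) as decompressed, \
        open(outFilePath, 'wb') as outfile:
    shutil.copyfileobj(decompressed, outfile)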

Comments

  • OregonTrail, almost 4 years ago

    I would like to download a file using urllib and decompress the file in memory before saving.

    This is what I have right now:

    response = urllib2.urlopen(baseURL + filename)
    compressedFile = StringIO.StringIO()
    compressedFile.write(response.read())
    decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
    outfile = open(outFilePath, 'w')
    outfile.write(decompressedFile.read())
    

    This ends up writing empty files. How can I achieve what I'm after?

    Updated Answer:

    #! /usr/bin/env python2
    import urllib2
    import StringIO
    import gzip
    
    baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"        
    # check filename: it may change over time, due to new updates
    filename = "man-pages-5.00.tar.gz" 
    outFilePath = filename[:-3]
    
    response = urllib2.urlopen(baseURL + filename)
    compressedFile = StringIO.StringIO(response.read())
    decompressedFile = gzip.GzipFile(fileobj=compressedFile)
    
    with open(outFilePath, 'w') as outfile:
        outfile.write(decompressedFile.read())
    
  • OregonTrail, about 11 years ago
    Turns out I could have taken advantage of StringIO's __init__, see updated question.
  • crayzeewulf, about 11 years ago
    Yeah. That works even better. :) I will leave my answer unedited as you've already added the updated answer. Thanks.
  • jfs, almost 9 years ago
    @OregonTrail: or you could cut out the middleman and pass response directly. btw, don't put answers into the question; you are encouraged to post your own answer.
  • jfs, almost 9 years ago
    it won't work: you are trying to write bytes into a text file; use binary mode instead. Try: copyfileobj(GzipFile(fileobj=response), open(outfile_path, 'wb'))
  • khamaileon, over 2 years ago
    Best solution so far.