Download and decompress gzipped file in memory?
Solution 1
You need to seek to the beginning of compressedFile after writing to it but before passing it to gzip.GzipFile(). Otherwise the gzip module will start reading at the end of the buffer, and the buffer will appear to it as an empty file. See below:
#! /usr/bin/env python
import urllib2
import StringIO
import gzip
baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
filename = "man-pages-3.34.tar.gz"
outFilePath = "man-pages-3.34.tar"
response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
#
# Set the file's current position to the beginning
# of the file so that gzip.GzipFile can read
# its contents from the top.
#
compressedFile.seek(0)
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
# Write in binary mode: the decompressed data is bytes, not text.
with open(outFilePath, 'wb') as outfile:
    outfile.write(decompressedFile.read())
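The effect of the missing seek can be reproduced without a network connection. The following is a minimal sketch for Python 3, using io.BytesIO in place of StringIO.StringIO and a made-up payload in place of the download; it is an illustration only, not part of the original answer.

```python
import gzip
import io

# Made-up payload standing in for the downloaded bytes.
payload = gzip.compress(b"hello, gzip")

buffer = io.BytesIO()
buffer.write(payload)

# After write(), the position is at the end of the buffer,
# so GzipFile finds nothing left to read.
position_after_write = buffer.tell()
without_seek = gzip.GzipFile(fileobj=buffer, mode="rb").read()

# Rewinding makes the compressed data visible again.
buffer.seek(0)
with_seek = gzip.GzipFile(fileobj=buffer, mode="rb").read()

print(position_after_write == len(payload))
print(without_seek)
print(with_seek)
```

The first read comes back empty, which matches the empty output files described in the question; the second read returns the original data.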
Solution 2
For those using Python 3, the equivalent answer is:
import urllib.request
import io
import gzip
response = urllib.request.urlopen(FILE_URL)
compressed_file = io.BytesIO(response.read())
decompressed_file = gzip.GzipFile(fileobj=compressed_file)
with open(OUTFILE_PATH, 'wb') as outfile:
    outfile.write(decompressed_file.read())
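Note that no seek(0) call appears here. A buffer created from initial bytes starts at position 0, unlike one that was filled with write(). A small sketch, assuming a made-up payload in place of response.read(), illustrates this:

```python
import gzip
import io

# Made-up payload standing in for response.read().
payload = gzip.compress(b"example data")

# Passing the bytes to the constructor leaves the position at 0.
compressed_file = io.BytesIO(payload)
start_position = compressed_file.tell()

decompressed = gzip.GzipFile(fileobj=compressed_file).read()
print(start_position, decompressed)
```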
Solution 3
If you have Python 3.2 or above, gzip.decompress() makes this much simpler:
#!/usr/bin/env python3
import gzip
import urllib.request
baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
filename = "man-pages-4.03.tar.gz"
outFilePath = filename[:-3]
response = urllib.request.urlopen(baseURL + filename)
with open(outFilePath, 'wb') as outfile:
    outfile.write(gzip.decompress(response.read()))
For those who are interested in history, see https://bugs.python.org/issue3488 and https://hg.python.org/cpython/rev/3fa0a9553402.
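To try the gzip.decompress() approach without downloading anything, it can be wrapped in a small helper and fed locally compressed data. This is only a sketch: the helper name gunzip_to_file and the sample payload are hypothetical and do not come from the answer. Keep in mind that this approach holds both the compressed and the decompressed data in memory at once.

```python
import gzip
import os
import tempfile


def gunzip_to_file(compressed: bytes, out_path: str) -> int:
    """Decompress gzip data held in memory and write it to out_path.

    Returns the number of decompressed bytes written.
    """
    data = gzip.decompress(compressed)
    with open(out_path, "wb") as outfile:
        outfile.write(data)
    return len(data)


# Usage with a made-up payload instead of a real download.
sample = gzip.compress(b"line one\nline two\n")
out_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
written = gunzip_to_file(sample, out_path)
print(written)
```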
Comments
-
OregonTrail almost 4 years
I would like to download a file using urllib and decompress the file in memory before saving.
This is what I have right now:
response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
outfile = open(outFilePath, 'w')
outfile.write(decompressedFile.read())
This ends up writing empty files. How can I achieve what I'm after?
Updated Answer:
#! /usr/bin/env python2
import urllib2
import StringIO
import gzip
baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
# check filename: it may change over time, due to new updates
filename = "man-pages-5.00.tar.gz"
outFilePath = filename[:-3]
response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO(response.read())
decompressedFile = gzip.GzipFile(fileobj=compressedFile)
# Binary mode, since the decompressed data is bytes
with open(outFilePath, 'wb') as outfile:
    outfile.write(decompressedFile.read())
-
OregonTrail about 11 years
Turns out I could have taken advantage of StringIO's __init__, see updated question.
-
crayzeewulf about 11 years
Yeah. That works even better. :) I will leave my answer unedited as you've already added the updated answer. Thanks.
-
jfs almost 9 years
@OregonTrail: or you could cut out the middleman and pass response directly. By the way, don't put answers into the question; you are encouraged to post your own answer.
-
jfs almost 9 years
It won't work: you are trying to write bytes into a text file; use binary mode instead. Try:
copyfileobj(GzipFile(fileobj=response), open(outfile_path, 'wb'))
-
khamaileon over 2 years
Best solution so far.
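The suggestion in the comments to pass the response straight to GzipFile and copy with copyfileobj can be sketched as below for Python 3. The helper name stream_gunzip is hypothetical, and in-memory buffers stand in for the HTTP response and the output file so that the example runs offline. The appeal of this variant is that the data is copied in chunks instead of being read into memory all at once.

```python
import gzip
import io
import shutil


def stream_gunzip(source, destination):
    """Copy decompressed data from a gzip stream to a binary file object."""
    with gzip.GzipFile(fileobj=source, mode="rb") as decompressed:
        shutil.copyfileobj(decompressed, destination)


# Usage with in-memory stand-ins; for a real download, source would be
# the object returned by urllib.request.urlopen(...) and destination
# a file opened with mode 'wb'.
source = io.BytesIO(gzip.compress(b"streamed contents"))
destination = io.BytesIO()
stream_gunzip(source, destination)
print(destination.getvalue())
```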