How do you unzip very large files in Python?
Solution 1
Here's an outline of decompression of large files.
import zipfile
import zlib

src = open(doc, "rb")  # doc is the path to the zip archive
zf = zipfile.ZipFile(src)
for m in zf.infolist():
    # Examine the header
    print("%s %d %d %r %r" % (m.filename, m.header_offset, m.compress_size, m.extra, m.comment))
    src.seek(m.header_offset)
    src.read(30)  # Fixed-size part of the local header. Good to use struct to unpack this.
    nm = src.read(len(m.filename))
    if len(m.extra) > 0: ex = src.read(len(m.extra))
    if len(m.comment) > 0: cm = src.read(len(m.comment))
    # Build a decompression object (-15 selects a raw deflate stream)
    decomp = zlib.decompressobj(-15)
    # This can be done with a loop reading blocks
    out = open(m.filename, "wb")
    result = decomp.decompress(src.read(m.compress_size))
    out.write(result)
    result = decomp.flush()
    out.write(result)
    # end of the loop
    out.close()
zf.close()
src.close()
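The remark "This can be done with a loop reading blocks" could look something like the following. This is only a minimal sketch: it assumes the source file is already positioned at the start of the member's compressed data and that the member was stored with deflate. The function name `inflate_stream` and the 64 KiB default block size are illustrative choices, not part of the answer above.

```python
import zlib


def inflate_stream(src, out, compress_size, chunk_size=64 * 1024):
    """Decompress compress_size bytes of raw deflate data from src into out.

    Hypothetical helper: src must already be positioned at the first byte
    of the member's compressed data.
    """
    decomp = zlib.decompressobj(-15)  # -15: raw deflate stream, no zlib header
    remaining = compress_size
    while remaining > 0:
        # Never read past the end of this member's data.
        block = src.read(min(chunk_size, remaining))
        if not block:
            raise EOFError("archive ended before the member's data did")
        remaining -= len(block)
        out.write(decomp.decompress(block))
    # Emit whatever the decompressor is still holding.
    out.write(decomp.flush())
```

Only one compressed block is read per iteration, so the compressed input is never held in memory all at once, whatever the size of the member.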
Solution 2
As of Python 2.6, you can use ZipFile.open() to open a file handle on a member of the archive, and copy its contents efficiently to a target file of your choosing:
import errno
import os
import shutil
import zipfile

TARGETDIR = '/foo/bar/baz'

with open(doc, "rb") as zipsrc:
    zfile = zipfile.ZipFile(zipsrc)
    for member in zfile.infolist():
        target_path = os.path.join(TARGETDIR, member.filename)
        if target_path.endswith('/'):  # folder entry, create
            try:
                os.makedirs(target_path)
            except (OSError, IOError) as err:
                # Windows may complain if the folders already exist
                if err.errno != errno.EEXIST:
                    raise
            continue
        with open(target_path, 'wb') as outfile, zfile.open(member) as infile:
            shutil.copyfileobj(infile, outfile)
This uses shutil.copyfileobj() to read data from the open zipfile member in fixed-size chunks and copy it to the output file, so the whole member never has to be held in memory.
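Some archives contain no explicit folder entries, in which case opening a nested target path fails because its parent directory does not exist yet. One way to handle that is sketched below as a reusable function. The assumptions are mine, not the answer's: it targets Python 3.2 or later (for `exist_ok`), the names `extract_streaming` and `chunk_size` are hypothetical, and member names are trusted (there is no check for `..` or absolute paths).

```python
import os
import shutil
import zipfile


def extract_streaming(zip_path, target_dir, chunk_size=1024 * 1024):
    """Extract every member of zip_path into target_dir, one chunk at a time."""
    with zipfile.ZipFile(zip_path) as zfile:
        for member in zfile.infolist():
            target_path = os.path.join(target_dir, member.filename)
            if member.filename.endswith('/'):
                # Explicit folder entry: create it and move on.
                os.makedirs(target_path, exist_ok=True)
                continue
            # The archive may omit folder entries, so create the parent here.
            parent = os.path.dirname(target_path)
            if parent:
                os.makedirs(parent, exist_ok=True)
            with zfile.open(member) as infile, open(target_path, 'wb') as outfile:
                # The third argument is the buffer size used for each read.
                shutil.copyfileobj(infile, outfile, chunk_size)
```

A call such as `extract_streaming('big.zip', '/foo/bar/baz')` would copy at most `chunk_size` bytes per read, however large the members are.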
Marc Novakowski
Updated on June 06, 2022
Comments
- Marc Novakowski almost 2 years
  Using Python 2.4 and the built-in ZipFile library, I cannot read very large zip files (greater than 1 or 2 GB) because it wants to store the entire contents of the uncompressed file in memory. Is there another way to do this (either with a third-party library or some other hack), or must I "shell out" and unzip it that way (which isn't as cross-platform, obviously)?
- Jonathan almost 7 years
  @s-lott What do ex= src.read( len(m.extra) ) and cm= src.read( len(m.comment) ) do, and what do you use the variables ex and cm for? What do you mean it's good to use a struct to unpack this? And what is the magic number 30 used for?
- Benjamin about 4 years
  The header for each file contains the name of the file at a relative offset of 30 bytes, see en.wikipedia.org/wiki/Zip_(file_format). The extra and comment fields are not relevant, other than that we have to read those bytes to move ahead to the right position.
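To make the "magic number 30" and the struct suggestion concrete, here is a rough sketch of unpacking the fixed-size part of a ZIP local file header. The field layout follows the ZIP format as I understand it. The names `LOCAL_HEADER`, `FIELDS` and `read_local_header` are made up for illustration. In the ZIP format the per-file comment is stored only in the central directory, so this sketch skips just the name and the extra field, using the lengths recorded in the local header itself (these can differ from the central directory's values).

```python
import struct

# Fixed 30-byte part of a local file header, little-endian:
# 4-byte signature, five 2-byte fields, three 4-byte fields, two 2-byte lengths.
LOCAL_HEADER = struct.Struct("<4sHHHHHLLLHH")
FIELDS = (
    "signature", "version_needed", "flags", "method", "mod_time", "mod_date",
    "crc32", "compressed_size", "uncompressed_size", "name_length", "extra_length",
)


def read_local_header(src, header_offset):
    """Parse the local header at header_offset and locate the member's data."""
    src.seek(header_offset)
    header = dict(zip(FIELDS, LOCAL_HEADER.unpack(src.read(LOCAL_HEADER.size))))
    if header["signature"] != b"PK\x03\x04":
        raise ValueError("no local file header at offset %d" % header_offset)
    header["name"] = src.read(header["name_length"])
    src.seek(header["extra_length"], 1)  # skip the extra field
    header["data_offset"] = src.tell()   # compressed data starts here
    return header
```

Calling `read_local_header(src, m.header_offset)` for a `ZipInfo` object `m` leaves `src` positioned at the first byte of that member's compressed data.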