How do you unzip very large files in Python?

python compression zip unzip


Solution 1

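Here's an outline for decompressing large files: locate each member's compressed data in the archive yourself, then stream it through a zlib decompression object.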

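import struct
import zipfile
import zlib

CHUNK = 1024 * 1024  # read compressed data in 1 MiB blocks

src = open(doc, "rb")  # doc is the path to the zip archive
zf = zipfile.ZipFile(src)
for m in zf.infolist():
    if m.filename.endswith("/"):
        continue  # directory entry, no data to extract

    # Examine the header
    print(m.filename, m.header_offset, m.compress_size, repr(m.extra), repr(m.comment))
    src.seek(m.header_offset)
    # The local file header starts with 30 fixed bytes; its last two fields
    # are the lengths of the file name and extra field that follow it.
    fixed = struct.unpack("<4s5H3L2H", src.read(30))
    nm = src.read(fixed[-2])
    ex = src.read(fixed[-1])
    # The member comment is stored only in the central directory, so there
    # is nothing to skip for it here.

    # Build a raw-deflate decompression object (assumes a deflate-compressed member)
    decomp = zlib.decompressobj(-15)

    # Read and decompress in blocks so memory use stays bounded
    out = open(m.filename, "wb")
    remaining = m.compress_size
    while remaining > 0:
        block = src.read(min(CHUNK, remaining))
        if not block:
            break  # truncated archive
        remaining -= len(block)
        out.write(decomp.decompress(block))
    out.write(decomp.flush())
    out.close()

zf.close()
src.close()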
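
The 30-byte read in the outline above is the fixed part of the ZIP local file header. As a minimal sketch of what "use struct to unpack this" could look like, the helper below computes where a member's compressed data starts. The names LOCAL_HEADER and data_offset are hypothetical, chosen for illustration, and the usage builds a small in-memory archive rather than a real multi-gigabyte file.

```python
import io
import struct
import zipfile

# Fixed part of a ZIP local file header: 30 bytes, little-endian.
# Fields: signature, version, flags, method, mod time, mod date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")


def data_offset(fileobj, header_offset):
    """Return the file offset at which a member's compressed data begins."""
    fileobj.seek(header_offset)
    fields = LOCAL_HEADER.unpack(fileobj.read(LOCAL_HEADER.size))
    signature, name_len, extra_len = fields[0], fields[-2], fields[-1]
    if signature != b"PK\x03\x04":
        raise ValueError("not a local file header")
    # The name and extra field follow the fixed part; the data comes after them.
    return header_offset + LOCAL_HEADER.size + name_len + extra_len


# Usage: build a small archive in memory and locate its only member's data.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("hello.txt", b"hello world" * 100)
info = zipfile.ZipFile(buf).infolist()[0]
start = data_offset(buf, info.header_offset)
```

Taking the lengths from the local header itself, rather than from the ZipInfo fields, matters because the extra field in the local header can differ in length from the one recorded in the central directory.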

Solution 2

As of Python 2.6, you can use ZipFile.open() to get a file-like object for a single archive member and copy its contents to a target file of your choosing, without loading the whole member into memory:

import errno
import os
import shutil
import zipfile

TARGETDIR = '/foo/bar/baz'

with open(doc, "rb") as zipsrc:
    zfile = zipfile.ZipFile(zipsrc)
    for member in zfile.infolist():
        target_path = os.path.join(TARGETDIR, member.filename)
        if target_path.endswith('/'):  # folder entry, create
            try:
                os.makedirs(target_path)
            except (OSError, IOError) as err:
                # Windows may complain if the folders already exist
                if err.errno != errno.EEXIST:
                    raise
            continue
        with open(target_path, 'wb') as outfile, zfile.open(member) as infile:
            shutil.copyfileobj(infile, outfile)

This uses shutil.copyfileobj() to read from the open zipfile member in fixed-size chunks and write each chunk to the output file, so memory use stays small regardless of the member's size.
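
The same approach can be wrapped in a reusable function. The following is a sketch under a few assumptions: it targets Python 3 (for exist_ok), the name extract_streaming is hypothetical, and it adds two things the snippet above does not do, namely creating parent folders for archives that lack explicit folder entries and rejecting member names that would resolve outside the target directory.

```python
import os
import shutil
import zipfile


def extract_streaming(zip_path, target_dir):
    """Extract all members of zip_path into target_dir, streaming each one."""
    root = os.path.realpath(target_dir)
    extracted = []
    with zipfile.ZipFile(zip_path) as zfile:
        for member in zfile.infolist():
            target_path = os.path.realpath(os.path.join(root, member.filename))
            # Refuse names such as "../x" that would land outside target_dir.
            if target_path != root and not target_path.startswith(root + os.sep):
                raise ValueError("unsafe path in archive: %r" % member.filename)
            if member.filename.endswith("/"):  # folder entry, create
                os.makedirs(target_path, exist_ok=True)
                continue
            # Some archives contain files without entries for their folders.
            os.makedirs(os.path.dirname(target_path), exist_ok=True)
            with zfile.open(member) as infile, open(target_path, "wb") as outfile:
                # Copies chunk by chunk, so the member is never fully in memory.
                shutil.copyfileobj(infile, outfile)
            extracted.append(target_path)
    return extracted
```

A call such as extract_streaming(doc, TARGETDIR) returns the list of files it wrote.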


Author by

Marc Novakowski


Updated on June 06, 2022

Comments

Marc Novakowski almost 2 years

Using Python 2.4 and the built-in zipfile module, I cannot read very large zip files (greater than 1 or 2 GB) because it wants to store the entire contents of the uncompressed file in memory. Is there another way to do this (either with a third-party library or some other hack), or must I "shell out" and unzip it that way (which isn't as cross-platform, obviously)?

Jonathan almost 7 years

@s-lott What do ex= src.read( len(m.extra) ) and cm= src.read( len(m.comment) ) do, and what do you use the variables ex and cm for? What do you mean by "good to use struct to unpack this"? And what is the magic number 30 used for?

Benjamin about 4 years

The header for each file contains the name of the file at a relative offset of 30 bytes; see en.wikipedia.org/wiki/Zip_(file_format). The extra and comment fields are not relevant, other than that we have to read those bytes to move ahead to the right position.