Extract zip to memory, parse contents

Solution 1

Thank you to everyone that contributed solutions. This is what ended up working for me:

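import fnmatch
import re
from zipfile import ZipFile

with ZipFile('name.zip', 'r') as zfile:
    for name in zfile.namelist():
        if fnmatch.fnmatch(name, '*_readme.xml'):
            with zfile.open(name) as zopen:  # file-like object, nothing is written to disk
                for line in zopen:
                    # Python 3: lines come back as bytes, so the pattern is bytes too
                    if re.match(rb'(.*)<foo>(.*)</foo>(.*)', line):
                        print(line.decode())  # assumes the member is UTF-8 text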
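Since the matching members are XML, an alternative to scanning lines with a regular expression is to hand the open member to an XML parser. The following is only a sketch of that swapped-in approach using the standard library's `xml.etree.ElementTree`; the function name `foo_texts`, the helper `make_demo_zip`, and the archive contents are made up for illustration, and the demo archive is built in memory so the example runs on its own.

```python
import fnmatch
import io
import xml.etree.ElementTree as ET
import zipfile


def foo_texts(zip_source, name_pattern="*_readme.xml", tag="foo"):
    """Return the text of every <tag> element in members matching name_pattern."""
    found = []
    with zipfile.ZipFile(zip_source) as zfile:
        for name in zfile.namelist():
            if fnmatch.fnmatch(name, name_pattern):
                # zfile.open() gives a file-like object; nothing is extracted to disk.
                with zfile.open(name) as member:
                    root = ET.parse(member).getroot()
                found.extend(element.text for element in root.iter(tag))
    return found


def make_demo_zip():
    """Build a small, hypothetical archive in memory for the demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("pkg_readme.xml",
                    "<root><foo>hello</foo><bar/><foo>world</foo></root>")
        zf.writestr("notes.txt", "<foo>ignored</foo>")  # name does not match
    buf.seek(0)
    return buf


print(foo_texts(make_demo_zip()))  # ['hello', 'world']
```

Unlike the line-by-line regex, this also finds `<foo>` elements that span several lines, at the cost of parsing each matching file completely.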

Solution 2

IMO just using read is enough:

import fnmatch
from zipfile import ZipFile

zfile = ZipFile('name.zip', 'r')
files = []
for name in zfile.namelist():
  if fnmatch.fnmatch(name, '*_readme.xml'):
    files.append(zfile.read(name)) # read() returns the member's contents as bytes

This builds a list with the contents of every file that matches the pattern.

You can then parse the contents afterwards by iterating through the list:

for contents in files:
  print(contents[:35].decode()) # "parsing": show the first 35 bytes

Or better, pass each file to a function as soon as it is read:

import fnmatch
import sys
import zipfile

zip_name = sys.argv[1]
zfile = zipfile.ZipFile(zip_name, 'r')

def parse(contents, member_name=""):
  if member_name:
    print("Parsed `{}`:".format(member_name))
  print(contents[:35].decode()) # "parsing"

for name in zfile.namelist():
  if fnmatch.fnmatch(name, '*.cpp'):
    parse(zfile.read(name), name)

This way only one file's contents are held in memory at a time, so the memory footprint is smaller. That can matter if the files are big.
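If a single member is itself large, even one `read()` call pulls all of it into memory. A possible refinement, sketched here under the assumption that the processing can work on pieces of the file, is to read the open member in fixed-size chunks. Hashing is used purely as a stand-in for "processing"; `member_sha256`, `make_demo_zip`, and the file name `big.bin` are hypothetical names for this example.

```python
import hashlib
import io
import zipfile


def member_sha256(zfile, member_name, chunk_size=64 * 1024):
    """Hash one archive member without loading it into memory in one piece."""
    digest = hashlib.sha256()
    with zfile.open(member_name) as member:
        # read(n) returns at most n bytes and b"" once the member is exhausted.
        for chunk in iter(lambda: member.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_demo_zip(payload):
    """Build a hypothetical one-member archive in memory for the demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("big.bin", payload)
    buf.seek(0)
    return buf


payload = b"x" * 200_000
with zipfile.ZipFile(make_demo_zip(payload)) as zfile:
    # The chunked hash should agree with hashing the payload in one go.
    print(member_sha256(zfile, "big.bin") == hashlib.sha256(payload).hexdigest())
```

At any moment only one chunk of `chunk_size` bytes is held, regardless of how large the member is.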

Solution 3

Don't overthink it. It Just Works:

import zipfile

# 1) I want to read the contents of a zip file ...
with zipfile.ZipFile('A-Zip-File.zip') as zipper:
  # 2) ... find a particular file in the archive, open the file ...
  with zipper.open('A-Particular-File.txt') as fp:
    # 3) ... and extract a line from it.
    first_line = fp.readline()

print(first_line)  # bytes in Python 3; use first_line.decode() to get text
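In Python 3 the line read this way is a `bytes` object. If text is wanted, one option is to wrap the open member in `io.TextIOWrapper`, which decodes while reading. This is a minimal sketch, assuming the member is UTF-8 text; `first_text_line` is a hypothetical helper name, and the demo archive is created in memory so the snippet is self-contained.

```python
import io
import zipfile


def first_text_line(zip_source, member_name, encoding="utf-8"):
    """Return the first line of a text member as str, without its newline."""
    with zipfile.ZipFile(zip_source) as zipper:
        # TextIOWrapper decodes the bytes coming from the archive member.
        with io.TextIOWrapper(zipper.open(member_name), encoding=encoding) as fp:
            return fp.readline().rstrip("\n")


# Hypothetical archive, built in memory for the demonstration.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("A-Particular-File.txt", "first line\nsecond line\n")
demo.seek(0)

print(first_text_line(demo, "A-Particular-File.txt"))  # first line
```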

Solution 4

The question you link shows you that you need to read the file. Depending on your use case that may already be enough. In your code you replace the loop variable holding a filename with an empty string buffer. Try something like this:

import fnmatch
from zipfile import ZipFile

zfile = ZipFile('name.zip', 'r')

for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        ex_file = zfile.open(name) # this is a file-like object
        content = ex_file.read() # the whole file as a single string (bytes in Python 3)

If you really want a buffer that you can manipulate, then simply instantiate it with the contents:

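buf = StringIO(zfile.open(name).read()) # Python 2; in Python 3 read() returns bytes, so use io.BytesIO here

You may also want to look at BytesIO, and note that there are differences between Python 2 and 3: in Python 3 the data read from the archive is bytes, so it has to be decoded before it can go into a StringIO.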
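To make the Python 3 side concrete, here is a small sketch of both buffer types. It assumes the zip archive itself is already in memory as bytes (for example after a download) and that the member is UTF-8 text; `member_buffers` and the archive contents are invented for the example.

```python
import io
import zipfile

# Build a small, hypothetical archive entirely in memory.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("pkg_readme.xml", "<foo>bar</foo>\n<baz/>\n")
zip_bytes = archive.getvalue()


def member_buffers(zip_bytes, member_name, encoding="utf-8"):
    """Return (BytesIO, StringIO) buffers holding one member's contents."""
    # ZipFile accepts any seekable file-like object, so BytesIO works as the archive.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zfile:
        data = zfile.read(member_name)  # bytes in Python 3
    return io.BytesIO(data), io.StringIO(data.decode(encoding))


byte_buf, text_buf = member_buffers(zip_bytes, "pkg_readme.xml")
print(byte_buf.read(5))            # raw bytes from the start of the member
print(text_buf.readline(), end="") # first line as text
```

Both buffers support `seek()`, `read()`, and `readline()`, so they can be passed to code that expects an open file.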

Author: Captain Caveman

Updated on July 12, 2022

Comments

  • Captain Caveman, almost 2 years ago

    I want to read the contents of a zip file into memory rather than extracting them to disc, find a particular file in the archive, open the file and extract a line from it.

    Can a StringIO instance be opened and parsed? Suggestions? Thanks in advance.

    zfile = ZipFile('name.zip', 'r')

    for name in zfile.namelist():
        if fnmatch.fnmatch(name, '*_readme.xml'):
            name = StringIO.StringIO()
            print name # prints StringIO instances
            open(name, 'r')  # IO Error: No such file or directory...
    

    I found a few similar posts, but none that seem to address this issue: Extracting a zipfile to memory?