Reading a huge Zip file in java - Out of Memory Error

Solution 1

It is very unlikely that you get an OutOfMemoryError just from processing a ZIP file. The Java classes ZipFile and ZipEntry don't contain anything that could possibly fill up 613 MB of memory.

What could exhaust your memory is keeping the decompressed files of the ZIP archive in memory, or - even worse - keeping them as an XML DOM, which is very memory intensive.

Switching to another ZIP library will hardly help. Instead, you should look into changing your code so that it processes the ZIP archive and the contained files as streams, keeping only a limited part of each file in memory at a time.
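
For illustration, a minimal sketch of that streaming approach (the copyEntry helper, its buffer size, and the OutputStream destination are my own assumptions, not code from your question):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    // Streams one entry's decompressed data through a small fixed-size buffer,
    // so at most 8 KB of file content is held in memory at any moment.
    static void copyEntry(ZipFile zip, ZipEntry ze, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        try (InputStream in = zip.getInputStream(ze)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }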

BTW: It would be nice if you could provide more information about the huge ZIP files (do they contain many small files or few large files?) and about what you do with each ZIP entry.

Update:

Thanks for the additional information. It looks like you keep the contents of the ZIP file in memory (although it somewhat depends on the implementation of the S3Object class, which I don't know).

It's probably best to implement some sort of batching, as you propose yourself. You could, for example, add up the decompressed size of each ZIP entry and upload the files every time the total size exceeds 100 MB.
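
A sketch of that batching idea, reusing the loop from your question (uploadBatch is a hypothetical helper standing in for whatever upload call you use; note that ZipEntry.getSize() returns the decompressed size, or -1 if it isn't recorded):

    Enumeration<? extends ZipEntry> zes = zip.entries();
    List<S3Object> batch = new ArrayList<>();
    long batchBytes = 0;
    final long LIMIT = 100L * 1024 * 1024; // flush roughly every 100 MB of decompressed data

    while (zes.hasMoreElements()) {
        ZipEntry ze = zes.nextElement();
        S3Object s3Object = new S3Object(bkp.getCompanyFolder() + map.get(ze.getName()).getRelativeLoc());
        s3Object.setDataInputStream(zip.getInputStream(ze));
        // ... storage class, metadata, content type as in your question ...
        batch.add(s3Object);

        long size = ze.getSize();             // decompressed size, -1 if unknown
        batchBytes += (size >= 0 ? size : 0);
        if (batchBytes > LIMIT) {
            uploadBatch(batch);               // hypothetical: upload these objects so their streams can be released
            batch.clear();
            batchBytes = 0;
        }
    }
    if (!batch.isEmpty()) {
        uploadBatch(batch);                   // upload the final partial batch
    }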

Solution 2

You're using the ZipFile class now, as I see. Using ZipInputStream would probably be a better option, because it has a closeEntry() method which (I hope) releases the memory resources used by an entry. But I haven't used it before; it's just a guess.
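
A minimal sketch of the ZipInputStream approach (assuming zipFile points at your archive; what you do with each entry's data is up to you):

    import java.io.FileInputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    try (ZipInputStream zin = new ZipInputStream(new FileInputStream(zipFile))) {
        ZipEntry ze;
        while ((ze = zin.getNextEntry()) != null) {
            // read the current entry's data from zin here, streaming it onward
            zin.closeEntry(); // finishes the current entry before moving to the next
        }
    }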

Comments

  • sethu over 1 year

    I am reading a ZIP file using Java as below:

    Enumeration<? extends ZipEntry> zes = zip.entries();
    while (zes.hasMoreElements()) {
        ZipEntry ze = zes.nextElement();
        // do stuff..
    }
    

    I am getting an out of memory error; the zip file is about 160 MB. The stack trace is below:

    Exception in thread "Timer-0" java.lang.OutOfMemoryError: Java heap space
    at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:88)
    at java.util.zip.ZipFile$1.<init>(ZipFile.java:229)
    at java.util.zip.ZipFile.getInputStream(ZipFile.java:229)
    at java.util.zip.ZipFile.getInputStream(ZipFile.java:197)
    at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.zipFilePass2(DatToInsertDBBatch.java:250)
    at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.processCompany(DatToInsertDBBatch.java:206)
    at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.run(DatToInsertDBBatch.java:114)
    at java.util.TimerThread.mainLoop(Timer.java:534)
    at java.util.TimerThread.run(Timer.java:484)
    

    How do I enumerate the contents of a big zip file without having to increase my heap size? Also, when I don't enumerate the contents and just access a single file like this:

    ZipFile zip = new ZipFile(zipFile);
    ZipEntry ze = zip.getEntry("docxml.xml");
    

    Then I don't get an out of memory error. Why does this happen? How does a ZipFile handle zip entries? The other option would be to use a ZipInputStream. Would that have a smaller memory footprint? I would eventually need to run this code on a micro EC2 instance on the Amazon cloud (613 MB RAM).

    EDIT: providing more information on how I process the zip entries after I get them

    Enumeration<? extends ZipEntry> zes = zip.entries();
    while (zes.hasMoreElements()) {
        ZipEntry ze = zes.nextElement();
        S3Object s3Object = new S3Object(bkp.getCompanyFolder() + map.get(ze.getName()).getRelativeLoc());
        s3Object.setDataInputStream(zip.getInputStream(ze));
        s3Object.setStorageClass(S3Object.STORAGE_CLASS_REDUCED_REDUNDANCY);
        s3Object.addMetadata("x-amz-server-side-encryption", "AES256");
        s3Object.setContentType(Mimetypes.getInstance().getMimetype(s3Object.getKey()));
        s3Object.setContentDisposition("attachment; filename=" + FilenameUtils.getName(s3Object.getKey()));
        s3objs.add(s3Object);
    }
    

    I get the input stream from the zip entry and store that in the S3Object. I collect all the S3Objects in a list and then finally upload them to Amazon S3. For those who don't know Amazon S3, it's a file storage service. You upload the file via HTTP.

    I am thinking maybe this is happening because I collect all the individual input streams? Would it help if I batched it up, say 100 input streams at a time? Or would it be better if I unzipped the archive first and then used the unzipped files for the upload rather than storing streams?

    • Kieren Johnstone almost 12 years
      The Micro EC2 instance type will not be suitable for unzipping large files. It supports only very brief periods of CPU work. If the unzipping takes longer than 2-5sec, then quite simply on a Micro instance this Will Not Work(tm). [They're suited to quick and simple web request handling only, really: even something like installing the .NET framework takes ~30min because it uses CPU]
    • sethu almost 12 years
      Kieren, at the moment I am running it on a local Ubuntu server which has 2 GB RAM :). If it fails here, I am sure it won't work on the micro instance, hence the question. But once I fix it using Codo's suggestion, do you still think there might be an issue? All I am doing is downloading a zip file from S3, unzipping it and uploading it back to S3 in a Java batch program. Would that be CPU intensive? Also, I am running Tomcat and a MySQL DB on that same instance. Would it become that bad?
    • Kieren Johnstone almost 12 years
      If it takes less than 10 seconds you are going to be OK, unless you need to do it frequently. If it takes more than 10, the CPU available to the instance will be cut very short, and it will probably take a few minutes, slowing down the whole instance.
    • sethu almost 12 years
      Processing each zip file will take less than 10s for sure, but there will be many of them that need to be processed at a time. I understand what you are saying, but either way I don't have a choice. Can't afford a small instance, so I will have to live with it as long as the instance doesn't slow down too much and it stays bearable. With my batch running it was making the CPU go up to 10% (checked by running top). I think that's okay, isn't it?
  • sethu almost 12 years
    Unfortunately, I can't do that because my RAM size is limited. A max of 613 MB actually.
  • Craig van Wilson almost 12 years
    The Java Tutorial has a nice section on Zip file handling. java.sun.com/developer/technicalArticles/Programming/…
  • sethu almost 12 years
    Thanks for the answer. I am sure you are right. I have edited my question with more information on how I am processing the zip file. Could you please check that?
  • sethu almost 12 years
    Also, the number of files is large but each file is small, max size 5 MB. Mainly small PDF forms and Excel and DOC documents.
  • Coke over 10 years
    The Java Tutorial link is now to a generic Oracle Java page. Anyone have an updated URL?