java.util.zip - ZipInputStream vs. ZipFile

Solution 1

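Q1: Yes. ZipInputStream reads the archive sequentially from the start, and ZipOutputStream writes each entry as it is added, so the entries come back in the same order in which they were added.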
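The ordering claim can be checked with a small round trip. This is only a sketch: the class name `ZipOrderDemo`, its helper methods, and the entry names are made up for illustration, and it only covers archives produced by ZipOutputStream, not archives rewritten by other tools.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of the original question or answers.
public class ZipOrderDemo {

    // Writes one small entry per name, in the given order, into an in-memory zip.
    static byte[] writeZip(List<String> names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(("<component name=\"" + name + "\"/>")
                        .getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        // The stream is closed here, so the central directory has been written.
        return bytes.toByteArray();
    }

    // Returns the entry names in the order ZipInputStream encounters them.
    static List<String> readNames(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis =
                     new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Deliberately not in alphabetical order.
        List<String> written = Arrays.asList("type-b/2.xml", "type-a/1.xml", "type-b/1.xml");
        List<String> read = readNames(writeZip(written));
        System.out.println("written: " + written);
        System.out.println("read:    " + read);
    }
}
```

If the import depends on this order, it may still be safer to encode the order explicitly (for example in the entry names), since other tools can repack an archive in a different order.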

Q2: Note that because of the structure of zip archives and the compression, none of the solutions is truly streaming; they all do some level of buffering. If you check the JDK sources, the implementations share most of their code. There is no real random access within an entry's content, although the index does allow finding the chunk that corresponds to each entry. So I don't think there should be a meaningful performance difference, especially as the OS caches disk blocks anyway. You may want to verify this with a simple test case.
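A simple test case along those lines could look like the sketch below. The class name `ZipReadComparison` and the command-line convention (mode first, then path) are assumptions made for this example. Each run times only one approach, so the two can be measured in separate JVM runs without one warming up caches for the other.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

// Hypothetical timing harness, not taken from the JDK or from the answers.
public class ZipReadComparison {

    // Reads every entry via ZipFile.getInputStream; returns total uncompressed bytes.
    static long readWithZipFile(File file) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (ZipFile zip = new ZipFile(file)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                try (InputStream in = zip.getInputStream(entries.nextElement())) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        total += n;
                    }
                }
            }
        }
        return total;
    }

    // Reads every entry sequentially via ZipInputStream; returns total uncompressed bytes.
    static long readWithZipInputStream(File file) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (ZipInputStream zis = new ZipInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (zis.getNextEntry() != null) {
                int n;
                while ((n = zis.read(buf)) != -1) {
                    total += n;
                }
            }
        }
        return total;
    }

    // Usage (assumed convention): java ZipReadComparison zipfile|stream archive.zip
    public static void main(String[] args) throws IOException {
        File file = new File(args[1]);
        long start = System.nanoTime();
        long bytes = "zipfile".equals(args[0])
                ? readWithZipFile(file)
                : readWithZipInputStream(file);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.println(args[0] + ": " + bytes + " bytes in " + seconds + " s");
    }
}
```

A single run like this is only a rough indication; repeating each mode several times on a realistic archive gives a more trustworthy number.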

Q3: I would not count on this; most likely they aren't thread-safe. If you really think concurrent access would help (it might, since decompression is CPU-bound), I'd try reading the whole file into memory, exposing it via ByteArrayInputStream, and constructing multiple independent readers.
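A minimal sketch of that idea follows, assuming the archive fits in memory. The class name `ParallelZipReadSketch`, the method `entrySizes`, and the round-robin split of entries between workers are all choices made for this example. Each worker has its own ZipInputStream, so no stream is shared between threads. One caveat: ZipInputStream has to read past every entry it skips, so each worker still walks the whole archive, and whether this pays off needs to be measured.

```java
import java.io.ByteArrayInputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch; names and work-splitting strategy are assumptions.
public class ParallelZipReadSketch {

    // Returns the uncompressed size of each entry, computed by several workers.
    static Map<String, Integer> entrySizes(byte[] zipBytes, int workers) throws Exception {
        Map<String, Integer> sizes = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Void>> futures = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int worker = w;
                Callable<Void> task = () -> {
                    // Private reader over the shared, read-only byte array.
                    try (ZipInputStream zis =
                                 new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
                        byte[] buf = new byte[8192];
                        ZipEntry entry;
                        int index = 0;
                        while ((entry = zis.getNextEntry()) != null) {
                            // Round-robin: this worker handles every workers-th entry.
                            if (index++ % workers != worker) {
                                continue;
                            }
                            int total = 0;
                            int n;
                            while ((n = zis.read(buf)) != -1) {
                                total += n;
                            }
                            sizes.put(entry.getName(), total);
                        }
                    }
                    return null;
                };
                futures.add(pool.submit(task));
            }
            for (Future<Void> future : futures) {
                future.get(); // rethrows any worker failure
            }
        } finally {
            pool.shutdown();
        }
        return sizes;
    }

    public static void main(String[] args) throws Exception {
        byte[] zipBytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(entrySizes(zipBytes, 4));
    }
}
```

Because the workers finish in no particular order, this only suits processing where the import order does not matter, or where results are reordered afterwards.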

Solution 2

I measured that just listing the files with ZipInputStream is 8 times slower than with ZipFile.

    long t = System.nanoTime();
    ZipFile zip = new ZipFile(jarFile);
    Enumeration<? extends ZipEntry> entries = zip.entries();
    while (entries.hasMoreElements())
    {
        ZipEntry entry = entries.nextElement();

        String filename = entry.getName();
        if (!filename.startsWith(JAR_TEXTURE_PATH))
            continue;

        textureFiles.add(filename);
    }
    zip.close();
    System.out.println((System.nanoTime() - t) / 1e9);

and

    long t = System.nanoTime();
    ZipInputStream zip = new ZipInputStream(new FileInputStream(jarFile));
    ZipEntry entry;
    while ((entry = zip.getNextEntry()) != null)
    {
        String filename = entry.getName();
        if (!filename.startsWith(JAR_TEXTURE_PATH))
            continue;

        textureFiles.add(filename);
    }
    zip.close();
    System.out.println((System.nanoTime() - t) / 1e9);

(Don't run them in the same class. Make two different classes and run them separately)

Solution 3

Regarding Q3, experience in JENKINS-14362 suggests that zlib is not thread-safe even when operating on unrelated streams, i.e. that it has some improperly shared static state. Not proven, just a warning.

Author: Lachezar Balev

Updated on June 18, 2022

Comments

  • Lachezar Balev
    Lachezar Balev almost 2 years

    I have some general questions regarding the java.util.zip library. What we basically do is an import and an export of many small components. Previously these components were imported and exported using a single big file, e.g.:

    <component-type-a id="1"/>
    <component-type-a id="2"/>
    <component-type-a id="N"/>
    
    <component-type-b id="1"/>
    <component-type-b id="2"/>
    <component-type-b id="N"/>
    

    Please note that the order of the components during import is relevant.

    Now every component should occupy its own file which should be externally versioned, QA-ed, bla, bla. We decided that the output of our export should be a zip file (with all these files in) and the input of our import should be a similar zip file. We do not want to explode the zip in our system. We do not want to open separate streams for each of the small files. My current questions:

    Q1. Does ZipInputStream guarantee that the zip entries (the little files) will be read in the same order in which they were inserted by our export that uses ZipOutputStream? I assume reading is something like:

    
    ZipInputStream zis = new ZipInputStream(new BufferedInputStream(fis));
    ZipEntry entry;
    while((entry = zis.getNextEntry()) != null) 
    {
           //read from zis until available
    }
    

    I know that the central zip directory is put at the end of the zip file but nevertheless the file entries inside have sequential order. I also know that relying on the order is an ugly idea but I just want to have all the facts in mind.

    Q2. If I use ZipFile (which I prefer) what is the performance impact of calling getInputStream() hundreds of times? Will it be much slower than the ZipInputStream solution? The zip is opened only once and ZipFile is backed by RandomAccessFile - is this correct? I assume reading is something like:

    
    ZipFile zipfile = new ZipFile(argv[0]);
    Enumeration<? extends ZipEntry> e = zipfile.entries(); // TODO: assure the order of the entries
    while (e.hasMoreElements()) {
        ZipEntry entry = e.nextElement();
        InputStream is = zipfile.getInputStream(entry);
    }
    

    Q3. Are the input streams retrieved from the same ZipFile thread safe (e.g. may I read different entries in different threads simultaneously)? Any performance penalties?

    Thanks for your answers!

  • Lachezar Balev
    Lachezar Balev over 13 years
    Hi StaxMan! I was just checking the implementation of ZipFile$ZipFileInputStream in JDK6, which is what ZipFile.getInputStream returns. It has synchronization, though I really do not know how reliable that is.
  • StaxMan
    StaxMan over 13 years
    Yeah, I can't say for sure it is non-thread-safe. One more dangerous part is the underlying native zlib library, which I suspect is not thread-safe.
  • Joel
    Joel over 13 years
    I can testify to the fact that it's not threadsafe, through painful experience.
  • rogerdpack
    rogerdpack about 6 years
    My hunch is ZipFile is reading the zip index while ZipInputStream is "looping through" the entire zip file reading one file after another, FWIW.