Compress directory to tar.gz with Commons Compress

32,137

Solution 1

I haven't figured out what exactly was going wrong but a scouring of google caches I found a working example. Sorry for the tumbleweed!

public void CreateTarGZ()
    throws FileNotFoundException, IOException
{
    try {
        System.out.println(new File(".").getAbsolutePath());
        dirPath = "parent/childDirToCompress/";
        tarGzPath = "archive.tar.gz";
        fOut = new FileOutputStream(new File(tarGzPath));
        bOut = new BufferedOutputStream(fOut);
        gzOut = new GzipCompressorOutputStream(bOut);
        tOut = new TarArchiveOutputStream(gzOut);
        addFileToTarGz(tOut, dirPath, "");
    } finally {
        tOut.finish();
        tOut.close();
        gzOut.close();
        bOut.close();
        fOut.close();
    }
}

private void addFileToTarGz(TarArchiveOutputStream tOut, String path, String base)
    throws IOException
{
    File f = new File(path);
    System.out.println(f.exists());
    String entryName = base + f.getName();
    TarArchiveEntry tarEntry = new TarArchiveEntry(f, entryName);
    tOut.putArchiveEntry(tarEntry);

    if (f.isFile()) {
        IOUtils.copy(new FileInputStream(f), tOut);
        tOut.closeArchiveEntry();
    } else {
        tOut.closeArchiveEntry();
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                System.out.println(child.getName());
                addFileToTarGz(tOut, child.getAbsolutePath(), entryName + "/");
            }
        }
    }
}

Solution 2

I followed this solution and it worked until I was processing a larger set of files and it randomly crashes after processing 15000 - 16000 files. the following line is leaking file handlers:

IOUtils.copy(new FileInputStream(f), tOut);

and the code crashed with a "Too many open files" error at the OS level The following minor change fix the problem:

FileInputStream in = new FileInputStream(f);
IOUtils.copy(in, tOut);
in.close();

Solution 3

I ended up doing the following:

public URL createTarGzip() throws IOException {
    Path inputDirectoryPath = ...
    File outputFile = new File("/path/to/filename.tar.gz");

    try (FileOutputStream fileOutputStream = new FileOutputStream(outputFile);
            BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(fileOutputStream);
            GzipCompressorOutputStream gzipOutputStream = new GzipCompressorOutputStream(bufferedOutputStream);
            TarArchiveOutputStream tarArchiveOutputStream = new TarArchiveOutputStream(gzipOutputStream)) {

        tarArchiveOutputStream.setBigNumberMode(TarArchiveOutputStream.BIGNUMBER_POSIX);
        tarArchiveOutputStream.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU);

        List<File> files = new ArrayList<>(FileUtils.listFiles(
                inputDirectoryPath,
                new RegexFileFilter("^(.*?)"),
                DirectoryFileFilter.DIRECTORY
        ));

        for (int i = 0; i < files.size(); i++) {
            File currentFile = files.get(i);

            String relativeFilePath = new File(inputDirectoryPath.toUri()).toURI().relativize(
                    new File(currentFile.getAbsolutePath()).toURI()).getPath();

            TarArchiveEntry tarEntry = new TarArchiveEntry(currentFile, relativeFilePath);
            tarEntry.setSize(currentFile.length());

            tarArchiveOutputStream.putArchiveEntry(tarEntry);
            tarArchiveOutputStream.write(IOUtils.toByteArray(new FileInputStream(currentFile)));
            tarArchiveOutputStream.closeArchiveEntry();
        }
        tarArchiveOutputStream.close();
        return outputFile.toURI().toURL();
    }
}

This takes care of the some of the edge cases that come up in the other solutions.

Solution 4

I had to make some adjustments to @merrick solution to get it to work related to the path. Perhaps with the latest maven dependencies. The currently accepted solution didn't work for me.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.filefilter.DirectoryFileFilter;
import org.apache.commons.io.filefilter.RegexFileFilter;

public class TAR {

    public static void CreateTarGZ(String inputDirectoryPath, String outputPath) throws IOException {

        File inputFile = new File(inputDirectoryPath);
        File outputFile = new File(outputPath);

        try (FileOutputStream fileOutputStream = new FileOutputStream(outputFile);
                BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(fileOutputStream);
                GzipCompressorOutputStream gzipOutputStream = new GzipCompressorOutputStream(bufferedOutputStream);
                TarArchiveOutputStream tarArchiveOutputStream = new TarArchiveOutputStream(gzipOutputStream)) {

            tarArchiveOutputStream.setBigNumberMode(TarArchiveOutputStream.BIGNUMBER_POSIX);
            tarArchiveOutputStream.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU);

            List<File> files = new ArrayList<>(FileUtils.listFiles(
                    inputFile,
                    new RegexFileFilter("^(.*?)"),
                    DirectoryFileFilter.DIRECTORY
            ));

            for (int i = 0; i < files.size(); i++) {
                File currentFile = files.get(i);

                String relativeFilePath = inputFile.toURI().relativize(
                        new File(currentFile.getAbsolutePath()).toURI()).getPath();

                TarArchiveEntry tarEntry = new TarArchiveEntry(currentFile, relativeFilePath);
                tarEntry.setSize(currentFile.length());

                tarArchiveOutputStream.putArchiveEntry(tarEntry);
                tarArchiveOutputStream.write(IOUtils.toByteArray(new FileInputStream(currentFile)));
                tarArchiveOutputStream.closeArchiveEntry();
            }
            tarArchiveOutputStream.close();
        }
    }
}

Maven

        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>

        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-compress</artifactId>
            <version>1.18</version>
        </dependency>

Solution 5

Check below for Apache commons-compress and File walker examples.

This example tar.gz a directory.

public static void createTarGzipFolder(Path source) throws IOException {

        if (!Files.isDirectory(source)) {
            throw new IOException("Please provide a directory.");
        }

        // get folder name as zip file name
        String tarFileName = source.getFileName().toString() + ".tar.gz";

        try (OutputStream fOut = Files.newOutputStream(Paths.get(tarFileName));
             BufferedOutputStream buffOut = new BufferedOutputStream(fOut);
             GzipCompressorOutputStream gzOut = new GzipCompressorOutputStream(buffOut);
             TarArchiveOutputStream tOut = new TarArchiveOutputStream(gzOut)) {

            Files.walkFileTree(source, new SimpleFileVisitor<>() {

                @Override
                public FileVisitResult visitFile(Path file,
                                            BasicFileAttributes attributes) {

                    // only copy files, no symbolic links
                    if (attributes.isSymbolicLink()) {
                        return FileVisitResult.CONTINUE;
                    }

                    // get filename
                    Path targetFile = source.relativize(file);

                    try {
                        TarArchiveEntry tarEntry = new TarArchiveEntry(
                                file.toFile(), targetFile.toString());

                        tOut.putArchiveEntry(tarEntry);

                        Files.copy(file, tOut);

                        tOut.closeArchiveEntry();

                        System.out.printf("file : %s%n", file);

                    } catch (IOException e) {
                        System.err.printf("Unable to tar.gz : %s%n%s%n", file, e);
                    }

                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    System.err.printf("Unable to tar.gz : %s%n%s%n", file, exc);
                    return FileVisitResult.CONTINUE;
                }

            });

            tOut.finish();
        }

    }

This example extracts a tar.gz, and checks zip slip attack.

public static void decompressTarGzipFile(Path source, Path target)
        throws IOException {

        if (Files.notExists(source)) {
            throw new IOException("File doesn't exists!");
        }

        try (InputStream fi = Files.newInputStream(source);
             BufferedInputStream bi = new BufferedInputStream(fi);
             GzipCompressorInputStream gzi = new GzipCompressorInputStream(bi);
             TarArchiveInputStream ti = new TarArchiveInputStream(gzi)) {

            ArchiveEntry entry;
            while ((entry = ti.getNextEntry()) != null) {

                Path newPath = zipSlipProtect(entry, target);

                if (entry.isDirectory()) {
                    Files.createDirectories(newPath);
                } else {

                    // check parent folder again
                    Path parent = newPath.getParent();
                    if (parent != null) {
                        if (Files.notExists(parent)) {
                            Files.createDirectories(parent);
                        }
                    }

                    // copy TarArchiveInputStream to Path newPath
                    Files.copy(ti, newPath, StandardCopyOption.REPLACE_EXISTING);

                }
            }
        }
    }

    private static Path zipSlipProtect(ArchiveEntry entry, Path targetDir)
        throws IOException {

        Path targetDirResolved = targetDir.resolve(entry.getName());

        Path normalizePath = targetDirResolved.normalize();

        if (!normalizePath.startsWith(targetDir)) {
            throw new IOException("Bad entry: " + entry.getName());
        }

        return normalizePath;
    }

References

  1. https://mkyong.com/java/how-to-create-tar-gz-in-java/
  2. https://commons.apache.org/proper/commons-compress/examples.html
Share:
32,137
awfulHack
Author by

awfulHack

I &lt;3

Updated on August 14, 2020

Comments

  • awfulHack
    awfulHack over 3 years

    I'm running into a problem using the commons compress library to create a tar.gz of a directory. I have a directory structure that is as follows.

    parent/
        child/
            file1.raw
            fileN.raw
    

    I'm using the following code to do the compression. It runs fine without exceptions. However, when I try to decompress that tar.gz, I get a single file with the name "childDirToCompress". Its the correct size so the files have clearly been appended to each other in the tarring process. The desired output would be a directory. I can't figure out what I'm doing wrong. Can any wise commons compresser set me upon the correct path?

    CreateTarGZ() throws CompressorException, FileNotFoundException, ArchiveException, IOException {
                File f = new File("parent");
                File f2 = new File("parent/childDirToCompress");
    
                File outFile = new File(f2.getAbsolutePath() + ".tar.gz");
                if(!outFile.exists()){
                    outFile.createNewFile();
                }
                FileOutputStream fos = new FileOutputStream(outFile);
    
                TarArchiveOutputStream taos = new TarArchiveOutputStream(new GZIPOutputStream(new BufferedOutputStream(fos)));
                taos.setBigNumberMode(TarArchiveOutputStream.BIGNUMBER_STAR); 
                taos.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU);
                addFilesToCompression(taos, f2, ".");
                taos.close();
    
            }
    
            private static void addFilesToCompression(TarArchiveOutputStream taos, File file, String dir) throws IOException{
                taos.putArchiveEntry(new TarArchiveEntry(file, dir));
    
                if (file.isFile()) {
                    BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
                    IOUtils.copy(bis, taos);
                    taos.closeArchiveEntry();
                    bis.close();
                }
    
                else if(file.isDirectory()) {
                    taos.closeArchiveEntry();
                    for (File childFile : file.listFiles()) {
                        addFilesToCompression(taos, childFile, file.getName());
    
                    }
                }
            }
    
  • drigoangelo
    drigoangelo about 10 years
    For future reference, youd don't need to close all concatenated streams. Closing the outermost stream (in this case, tOut) will do.
  • krackoder
    krackoder over 7 years
    This code is giving me an error This archives contains unclosed entries. Any idea what could be the reason for that?
  • ldmtwo
    ldmtwo over 7 years
    Can you share the imports or lib names? I have these libraries (in Maven format), but FileUtils.listFiles() is not found with that erasure. Thanks. org.apache.commons:commons-collections4:4.1, org.apache.commons:commons-compress:1.12, commons-io:commons-io:2.4
  • ldmtwo
    ldmtwo over 7 years
    And the imports: import java.io.*; import java.net.URL; import java.nio.file.Path; import java.util.*; import org.apache.commons.compress.archivers.tar.*; import org.apache.commons.compress.compressors.gzip.*; import org.apache.commons.io.IOUtils; import org.apache.commons.io.filefilter.*;
  • Jire
    Jire almost 7 years
    If anybody is interested in a Kotlin solution that uses this fix and works perfectly, here you go: gist.github.com/Jire/8caa8603ef4871c79bef387f667a509f
  • stuart
    stuart about 6 years
    had the same problem @ldmtwo, just change "inputDirectoryPath" to an actual file (and you can just make the reference a collection and not a new arraylist)
  • conteh
    conteh over 5 years
    @awfulHack I'm also getting 'This archive contains unclosed entries'. Although not on directories that contain another directory before containing files. It's occurring on a directory that has files only and no subdirectories.
  • conteh
    conteh over 5 years
    @awfulHack This worked for me. This should be the solution. The current solution gives an error in some cases.
  • Ethan Shepherd
    Ethan Shepherd about 5 years
    If you are getting the unclosed entries error when running this code, make sure to check your imports. I was missing one (IOUtils) and the finally block was obscuring the issue.
  • Logic
    Logic almost 5 years
    I am getting java.io.FileNotFoundException: XXX/XXX/XXX (Too many open files)