How to calculate an MD5 checksum of a directory with Java or Groovy?

23,127

Solution 1

I wrote a function that calculates an MD5 checksum for a directory.

It uses FastMD5: http://www.twmacinta.com/myjava/fast_md5.php

Here is my code:

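  import com.twmacinta.util.MD5

  // Combines the hex MD5 of every file under fileDir into one MD5
  // (a hash of the per-file hashes, not of the raw contents).
  def MD5HashDirectory(String fileDir) {
    MD5 md5 = new MD5()
    // eachFileRecurse does not guarantee an order, so collect and sort first
    def files = []
    new File(fileDir).eachFileRecurse { file ->
      if (file.isFile()) {
        files << file
      }
    }
    files.sort { it.path }.each { file ->
      String hashFile = MD5.asHex(MD5.getHash(file))
      md5.Update(hashFile, null)
    }
    return md5.asHex()
  }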

Solution 2

I had the same requirement and chose my 'directory hash' to be an MD5 hash of the concatenated streams of all (non-directory) files within the directory. As crozin mentioned in comments on a similar question, you can use SequenceInputStream to act as a stream concatenating a load of other streams. I'm using Apache Commons Codec for the MD5 algorithm.

Basically, you recurse through the directory tree, adding FileInputStream instances to a Vector for non-directory files. Vector then conveniently has the elements() method to provide the Enumeration that SequenceInputStream needs to loop through. To the MD5 algorithm, this just appears as one InputStream.

A gotcha is that you need the files presented in the same order every time for the hash to be the same with the same inputs. The listFiles() method in File doesn't guarantee an ordering, so I sort by filename.

I was doing this for SVN controlled files, and wanted to avoid hashing the hidden SVN files, so I implemented a flag to avoid hidden files.

The relevant basic code is as below. (Obviously it could be 'hardened'.)

import org.apache.commons.codec.digest.DigestUtils;

import java.io.*;
import java.util.*;

public String calcMD5HashForDir(File dirToHash, boolean includeHiddenFiles) {

    assert (dirToHash.isDirectory());
    Vector<FileInputStream> fileStreams = new Vector<FileInputStream>();

    System.out.println("Found files for hashing:");
    collectInputStreams(dirToHash, fileStreams, includeHiddenFiles);

    SequenceInputStream seqStream = 
            new SequenceInputStream(fileStreams.elements());

    try {
        String md5Hash = DigestUtils.md5Hex(seqStream);
        seqStream.close();
        return md5Hash;
    }
    catch (IOException e) {
        throw new RuntimeException("Error reading files to hash in "
                                   + dirToHash.getAbsolutePath(), e);
    }

}

private void collectInputStreams(File dir,
                                 List<FileInputStream> foundStreams,
                                 boolean includeHiddenFiles) {

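    File[] fileList = dir.listFiles();
    if (fileList == null) {             // listFiles() returns null on an I/O error
        throw new IllegalStateException("Cannot list " + dir.getAbsolutePath());
    }
    Arrays.sort(fileList,               // Need in reproducible order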
                new Comparator<File>() {
                    public int compare(File f1, File f2) {                       
                        return f1.getName().compareTo(f2.getName());
                    }
                });

    for (File f : fileList) {
        if (!includeHiddenFiles && f.getName().startsWith(".")) {
            // Skip it
        }
        else if (f.isDirectory()) {
            collectInputStreams(f, foundStreams, includeHiddenFiles);
        }
        else {
            try {
                System.out.println("\t" + f.getAbsolutePath());
                foundStreams.add(new FileInputStream(f));
            }
            catch (FileNotFoundException e) {
                throw new AssertionError(e.getMessage()
                            + ": file should never not be found!");
            }
        }
    }

}
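
One thing to watch with the approach above is that it opens a FileInputStream for every file before hashing starts, which can run into open-file limits on large trees. Below is a minimal sketch of the same "hash of the concatenated contents" idea using only the standard library and keeping one file open at a time. The class name StreamingDirHash and the method md5OfDirectory are made up for illustration, hidden files are not filtered, and the files are ordered by relative path rather than per-directory by name, so the result is not guaranteed to match the value produced by the code above.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of any library.
public class StreamingDirHash {

    /** MD5 over the concatenated contents of all regular files under dir. */
    public static String md5OfDirectory(Path dir) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }

        // Sort by path relative to dir so the order is reproducible and
        // does not depend on where the directory lives.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile)
                        .sorted(Comparator.comparing((Path p) -> relativeName(dir, p)))
                        .collect(Collectors.toList());
        }

        byte[] buffer = new byte[8192];
        for (Path file : files) {
            // Only one file is open at any time.
            try (InputStream in = Files.newInputStream(file)) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    digest.update(buffer, 0, read);
                }
            }
        }
        return toHex(digest.digest());
    }

    // Relative path with '/' separators, so the sort key is the same on every OS.
    private static String relativeName(Path dir, Path file) {
        return dir.relativize(file).toString().replace(File.separatorChar, '/');
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

For the copy-then-verify use case in the question, you would call md5OfDirectory on both the source and the target directory and compare the two strings before deleting the source.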

Solution 3

Based on Stuart Rossiter's answer, but with cleaner code and hidden files handled properly:

import org.apache.commons.codec.digest.DigestUtils;

import java.io.*;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Vector;

public class Hashing {
    public static String hashDirectory(String directoryPath, boolean includeHiddenFiles) throws IOException {
        File directory = new File(directoryPath);
        
        if (!directory.isDirectory()) {
            throw new IllegalArgumentException("Not a directory");
        }

        Vector<FileInputStream> fileStreams = new Vector<>();
        collectFiles(directory, fileStreams, includeHiddenFiles);

        try (SequenceInputStream sequenceInputStream = new SequenceInputStream(fileStreams.elements())) {
            return DigestUtils.md5Hex(sequenceInputStream);
        }
    }

    private static void collectFiles(File directory, List<FileInputStream> fileInputStreams,
                                     boolean includeHiddenFiles) throws IOException {
        File[] files = directory.listFiles();

        if (files != null) {
            Arrays.sort(files, Comparator.comparing(File::getName));

            for (File file : files) {
                if (includeHiddenFiles || !Files.isHidden(file.toPath())) {
                    if (file.isDirectory()) {
                        collectFiles(file, fileInputStreams, includeHiddenFiles);
                    } else {
                        fileInputStreams.add(new FileInputStream(file));
                    }
                }
            }
        }
    }
}
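
A limitation of hashing only the concatenated contents is that renaming a file, or moving bytes from the end of one file to the start of the next, leaves the hash unchanged. The sketch below is one possible way to cover that: for each file it feeds the relative path, a separator byte, and the content length into the digest before the content itself. The class name PathAwareDirHash is hypothetical, the framing (NUL byte plus 8-byte length) is my own assumption rather than any standard, and hidden files are not filtered. It uses relative rather than absolute paths so that a source directory and its copy can still produce the same value.

```java
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of any library.
public class PathAwareDirHash {

    /** MD5 over (relative path, content length, content) of every regular file under dir. */
    public static String hashDirectory(Path dir) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");

        // Reproducible order: sort by path relative to dir.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile)
                        .sorted(Comparator.comparing((Path p) -> relativeName(dir, p)))
                        .collect(Collectors.toList());
        }

        for (Path file : files) {
            byte[] name = relativeName(dir, file).getBytes(StandardCharsets.UTF_8);
            // Reads the whole file into memory: fine for small files, stream large ones instead.
            byte[] content = Files.readAllBytes(file);

            digest.update(name);
            digest.update((byte) 0); // marks the end of the name
            digest.update(ByteBuffer.allocate(Long.BYTES).putLong(content.length).array());
            digest.update(content);
        }
        return toHex(digest.digest());
    }

    // Relative path with '/' separators, so the value is the same on every OS.
    private static String relativeName(Path dir, Path file) {
        return dir.relativize(file).toString().replace(File.separatorChar, '/');
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

With this variant, two directories that hold the same bytes under different file names produce different hashes, while a faithful copy at another location produces the same one.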

Solution 4

HashCopy is a Java application. It can generate and verify MD5 and SHA on a single file or a directory recursively. I am not sure if it has an API. It can be downloaded from www.jdxsoftware.org.

Solution 5

If you need to do this in a Gradle build file, it's much simpler than with plain Groovy.

Here's an example:

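import java.security.MessageDigest

// Collect the files to hash and sort them so the order is reproducible
def sources = fileTree('rootDir').matching {
    include 'src/*', 'build.gradle'
}.sort { it.name }
def digest = MessageDigest.getInstance('SHA-1')
// Feed each file's bytes into a single digest
sources.each { digest.update(it.bytes) }
// Hex string of the combined hash
digest.digest().encodeHex().toString()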

MessageDigest is part of the Java standard library: https://docs.oracle.com/javase/8/docs/api/java/security/MessageDigest.html

Every implementation of the Java platform is required to support these algorithms:

MD5
SHA-1
SHA-256
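
For reference, here is a small plain-Java sketch showing that only the algorithm name passed to MessageDigest.getInstance changes between the three. The class name DigestAlgorithms and the helper hexDigest are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class, not part of any library.
public class DigestAlgorithms {

    /** Hex-encoded digest of data using the named algorithm. */
    public static String hexDigest(String algorithm, byte[] data) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        byte[] hash = digest.digest(data);
        StringBuilder sb = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] data = "abc".getBytes(StandardCharsets.UTF_8);
        // The hex strings are 32, 40 and 64 characters long respectively.
        for (String algorithm : new String[] {"MD5", "SHA-1", "SHA-256"}) {
            System.out.println(algorithm + ": " + hexDigest(algorithm, data));
        }
    }
}
```

Switching a directory hash from MD5 to SHA-256 is then just a matter of changing that one string.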
Author by

Fabien Barbier

Updated on May 23, 2021

Comments

  • Fabien Barbier
    Fabien Barbier almost 3 years

    I want to use Java or Groovy to get the MD5 checksum of an entire directory.

    I have to copy directories from source to target, checksum the source and the target, and then delete the source directories.

    I found this script for files, but how can I do the same thing with directories?

    import java.security.MessageDigest
    
    def generateMD5(final file) {
        MessageDigest digest = MessageDigest.getInstance("MD5")
        file.withInputStream(){ is ->
            byte[] buffer = new byte[8192]
            int read = 0
            while( (read = is.read(buffer)) > 0) {
                digest.update(buffer, 0, read);
            }
        }
        byte[] md5sum = digest.digest()
        BigInteger bigInt = new BigInteger(1, md5sum)
    
        return bigInt.toString(16).padLeft(32, '0')
    }
    

    Is there a better approach?

    • Dónal
      Dónal almost 14 years
      You should use one of the org.apache.commons.codec.digest.DigestUtils.md5Hex methods in preference to the code above
    • Fabien Barbier
      Fabien Barbier almost 14 years
      I found FastMD5, which makes it really easy to get a file's MD5: String hash = MD5.asHex(MD5.getHash(new File(filename))); It is easier to use and faster.
  • Fabien Barbier
    Fabien Barbier almost 14 years
    With Groovy (and probably also with Java), it can be useful to use Ant (or, better, Gant). See: ant.apache.org/manual/dirtasks.html
  • Justin Piper
    Justin Piper over 11 years
    That's actually hashing the hashes of the contents of the files, rather than just hashing the contents.
  • Fabien Barbier
    Fabien Barbier over 11 years
    Nice app, but a license is required if HashCopy is used for commercial purposes.
  • Omri Spector
    Omri Spector over 10 years
    Good answer - I decided to use your code. Note though that it is not portable - checking for hidden files should be based on "isHidden", not on the file name. In some cases it is probably best to check both as some programs (e.g. Eclipse) save work files starting with a dot (e.g. .classpath) without making them hidden on non-Unix OS.
  • Stuart Rossiter
    Stuart Rossiter over 10 years
    @OmriSpector Yes, good point re the non-portability and glad you found the snippet useful. That bit was quick and dirty code; I did say "Obviously it could be 'hardened'" :-)
  • best wishes
    best wishes about 6 years
    The answer is awesome, but it misses the case where file names change while keeping the same alphabetical order; to catch that, we can also hash the absolute path of each file.