Can you use OpenSSL to generate an md5 or sha hash on a directory of files?


Solution 1

You could recursively generate all the hashes, concatenate the hashes into a single file, then generate a hash of that file.
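
For example (a sketch; all.md5 is an arbitrary file name, and sorting the per-file lines keeps the final hash independent of find's traversal order):

find . -type f -exec openssl dgst -md5 -r {} + | sort > all.md5
openssl dgst -md5 all.md5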

Solution 2

OpenSSL can't compute a single cumulative hash across all of the files directly, but you can compress them first and then hash the archives:

$ tar -czpf archive1.tar.gz folder1/
$ tar -czpf archive2.tar.gz folder2/
$ openssl md5 archive1.tar.gz archive2.tar.gz

To recursively hash each file instead:

$ find . -type f -exec openssl md5 {} +

Solution 3

You are probably interested in producing the digest in coreutils format (identical to the output of md5sum -b).

The command could then be:

find . -path '*/.svn' -prune -o -type f -print | sort | tr '\n' '\0' | xargs -0 openssl dgst -md5 -r

(sort works line by line, so the file names are printed newline-separated, sorted, and only then converted to NULs for xargs -0; this assumes no file names contain newlines)

or, redirecting the output to a file:

find . -path '*/.svn' -prune -o -type f -print | sort | tr '\n' '\0' | xargs -0 openssl dgst -md5 -r > ../mydigest.md5
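
Because -r emits coreutils-style lines, a digest file like this should be checkable with md5sum itself (run from the directory the digest was created in):

md5sum -c ../mydigest.md5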

Solution 4

It's better to list a hash for each file, and check each hash. If you make a hash from all files, and one of them becomes corrupt, you won't know which one is corrupt. But if you list hashes for every file, a script can tell you when any hash doesn't match (which will tell you that a file is corrupt or changed).

Also, recursive hashing with find is simpler and needs less piping:

find . -type f -print0 | xargs -0 openssl dgst -sha256 -r >> hashes.sha256

You'll want to append the output via >>, because xargs invokes openssl several times, though only as often as needed to process all files (not, e.g., once per file). -r selects the coreutils hash-file syntax. Don't use OpenSSL's -out option with xargs, because each invocation would overwrite the file. You may also want to capture stderr, in case OpenSSL can't read or open some files: 2>> error.log
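
Putting that together (with stderr appended to a log, as suggested):

find . -type f -print0 | xargs -0 openssl dgst -sha256 -r >> hashes.sha256 2>> error.log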

If storage isn't the bottleneck, you can use xargs' -P n option to run several OpenSSL processes in parallel (not recommended for spinning hard drives).
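
For example, with GNU xargs (4 processes is an arbitrary choice; -n caps the number of files per invocation so that several processes actually get started):

find . -type f -print0 | xargs -0 -P 4 -n 100 openssl dgst -sha256 -r >> hashes.sha256

With -P, lines from concurrent invocations can land in the digest file in any order, so sort it before comparing two runs.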

Note: GNU coreutils (md5sum etc.) can be built against OpenSSL's libcrypto for hashing. You may still want to call OpenSSL directly if your coreutils are very outdated: support for hardware SHA acceleration was only added to OpenSSL relatively recently. SHA-1/SHA-256 can be faster than MD5 even without acceleration, and are definitely in the gigabit/s range with it.

Solution 5

Doing an MD5 sum on the tar would never match unless all of the metadata (modification times, ownership, etc.) was identical as well, because tar stores that metadata as part of its archive.

I would probably do an md5 sum of the contents of all of the files:

find folder1 -type f | sort | tr '\n' '\0' | xargs -0 cat | openssl md5
find folder2 -type f | sort | tr '\n' '\0' | xargs -0 cat | openssl md5
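
One way to compare the two trees is to compare the printed digests, e.g.:

test "$(find folder1 -type f | sort | tr '\n' '\0' | xargs -0 cat | openssl md5)" = "$(find folder2 -type f | sort | tr '\n' '\0' | xargs -0 cat | openssl md5)" && echo identical

Note that this hashes file contents in sorted-name order, so the file names themselves are not part of the digest.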

Comments

  • Alexander (almost 2 years ago): I'm interested in storing an indicator of file / directory integrity between two archived copies of directories. It's around 1 TB of data stored recursively on hard drives. Is there a way, using OpenSSL, to generate a single hash for all the files that can be used as a comparison between two copies of the data, or at a later point to verify the data has not changed?

  • Alexander (over 14 years ago): 1 TB of data; no room to tar them. Is there a way to recursively generate hashes of all the files?
  • John T (over 14 years ago): Yes, added it to my answer.
  • akira (over 14 years ago): Nice tar idea, but it's not always applicable; the find method is better in general. If there is no room for the tarball: % tar -cf - folder | openssl md5
  • Victor Rocheron (over 10 years ago): For a single command, something like md5 -q <(find . -type f 2>/dev/null | xargs md5 -q | sort) works well in Bash and doesn't require a temp file. Alter it if your system uses md5sum instead of md5. Also be aware that sort can behave differently on different platforms, which will affect the final checksum if the order differs. Add flags like ! -name ".DS_Store" to the find component to ignore certain files, such as the .DS_Store files on Mac OS X that can throw off the checksum since they're generated by the OS.