Linux: compute a single hash for a given folder & contents?


Solution 1

One possible way would be:

sha1sum path/to/folder/* | sha1sum

If there is a whole directory tree, you're probably better off using find and xargs. One possible command would be

find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum

And, finally, if you also need to take account of permissions and empty directories:

(find path/to/folder -type f -print0  | sort -z | xargs -0 sha1sum;
 find path/to/folder \( -type f -o -type d \) -print0 | sort -z | \
   xargs -0 stat -c '%n %a') \
| sha1sum

The arguments to stat will cause it to print the name of each file, followed by its octal permissions. The two finds run one after the other, causing roughly double the disk IO: the first finds all file names and checksums their contents, the second finds all file and directory names and prints each name and mode. The list of "file names and checksums", followed by "file and directory names, with permissions", is then checksummed, producing a single smaller checksum.
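Pulling in suggestions from the comments below, here is a minimal sketch of how the basic pipeline might be wrapped up for reuse. It changes into the target directory first so the resulting hash does not depend on the absolute path, and pins the locale so the sort order is reproducible across machines. The function name hash_dir is just an illustrative choice, and GNU coreutils (sha1sum, sort -z, xargs -0) are assumed:

# Sketch only: hash a directory's file contents, independent of its absolute path.
# "hash_dir" is a hypothetical helper name; assumes GNU coreutils.
hash_dir() {
    (
        cd "$1" || exit 1
        export LC_ALL=POSIX    # locale-independent sort order
        find . -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
    )
}

# Example: hash_dir path/to/folder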

Solution 2

  • Use a file system intrusion detection tool like aide.

  • hash a tar ball of the directory:

    tar cvf - /path/to/folder | sha1sum

  • Code something yourself, like vatine's oneliner:

    find /path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
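As a usage sketch for the one-liner above (dir-a and dir-b are hypothetical directory names), comparing two trees then reduces to comparing two one-line hashes; cd'ing into each tree first keeps absolute paths out of the result:

# Hypothetical example: compare two directory trees by their content hashes.
hash_a=$( (cd dir-a && find . -type f -print0 | sort -z | xargs -0 sha1sum) | sha1sum )
hash_b=$( (cd dir-b && find . -type f -print0 | sort -z | xargs -0 sha1sum) | sha1sum )
[ "$hash_a" = "$hash_b" ] && echo "identical contents" || echo "trees differ"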

Solution 3

If you just want to check if something in the folder changed, I'd recommend this one:

ls -alR --full-time /folder/of/stuff | sha1sum

It just gives you a hash of the ls output, which contains folders, sub-folders, their files, and their timestamps, sizes and permissions. Pretty much everything you would need to determine whether something has changed.

Please note that this command will not generate a hash for each file, which is why it should be faster than using find.
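For a simple change-detection workflow, a sketch along these lines might be used: compute the hash, compare it with the one stored on the previous run, and update the stored copy. The file name .folder.sha1 is a hypothetical choice:

# Sketch: has anything under /folder/of/stuff changed since the last run?
# ".folder.sha1" is a hypothetical location for the previously stored hash.
current=$(ls -alR --full-time /folder/of/stuff | sha1sum)
if [ -f .folder.sha1 ] && [ "$current" = "$(cat .folder.sha1)" ]; then
    echo "no changes detected"
else
    echo "folder changed (or first run)"
    printf '%s\n' "$current" > .folder.sha1
fi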

Solution 4

You can do tar -c /path/to/folder | sha1sum

Solution 5

So far the fastest way to do it is still with tar, and with several additional parameters we can also get rid of the differences caused by metadata.

To use tar to hash a directory, you need to make sure the paths are sorted during archiving; otherwise the output is always different.

tar -C <root-dir> -cf - --sort=name <dir> | sha256sum

ignore time

If you do not care about access or modification times, add something like --mtime='UTC 2019-01-01' to make sure all timestamps are the same.

ignore ownership

Usually we also need to add --group=0 --owner=0 --numeric-owner to normalize the ownership metadata.

ignore some files

Use --exclude=PATTERN to skip files you do not want included in the hash.
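Putting these options together, a sketch of a fully pinned invocation might look like the following (it assumes GNU tar 1.28 or newer for --sort=name; <root-dir> and <dir> are placeholders as above, and the '*.log' exclude pattern is only an example):

tar -C <root-dir> -cf - \
    --sort=name \
    --mtime='UTC 2019-01-01' \
    --group=0 --owner=0 --numeric-owner \
    --exclude='*.log' \
    <dir> | sha256sum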

Updated on July 21, 2022

Comments

  • Ben L
    Ben L almost 2 years

    Surely there must be a way to do this easily!

    I've tried the Linux command-line apps such as sha1sum and md5sum but they seem only to be able to compute hashes of individual files and output a list of hash values, one for each file.

    I need to generate a single hash for the entire contents of a folder (not just the filenames).

    I'd like to do something like

    sha1sum /folder/of/stuff > singlehashvalue
    

    Edit: to clarify, my files are at multiple levels in a directory tree, they're not all sitting in the same root folder.

  • leo7r
    leo7r over 15 years
    If you sort after the first sha1sum, then a LF in a filename should do no harm.
  • Aaron Digulla
    Aaron Digulla over 15 years
    Edited. Sort can work on 0 delimited lists with the -z option.
  • David Schmitt
    David Schmitt over 15 years
    and don't forget to set LC_ALL=POSIX, so the various tools create locale independent output.
  • slowdog
    slowdog over 13 years
    If you want to replicate that checksum on a different machine, tar might not be a good choice, as the format seems to have room for ambiguity and exist in many versions, so the tar on another machine might produce different output from the same files.
  • Bruno Bronosky
    Bruno Bronosky about 13 years
    I found cat | sha1sum to be considerably faster than sha1sum | sha1sum. YMMV, try each of these on your system: time find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum; time find path/to/folder -type f -print0 | sort -z | xargs -0 cat | sha1sum
  • mivk
    mivk about 12 years
    for F in 'find ...' ... doesn't work when you have spaces in names (which you always do nowadays).
  • Vatine
    Vatine over 11 years
    @RichardBronosky - Let us assume we have two files, A and B. A contains "foo" and B contains "bar was here". With your method, we would not be able to separate that from two files C and D, where C contains "foobar" and D contains " was here". By hashing each file individually and then hash all "filename hash" pairs, we can see the difference.
  • Bruno Bronosky
    Bruno Bronosky over 11 years
    +1 for the tar solution. That is the fastest, but drop the v. verbosity only slows it down.
  • robbles
    robbles over 11 years
    To make this work irrespective of the directory path (i.e. when you want to compare the hashes of two different folders), you need to use a relative path and change to the appropriate directory, because the paths are included in the final hash: find ./folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
  • Vatine
    Vatine over 11 years
    @robbles That is correct and why I did not put an initial / on the path/to/folder bit.
  • hopla
    hopla over 11 years
    You could also have your hashtool print out only the hashes, on FreeBSD for example: xargs -0 sha256 -q (Also, in your anwser, you might want to draw attention to the fact that (absolute) filenames are printed out with the hashes)
  • Vatine
    Vatine over 11 years
    @hopla Relativified paths throughout instead of just in the final example.
  • nos
    nos over 11 years
note that the tar solution assumes the files are in the same order when you compare them. Whether they are would depend on the file system the files reside in when doing the comparison.
  • hopla
    hopla over 11 years
Much clearer :) I've also been thinking that using relative paths is better than the -q option, because then all the file names are taken into account in the final hash as well, avoiding problems should a hash collision ever occur.
  • Mamoun Benghezal
    Mamoun Benghezal almost 9 years
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
  • Vatine
    Vatine over 8 years
@JasonS Define "large"? You're looking at roughly linear run-time in pure data volume (consumed by sha1sum or equivalent hashing). You're looking at (roughly) linear performance from find. Sorting is probably O(n log n), with "number of files" as n. Until growth in "log n" starts being significant, time will be dominated by the disk bandwidth. Waving a hand vaguely in the air, I'd say you'd be OK for "tens to hundreds of thousands of files". At some point, the list of hashes-per-file to sort may require spilling to disk, so there's going to be a vicious cliff in the time complexity curve.
  • Jason S
    Jason S over 8 years
    no, I'm worried about the large command-line; xargs makes a single call to sha1sum, right? is there a limit in command-line size?
  • Vatine
    Vatine over 8 years
@JasonS Ah, no, the reason for xargs is that it intelligently splits the incoming stream of "filenames to hash" from find into suitable chunks (it defaults to something fairly low; basically, it depends on the system, but the default should always be safe).
  • Binary Phile
    Binary Phile over 8 years
    slowdog's valid concerns notwithstanding, if you care about file contents, permissions, etc. but not modification time, you can add the --mtime option like so: tar -c /path/to/folder --mtime="1970-01-01" | sha1sum.
  • Binary Phile
    Binary Phile over 8 years
    While this command looks to work well for a certain use case, it doesn't seem to include what may be relevant details such as directory names as well as file permissions. I'm sure there's more than one way to skin the cat though.
  • Vatine
    Vatine over 8 years
    @BinaryPhile That is correct, but not what the question originally asked for. All directories with contents will have their names as part of the final hash, though (they're part of the file names). It would be possible to include the permissions, but would require (some) thought, as a plain "ls -l" would include date and time information that is (probably) not relevant.
  • CMCDragonkai
    CMCDragonkai over 8 years
    So this doesn't capture the permissions?
  • CMCDragonkai
    CMCDragonkai over 8 years
    This also doesn't capture empty directories.
  • Vatine
    Vatine over 8 years
@CMCDragonkai No, it only captures file contents, making sure to respect file boundaries. If you also want to include permissions and empty directories, it would be possible to add something like find path/to/folder \( -type f -o -type d \) -print0 | sort -z | xargs -0 stat -c "%n %a". Let me edit the question...
  • Mark Kreyman
    Mark Kreyman almost 8 years
    To account for differences in sort algorithms between my Mac and RHEL 5.x server, I had to slightly modify the command: find ./folder -type f -print0 | xargs -0 sha1sum | sort -df | sha1sum
  • Dave C
    Dave C over 7 years
    I'm unsure why this doesn't have more upvotes given the simplicity of the solution. Can anyone explain why this wouldn't work well?
  • Ryota
    Ryota over 7 years
    I suppose this isn't ideal as the generated hash will be based on file owner, date-format setup, etc.
  • Shumoapp
    Shumoapp over 7 years
    The ls command can be customized to output whatever you want. You can replace -l with -gG to omit the group and the owner. And you can change the date format with the --time-style option. Basically check out the ls man page and see what suits your needs.
  • Kasun Siyambalapitiya
    Kasun Siyambalapitiya almost 7 years
@S.Lott if the directory is very big, zipping it and getting an md5 of it will take more time
  • Zoltan
    Zoltan about 6 years
The git hash is not suitable for this purpose since file contents are only a part of its input. Even for the initial commit of a branch, the hash is affected by the commit message and the commit metadata as well, like the time of the commit. If you commit the same directory structure multiple times, you will get a different hash every time, thus the resulting hash is not suitable for determining whether two directories are exact copies of each other by only sending the hash over.
  • Navin
    Navin almost 6 years
    @DaveC Because it's pretty much useless. If you want to compare filenames, just compare them directly. They're not that big.
  • yashma
    yashma almost 6 years
    @Navin From the question it is not clear whether it is necessary to hash file contents or detect a change in a tree. Each case has its uses. Storing 45K filenames in a kernel tree, for example, is less practical than a single hash. ls -lAgGR --block-size=1 --time-style=+%s | sha1sum works great for me
  • Bernard
    Bernard about 5 years
    Be careful with find. Running the script on find /some/path/dir1 -type f ... and find /someother/path/dir2 -type f ... will return different checksums even if the content of dir1 and dir2 is identical. You need to cd /some/path/dir1 before calling find . -type f ...
  • hobbs
    hobbs about 5 years
    @Zoltan the git hash is perfectly fine, if you use a tree hash and not a commit hash.
  • Zoltan
    Zoltan about 5 years
    @hobbs The answer originally stated "commit hash", which is certainly not fit for this purpose. The tree hash sounds like a much better candidate, but there could still be hidden traps. One that comes to my mind is that having the executable bit set on some files changes the tree hash. You have to issue git config --local core.fileMode false before committing to avoid this. I don't know whether there are any more caveats like this.
  • thinktt
    thinktt over 4 years
I'm having an issue where the xargs output, the list of hashes for my files, is not reliably coming out in the same order. Any idea why that might be happening? Could it be an issue with the sort command?
  • thinktt
    thinktt over 4 years
This seems much simpler than the accepted answer for hashing a directory. I wasn't finding the accepted answer reliable. One issue... is there a chance the hashes could come out in a different order? sha256sum /tmp/thd-agent/* | sort is what I'm trying for a reliable ordering, then just hashing that.
  • NVRM
    NVRM over 4 years
Hi, it looks like the hashes come in alphabetical order by default. What do you mean by reliable ordering? You have to organize all that by yourself. For example using associative arrays, entry + hash. Then you sort this array by entry, which gives a list of computed hashes in the sort order. I believe you can use a JSON object otherwise, and hash the whole object directly.
  • Vatine
    Vatine over 4 years
    @thinktt No obvious idea why. You could try replacing xargs with echo to check that the arguments are being passed through in a consistent order. Also remember that you (probably) want to ensure you're not using any localisation for sorting.
  • thinktt
    thinktt over 4 years
If I understand correctly, you're saying it hashes the files in alphabetical order. That seems right. Something in the accepted answer above was giving me intermittently different orders, so I'm just trying to make sure that doesn't happen again. I'm going to stick with putting sort at the end. It seems to be working. The only issue I see with this method versus the accepted answer is that it doesn't deal with nested folders. In my case I don't have any folders, so this works great.
  • NVRM
    NVRM over 4 years
    what about ls -r | sha256sum ?
  • Ferit
    Ferit about 4 years
    Can you give a brief example to get a robust and clean sha256 of a folder, maybe for a Windows folder with three subdirectories and a few files in there each?
  • John McGehee
    John McGehee almost 4 years
    For many applications this approach is superior. Hashing just the source code files gets a sufficiently unique hash in a lot less time.
  • Gi0rgi0s
    Gi0rgi0s almost 4 years
    @NVRM tried it and it just checked for file name changes, not the file content
  • Andrew Klaassen
    Andrew Klaassen over 3 years
    @nos: With recent versions of GNU tar, sort order can be enforced with --sort=name.
  • Andrew Klaassen
    Andrew Klaassen over 3 years
    This is the best answer involving GNU tar, since it ensures that file contents and directory structure are consistently compared.
  • tcrafton
    tcrafton about 3 years
    Warning: not all versions of tar have --sort :-(
  • tcrafton
    tcrafton about 3 years
    this needs a | sort before the last sha1sum to get consistent results (unless tqdm takes care of that? I didn't test with tqdm)
  • Sergey Lukin
    Sergey Lukin over 2 years
    finally something that is consistent across environments. THANKS!!
  • FarisHijazi
    FarisHijazi over 2 years
That's correct. I just added that without seeing your comment, and now I wish I had seen yours before.
  • Torsten Bronger
    Torsten Bronger over 2 years
    Note that even for the highly regarded rsync, comparing timestamps and file sizes is sufficient by default.
  • Ejdrien
    Ejdrien over 2 years
    This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review
  • Gabriel Staples
    Gabriel Staples about 2 years
    This answer doesn't produce identical hashes for identical folders in different locations on your file system. That's a big short-coming. I explain why, and present a fix to it, as well as two bash functions I wrote: sha256sum_dir and diff_dir, in my new answer here.
  • M Imam Pratama
    M Imam Pratama about 2 years
    Use shopt -s globstar, so we can do it recursively: sha1sum path/to/folder/** | sha1sum