Linux: compute a single hash for a given folder & contents?
Solution 1
One possible way would be:
sha1sum path/to/folder/* | sha1sum
If there is a whole directory tree, you're probably better off using find and xargs. One possible command would be
find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
And, finally, if you also need to take account of permissions and empty directories:
(find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum;
find path/to/folder \( -type f -o -type d \) -print0 | sort -z | \
xargs -0 stat -c '%n %a') \
| sha1sum
The arguments to stat will cause it to print the name of each file, followed by its octal permissions. The two finds will run one after the other, roughly doubling the amount of disk IO: the first finds all file names and checksums the contents, the second finds all file and directory names and prints name and mode. The list of "file names and checksums", followed by "names of files and directories, with permissions", will then be checksummed, producing a single final checksum.
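As a sketch of the full recipe, the pipeline can be wrapped in a small function, with LC_ALL=POSIX exported (as suggested in the comments below) so that sort order is locale-independent. The function name hash_tree is just an illustrative choice, and GNU coreutils/findutils (sha1sum, stat -c, sort -z) are assumed:

```shell
#!/bin/sh
# Sketch: hash file contents plus names and permissions of all files
# and directories under a tree. GNU find/sort/xargs/stat/sha1sum are
# assumed; "hash_tree" is an illustrative name, not from the answer.
export LC_ALL=POSIX  # locale-independent sort order -> reproducible hashes

hash_tree() {
    dir=$1
    (find "$dir" -type f -print0 | sort -z | xargs -0 sha1sum
     find "$dir" \( -type f -o -type d \) -print0 | sort -z | \
         xargs -0 stat -c '%n %a') | sha1sum
}
```

Note that the paths as given to find end up in the hashed stream, so to compare two different locations you would cd into each directory and run hash_tree . in both places.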
Solution 2
Use a file system intrusion detection tool like aide.
-
hash a tar ball of the directory:
tar cvf - /path/to/folder | sha1sum
-
Code something yourself, like Vatine's one-liner:
find /path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
Solution 3
If you just want to check if something in the folder changed, I'd recommend this one:
ls -alR --full-time /folder/of/stuff | sha1sum
It will just give you a hash of the ls output, which contains the folders, sub-folders, their files, and their timestamps, sizes and permissions. Pretty much everything that you would need to determine if something has changed.
Please note that this command will not generate a hash for each file; that is why it should be faster than using find.
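As a sketch of how this could be used for change detection (the function and state-file names here are illustrative, not part of the answer; GNU ls is assumed for --full-time):

```shell
#!/bin/sh
# Sketch: cheap change detection by hashing the recursive listing.
# This hashes the ls output (names, sizes, timestamps, permissions),
# not file contents, so it is fast but can miss an edit that leaves
# size and timestamp identical. Function names are illustrative.
tree_fingerprint() {
    ls -alR --full-time "$1" | sha1sum | awk '{print $1}'
}

# Compare against a fingerprint stored in a state file, then update it.
check_changed() {
    new=$(tree_fingerprint "$1")
    old=$(cat "$2" 2>/dev/null)
    printf '%s\n' "$new" > "$2"
    [ "$new" = "$old" ] && echo unchanged || echo changed
}
```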
Solution 4
You can do tar -c /path/to/folder | sha1sum
Solution 5
So far the fastest way to do it is still with tar. And with several additional parameters we can also get rid of the differences caused by metadata.
To use tar to hash the directory, you need to sort the paths during tar, otherwise the result is always different:
tar -C <root-dir> -cf - --sort=name <dir> | sha256sum
ignore time
If you do not care about access or modification time, use something like --mtime='UTC 2019-01-01' to make sure all timestamps are the same.
ignore ownership
Usually we need to add --group=0 --owner=0 --numeric-owner to unify the owner metadata.
ignore some files
use --exclude=PATTERN
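Putting those flags together, a metadata-insensitive hash might be sketched as below. GNU tar is assumed; --format=gnu is an extra assumption on my part, added so that pax extended headers carrying additional timestamps cannot leak into the archive, and the function name dir_hash is illustrative:

```shell
#!/bin/sh
# Sketch: reproducible directory hash with GNU tar.
# dir_hash ROOT SUBDIR prints one sha256 for SUBDIR (relative to ROOT),
# independent of entry order, mtimes and ownership. Add --exclude=PATTERN
# to skip files. --format=gnu (an assumption, not from the answer) keeps
# pax headers with extra timestamps out of the archive.
dir_hash() {
    tar -C "$1" \
        --format=gnu \
        --sort=name \
        --mtime='UTC 2019-01-01' \
        --group=0 --owner=0 --numeric-owner \
        -cf - "$2" | sha256sum | awk '{print $1}'
}
```

Because paths are taken relative to the -C directory, the same tree hashed from two different parent locations yields the same value.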
Updated on July 21, 2022

Comments
-
Ben L almost 2 years
Surely there must be a way to do this easily!
I've tried the Linux command-line apps such as sha1sum and md5sum, but they seem only to be able to compute hashes of individual files and output a list of hash values, one for each file. I need to generate a single hash for the entire contents of a folder (not just the filenames).
I'd like to do something like
sha1sum /folder/of/stuff > singlehashvalue
Edit: to clarify, my files are at multiple levels in a directory tree, they're not all sitting in the same root folder.
-
leo7r over 15 yearsIf you sort after the first sha1sum, then a LF in a filename should do no harm.
-
Aaron Digulla over 15 yearsEdited. Sort can work on 0 delimited lists with the -z option.
-
David Schmitt over 15 yearsand don't forget to set LC_ALL=POSIX, so the various tools create locale independent output.
-
slowdog over 13 yearsIf you want to replicate that checksum on a different machine, tar might not be a good choice, as the format seems to have room for ambiguity and exist in many versions, so the tar on another machine might produce different output from the same files.
-
Bruno Bronosky about 13 yearsI found cat | sha1sum to be considerably faster than sha1sum | sha1sum. YMMV, try each of these on your system: time find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum; time find path/to/folder -type f -print0 | sort -z | xargs -0 cat | sha1sum
-
mivk about 12 years
for F in `find ...` ... doesn't work when you have spaces in names (which you always do nowadays).
-
Vatine over 11 years@RichardBronosky - Let us assume we have two files, A and B. A contains "foo" and B contains "bar was here". With your method, we would not be able to separate that from two files C and D, where C contains "foobar" and D contains " was here". By hashing each file individually and then hash all "filename hash" pairs, we can see the difference.
-
Bruno Bronosky over 11 years+1 for the tar solution. That is the fastest, but drop the v. verbosity only slows it down.
-
robbles over 11 yearsTo make this work irrespective of the directory path (i.e. when you want to compare the hashes of two different folders), you need to use a relative path and change to the appropriate directory, because the paths are included in the final hash:
find ./folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
-
Vatine over 11 years@robbles That is correct and why I did not put an initial / on the path/to/folder bit.
-
hopla over 11 yearsYou could also have your hashtool print out only the hashes, on FreeBSD for example: xargs -0 sha256 -q (Also, in your anwser, you might want to draw attention to the fact that (absolute) filenames are printed out with the hashes)
-
Vatine over 11 years@hopla Relativified paths throughout instead of just in the final example.
-
nos over 11 yearsnote that the tar soluition assumes the files are in the same order when you compare them. Whether they are would depend on the file system the files resides in when doing the comparison.
-
hopla over 11 yearsMuch clearer :) I've also been think that using relative paths is better than the -q option, because then all the file names are taken into account in the final hash as well, avoiding problems should a hash collision ever occur.
-
Mamoun Benghezal almost 9 yearsWhile this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
-
Vatine over 8 years@JasonS Define "large"? You're looking at roughly linear run-time in pure data volume (consumed by sha1sum or equivalent hashing). You're looking at (roughly) linear performance from find. Sorting is probably O(n log n), with "number of files" as n. Until growth in "log n" starts being significant, time will be dominated by the disk bandwidth. Waving a hand vaguely in the air, I'd say you'd be OK for "tens to hundreds of thousands of files". At some point, the list of hashes-per-file to sort may require spilling to disk, so there's going to be a vicious cliff in the time complexity curve.
-
Jason S over 8 yearsno, I'm worried about the large command-line; xargs makes a single call to sha1sum, right? is there a limit in command-line size?
-
Vatine over 8 years@JasonS Ah, no, the reason for xargs is that it intelligently splits the incoming stream of "filenames to hash" from find into suitable chunks (the chunk size defaults to something low; basically, it depends on the system, but the default should always be safe).
-
Binary Phile over 8 yearsslowdog's valid concerns notwithstanding, if you care about file contents, permissions, etc. but not modification time, you can add the --mtime option like so: tar -c /path/to/folder --mtime="1970-01-01" | sha1sum.
-
Binary Phile over 8 yearsWhile this command looks to work well for a certain use case, it doesn't seem to include what may be relevant details such as directory names as well as file permissions. I'm sure there's more than one way to skin the cat though.
-
Vatine over 8 years@BinaryPhile That is correct, but not what the question originally asked for. All directories with contents will have their names as part of the final hash, though (they're part of the file names). It would be possible to include the permissions, but would require (some) thought, as a plain "ls -l" would include date and time information that is (probably) not relevant.
-
CMCDragonkai over 8 yearsSo this doesn't capture the permissions?
-
CMCDragonkai over 8 yearsThis also doesn't capture empty directories.
-
Vatine over 8 years@CMCDragonkai No, it only captures file contents, making sure to respect file boundaries. If you also want to include permissions and empty directories, it would be possible to add something like find path/to/folder \( -type f -o -type d \) -print0 | sort -z | xargs -0 stat -c "%n %a". Let me edit the question...
-
Mark Kreyman almost 8 yearsTo account for differences in sort algorithms between my Mac and RHEL 5.x server, I had to slightly modify the command:
find ./folder -type f -print0 | xargs -0 sha1sum | sort -df | sha1sum
-
Dave C over 7 yearsI'm unsure why this doesn't have more upvotes given the simplicity of the solution. Can anyone explain why this wouldn't work well?
-
Ryota over 7 yearsI suppose this isn't ideal as the generated hash will be based on file owner, date-format setup, etc.
-
Shumoapp over 7 yearsThe ls command can be customized to output whatever you want. You can replace -l with -gG to omit the group and the owner. And you can change the date format with the --time-style option. Basically check out the ls man page and see what suits your needs.
-
Kasun Siyambalapitiya almost 7 years@S.Lott if the directory is big, zipping it and taking the md5 of it will take more time
-
Zoltan about 6 yearsThe git hash is not suitable for this purpose since file contents are only a part of its input. Even for the initial commit of a branch, the hash is affected by the commit message and the commit metadata as well, like the time of the commit. If you commit the same directory structure multiple times, you will get different hash every time, thus the resulting hash is not suitable for determining whether two directories are exact copies of each other by only sending the hash over.
-
Navin almost 6 years@DaveC Because it's pretty much useless. If you want to compare filenames, just compare them directly. They're not that big.
-
yashma almost 6 years@Navin From the question it is not clear whether it is necessary to hash file contents or detect a change in a tree. Each case has its uses. Storing 45K filenames in a kernel tree, for example, is less practical than a single hash. ls -lAgGR --block-size=1 --time-style=+%s | sha1sum works great for me
-
Bernard about 5 yearsBe careful with find. Running the script on find /some/path/dir1 -type f ... and find /someother/path/dir2 -type f ... will return different checksums even if the content of dir1 and dir2 is identical. You need to cd /some/path/dir1 before calling find . -type f ...
-
hobbs about 5 years@Zoltan the git hash is perfectly fine, if you use a tree hash and not a commit hash.
-
Zoltan about 5 years@hobbs The answer originally stated "commit hash", which is certainly not fit for this purpose. The tree hash sounds like a much better candidate, but there could still be hidden traps. One that comes to my mind is that having the executable bit set on some files changes the tree hash. You have to issue git config --local core.fileMode false before committing to avoid this. I don't know whether there are any more caveats like this.
-
thinktt over 4 yearsI'm having an issue where the xargs output, the list of hashes for my files, are not reliably coming out in the same order. Any idea why that might be happening? Could it be an issue with the sort command?
-
thinktt over 4 yearsThis seems much simpler than the accepted answer for hashing a directory. I wasn't finding the accepted answer reliable. One issue... is there a chance the hashes could come out in a different order? sha256sum /tmp/thd-agent/* | sort is what I'm trying for a reliable ordering, then just hashing that.
-
NVRM over 4 yearsHi, looks like the hashes come in alphabetical order by default. What do you mean by reliable ordering? You have to organize all that by yourself. For example using associative arrays, entry + hash. Then you sort this array by entry, which gives a list of computed hashes in the sort order. I believe you can use a json object otherwise, and hash the whole object directly.
-
Vatine over 4 years@thinktt No obvious idea why. You could try replacing xargs with echo to check that the arguments are being passed through in a consistent order. Also remember that you (probably) want to ensure you're not using any localisation for sorting.
-
thinktt over 4 yearsIf I understand you're saying it hashes the files in alphabetical order. That seems right. Something in the accepted answer above was giving me intermittent different orders sometimes, so I'm just trying to make sure that doesn't happen again. I'm going to stick with putting sort at the end. Seems to be working. Only issue with this method vs accepted answer I see is it doesn't deal with nested folders. In my case I don't have any folders so this works great.
-
NVRM over 4 yearswhat about ls -r | sha256sum?
? -
Ferit about 4 yearsCan you give a brief example to get a robust and clean sha256 of a folder, maybe for a Windows folder with three subdirectories and a few files in there each?
-
John McGehee almost 4 yearsFor many applications this approach is superior. Hashing just the source code files gets a sufficiently unique hash in a lot less time.
-
Gi0rgi0s almost 4 years@NVRM tried it and it just checked for file name changes, not the file content
-
Andrew Klaassen over 3 years@nos: With recent versions of GNU tar, sort order can be enforced with --sort=name.
-
Andrew Klaassen over 3 yearsThis is the best answer involving GNU tar, since it ensures that file contents and directory structure are consistently compared.
-
tcrafton about 3 yearsWarning: not all versions of tar have --sort :-(
-
tcrafton about 3 yearsthis needs a | sort before the last sha1sum to get consistent results (unless tqdm takes care of that? I didn't test with tqdm)
-
Sergey Lukin over 2 yearsfinally something that is consistent across environments. THANKS!!
-
FarisHijazi over 2 yearsthat's correct I just added that without seeing your commend, and now I wish I saw yours before.
-
Torsten Bronger over 2 yearsNote that even for the highly regarded rsync, comparing timestamps and file sizes is sufficient by default.
-
Ejdrien over 2 yearsThis does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review
-
Gabriel Staples about 2 yearsThis answer doesn't produce identical hashes for identical folders in different locations on your file system. That's a big shortcoming. I explain why, and present a fix to it, as well as two bash functions I wrote: sha256sum_dir and diff_dir, in my new answer here.
-
M Imam Pratama about 2 yearsUse shopt -s globstar, so we can do it recursively: sha1sum path/to/folder/** | sha1sum