Linux: compute a single hash for a given folder & contents?


Solution 1

One possible way would be:

sha1sum path/to/folder/* | sha1sum

If there is a whole directory tree, you're probably better off using find and xargs. One possible command would be

find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum

And, finally, if you also need to take account of permissions and empty directories:

(find path/to/folder -type f -print0  | sort -z | xargs -0 sha1sum;
 find path/to/folder \( -type f -o -type d \) -print0 | sort -z | \
   xargs -0 stat -c '%n %a') \
| sha1sum

The arguments to stat will cause it to print the name of each file, followed by its octal permissions. The two finds run one after the other, causing roughly double the disk IO: the first finds all file names and checksums their contents, the second finds all file and directory names and prints each name and mode. The list of "file names and checksums", followed by "file and directory names, with permissions", is then checksummed, producing a single smaller checksum.
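Pulling in suggestions from the comments below, here is a minimal sketch of how the basic pipeline might be wrapped up for reuse. It changes into the target directory first so the resulting hash does not depend on the absolute path, and pins the locale so the sort order is reproducible across machines. The function name hash_dir is just an illustrative choice, and GNU coreutils (sha1sum, sort -z, xargs -0) are assumed:

# Sketch only: hash a directory's file contents, independent of its absolute path.
# "hash_dir" is a hypothetical helper name; assumes GNU coreutils.
hash_dir() {
    (
        cd "$1" || exit 1
        export LC_ALL=POSIX    # locale-independent sort order
        find . -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
    )
}

# Example: hash_dir path/to/folder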

Solution 2

  • Use a file system intrusion detection tool like aide.

  • hash a tar ball of the directory:

    tar cvf - /path/to/folder | sha1sum

  • Code something yourself, like vatine's oneliner:

    find /path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
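As a usage sketch for the one-liner above (dir-a and dir-b are hypothetical directory names), comparing two trees then reduces to comparing two one-line hashes; cd'ing into each tree first keeps absolute paths out of the result:

# Hypothetical example: compare two directory trees by their content hashes.
hash_a=$( (cd dir-a && find . -type f -print0 | sort -z | xargs -0 sha1sum) | sha1sum )
hash_b=$( (cd dir-b && find . -type f -print0 | sort -z | xargs -0 sha1sum) | sha1sum )
[ "$hash_a" = "$hash_b" ] && echo "identical contents" || echo "trees differ"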

Solution 3

If you just want to check if something in the folder changed, I'd recommend this one:

ls -alR --full-time /folder/of/stuff | sha1sum

It just gives you a hash of the ls output, which contains folders, sub-folders, their files, and their timestamps, sizes and permissions. Pretty much everything you would need to determine whether something has changed.

Please note that this command will not generate a hash for each file, which is why it should be faster than using find.
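For a simple change-detection workflow, a sketch along these lines might be used: compute the hash, compare it with the one stored on the previous run, and update the stored copy. The file name .folder.sha1 is a hypothetical choice:

# Sketch: has anything under /folder/of/stuff changed since the last run?
# ".folder.sha1" is a hypothetical location for the previously stored hash.
current=$(ls -alR --full-time /folder/of/stuff | sha1sum)
if [ -f .folder.sha1 ] && [ "$current" = "$(cat .folder.sha1)" ]; then
    echo "no changes detected"
else
    echo "folder changed (or first run)"
    printf '%s\n' "$current" > .folder.sha1
fi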

Solution 4

You can do tar -c /path/to/folder | sha1sum

Solution 5

So far the fastest way to do it is still with tar, and with several additional parameters we can also get rid of the differences caused by metadata.

To use tar to hash a directory, you need to make sure the paths are sorted during archiving; otherwise the output is always different.

tar -C <root-dir> -cf - --sort=name <dir> | sha256sum

ignore time

If you do not care about access or modification times, add something like --mtime='UTC 2019-01-01' to make sure all timestamps are the same.

ignore ownership

Usually we also need to add --group=0 --owner=0 --numeric-owner to normalize the ownership metadata.

ignore some files

Use --exclude=PATTERN to skip files you do not want included in the hash.
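Putting these options together, a sketch of a fully pinned invocation might look like the following (it assumes GNU tar 1.28 or newer for --sort=name; <root-dir> and <dir> are placeholders as above, and the '*.log' exclude pattern is only an example):

tar -C <root-dir> -cf - \
    --sort=name \
    --mtime='UTC 2019-01-01' \
    --group=0 --owner=0 --numeric-owner \
    --exclude='*.log' \
    <dir> | sha256sum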

Updated on July 21, 2022

Comments

  • Ben L
    Ben L almost 2 years

    Surely there must be a way to do this easily!

    I've tried the Linux command-line apps such as sha1sum and md5sum but they seem only to be able to compute hashes of individual files and output a list of hash values, one for each file.

    I need to generate a single hash for the entire contents of a folder (not just the filenames).

    I'd like to do something like

    sha1sum /folder/of/stuff > singlehashvalue
    

    Edit: to clarify, my files are at multiple levels in a directory tree, they're not all sitting in the same root folder.

  • leo7r
    leo7r over 15 years
    If you sort after the first sha1sum, then a LF in a filename should do no harm.
  • Aaron Digulla
    Aaron Digulla over 15 years
    Edited. Sort can work on 0 delimited lists with the -z option.
  • David Schmitt
    David Schmitt over 15 years
    and don't forget to set LC_ALL=POSIX, so the various tools create locale independent output.
  • slowdog
    slowdog over 13 years
    If you want to replicate that checksum on a different machine, tar might not be a good choice, as the format seems to have room for ambiguity and exist in many versions, so the tar on another machine might produce different output from the same files.
  • Bruno Bronosky
    Bruno Bronosky about 13 years
    I found cat | sha1sum to be considerably faster than sha1sum | sha1sum. YMMV, try each of these on your system: time find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum; time find path/to/folder -type f -print0 | sort -z | xargs -0 cat | sha1sum
  • mivk
    mivk about 12 years
    for F in 'find ...' ... doesn't work when you have spaces in names (which you always do nowadays).
  • Vatine
    Vatine over 11 years
    @RichardBronosky - Let us assume we have two files, A and B. A contains "foo" and B contains "bar was here". With your method, we would not be able to separate that from two files C and D, where C contains "foobar" and D contains " was here". By hashing each file individually and then hash all "filename hash" pairs, we can see the difference.
  • Bruno Bronosky
    Bruno Bronosky over 11 years
    +1 for the tar solution. That is the fastest, but drop the v. verbosity only slows it down.
  • robbles
    robbles over 11 years
    To make this work irrespective of the directory path (i.e. when you want to compare the hashes of two different folders), you need to use a relative path and change to the appropriate directory, because the paths are included in the final hash: find ./folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
  • Vatine
    Vatine over 11 years
    @robbles That is correct and why I did not put an initial / on the path/to/folder bit.
  • hopla
    hopla over 11 years
    You could also have your hashtool print out only the hashes, on FreeBSD for example: xargs -0 sha256 -q (Also, in your anwser, you might want to draw attention to the fact that (absolute) filenames are printed out with the hashes)
  • Vatine
    Vatine over 11 years
    @hopla Relativified paths throughout instead of just in the final example.
  • nos
    nos over 11 years
note that the tar solution assumes the files are in the same order when you compare them. Whether they are would depend on the file system the files reside in when doing the comparison.
  • hopla
    hopla over 11 years
Much clearer :) I've also been thinking that using relative paths is better than the -q option, because then all the file names are taken into account in the final hash as well, avoiding problems should a hash collision ever occur.
  • Mamoun Benghezal
    Mamoun Benghezal almost 9 years
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
  • Vatine
    Vatine over 8 years
@JasonS Define "large"? You're looking at roughly linear run-time in pure data volume (consumed by sha1sum or equivalent hashing). You're looking at (roughly) linear performance from find. Sorting is probably O(n log n), with "number of files" as n. Until growth in "log n" starts being significant, time will be dominated by the disk bandwidth. Waving a hand vaguely in the air, I'd say you'd be OK for "tens to hundreds of thousands of files". At some point, the list of hashes-per-file to sort may require spilling to disk, so there's going to be a vicious cliff in the time complexity curve.
  • Jason S
    Jason S over 8 years
    no, I'm worried about the large command-line; xargs makes a single call to sha1sum, right? is there a limit in command-line size?
  • Vatine
    Vatine over 8 years
@JasonS Ah, no, the reason for xargs is that it intelligently splits the incoming stream of "filenames to hash" from find into suitable chunks (it defaults to something fairly low; basically, it depends on the system, but the default should always be safe).
  • Binary Phile
    Binary Phile over 8 years
    slowdog's valid concerns notwithstanding, if you care about file contents, permissions, etc. but not modification time, you can add the --mtime option like so: tar -c /path/to/folder --mtime="1970-01-01" | sha1sum.
  • Binary Phile
    Binary Phile over 8 years
    While this command looks to work well for a certain use case, it doesn't seem to include what may be relevant details such as directory names as well as file permissions. I'm sure there's more than one way to skin the cat though.
  • Vatine
    Vatine over 8 years
    @BinaryPhile That is correct, but not what the question originally asked for. All directories with contents will have their names as part of the final hash, though (they're part of the file names). It would be possible to include the permissions, but would require (some) thought, as a plain "ls -l" would include date and time information that is (probably) not relevant.
  • CMCDragonkai
    CMCDragonkai over 8 years
    So this doesn't capture the permissions?
  • CMCDragonkai
    CMCDragonkai over 8 years
    This also doesn't capture empty directories.
  • Vatine
    Vatine over 8 years
@CMCDragonkai No, it only captures file contents, making sure to respect file boundaries. If you also want to include permissions and empty directories, it would be possible to add something like find path/to/folder \( -type f -o -type d \) -print0 | sort -z | xargs -0 stat -c "%n %a". Let me edit the question...
  • Mark Kreyman
    Mark Kreyman almost 8 years
    To account for differences in sort algorithms between my Mac and RHEL 5.x server, I had to slightly modify the command: find ./folder -type f -print0 | xargs -0 sha1sum | sort -df | sha1sum
  • Dave C
    Dave C over 7 years
    I'm unsure why this doesn't have more upvotes given the simplicity of the solution. Can anyone explain why this wouldn't work well?
  • Ryota
    Ryota over 7 years
    I suppose this isn't ideal as the generated hash will be based on file owner, date-format setup, etc.
  • Shumoapp
    Shumoapp over 7 years
    The ls command can be customized to output whatever you want. You can replace -l with -gG to omit the group and the owner. And you can change the date format with the --time-style option. Basically check out the ls man page and see what suits your needs.
  • Kasun Siyambalapitiya
    Kasun Siyambalapitiya almost 7 years
@S.Lott if the directory is very big, zipping it and getting an md5 of it will take more time
  • Zoltan
    Zoltan about 6 years
The git hash is not suitable for this purpose since file contents are only a part of its input. Even for the initial commit of a branch, the hash is affected by the commit message and the commit metadata as well, like the time of the commit. If you commit the same directory structure multiple times, you will get a different hash every time, thus the resulting hash is not suitable for determining whether two directories are exact copies of each other by only sending the hash over.
  • Navin
    Navin almost 6 years
    @DaveC Because it's pretty much useless. If you want to compare filenames, just compare them directly. They're not that big.
  • yashma
    yashma almost 6 years
    @Navin From the question it is not clear whether it is necessary to hash file contents or detect a change in a tree. Each case has its uses. Storing 45K filenames in a kernel tree, for example, is less practical than a single hash. ls -lAgGR --block-size=1 --time-style=+%s | sha1sum works great for me
  • Bernard
    Bernard about 5 years
    Be careful with find. Running the script on find /some/path/dir1 -type f ... and find /someother/path/dir2 -type f ... will return different checksums even if the content of dir1 and dir2 is identical. You need to cd /some/path/dir1 before calling find . -type f ...
  • hobbs
    hobbs about 5 years
    @Zoltan the git hash is perfectly fine, if you use a tree hash and not a commit hash.
  • Zoltan
    Zoltan about 5 years
    @hobbs The answer originally stated "commit hash", which is certainly not fit for this purpose. The tree hash sounds like a much better candidate, but there could still be hidden traps. One that comes to my mind is that having the executable bit set on some files changes the tree hash. You have to issue git config --local core.fileMode false before committing to avoid this. I don't know whether there are any more caveats like this.
  • thinktt
    thinktt over 4 years
I'm having an issue where the xargs output, the list of hashes for my files, is not reliably coming out in the same order. Any idea why that might be happening? Could it be an issue with the sort command?
  • thinktt
    thinktt over 4 years
This seems much simpler than the accepted answer for hashing a directory. I wasn't finding the accepted answer reliable. One issue... is there a chance the hashes could come out in a different order? sha256sum /tmp/thd-agent/* | sort is what I'm trying for a reliable ordering, then just hashing that.
  • NVRM
    NVRM over 4 years
Hi, it looks like the hashes come in alphabetical order by default. What do you mean by reliable ordering? You have to organize all that by yourself. For example using associative arrays, entry + hash. Then you sort this array by entry, which gives a list of computed hashes in the sort order. I believe you can use a JSON object otherwise, and hash the whole object directly.
  • Vatine
    Vatine over 4 years
    @thinktt No obvious idea why. You could try replacing xargs with echo to check that the arguments are being passed through in a consistent order. Also remember that you (probably) want to ensure you're not using any localisation for sorting.
  • thinktt
    thinktt over 4 years
If I understand correctly, you're saying it hashes the files in alphabetical order. That seems right. Something in the accepted answer above was giving me intermittently different orders, so I'm just trying to make sure that doesn't happen again. I'm going to stick with putting sort at the end. It seems to be working. The only issue I see with this method versus the accepted answer is that it doesn't deal with nested folders. In my case I don't have any folders, so this works great.
  • NVRM
    NVRM over 4 years
    what about ls -r | sha256sum ?
  • Ferit
    Ferit about 4 years
    Can you give a brief example to get a robust and clean sha256 of a folder, maybe for a Windows folder with three subdirectories and a few files in there each?
  • John McGehee
    John McGehee almost 4 years
    For many applications this approach is superior. Hashing just the source code files gets a sufficiently unique hash in a lot less time.
  • Gi0rgi0s
    Gi0rgi0s almost 4 years
    @NVRM tried it and it just checked for file name changes, not the file content
  • Andrew Klaassen
    Andrew Klaassen over 3 years
    @nos: With recent versions of GNU tar, sort order can be enforced with --sort=name.
  • Andrew Klaassen
    Andrew Klaassen over 3 years
    This is the best answer involving GNU tar, since it ensures that file contents and directory structure are consistently compared.
  • tcrafton
    tcrafton about 3 years
    Warning: not all versions of tar have --sort :-(
  • tcrafton
    tcrafton about 3 years
    this needs a | sort before the last sha1sum to get consistent results (unless tqdm takes care of that? I didn't test with tqdm)
  • Sergey Lukin
    Sergey Lukin over 2 years
    finally something that is consistent across environments. THANKS!!
  • FarisHijazi
    FarisHijazi over 2 years
That's correct. I just added that without seeing your comment, and now I wish I had seen yours before.
  • Torsten Bronger
    Torsten Bronger over 2 years
    Note that even for the highly regarded rsync, comparing timestamps and file sizes is sufficient by default.
  • Ejdrien
    Ejdrien over 2 years
    This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review
  • Gabriel Staples
    Gabriel Staples about 2 years
    This answer doesn't produce identical hashes for identical folders in different locations on your file system. That's a big short-coming. I explain why, and present a fix to it, as well as two bash functions I wrote: sha256sum_dir and diff_dir, in my new answer here.
  • M Imam Pratama
    M Imam Pratama about 2 years
    Use shopt -s globstar, so we can do it recursively: sha1sum path/to/folder/** | sha1sum