What is the fastest hash algorithm to check if two files are equal?


Solution 1

One approach might be to use a simple CRC-32 checksum and, only if the CRC values compare equal, rerun the comparison with SHA-1 or something more robust. A fast CRC-32 will outperform a cryptographically secure hash any day.
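For illustration, here is a minimal sketch of that two-stage check in Python, using the standard library's zlib.crc32 as the cheap screen and hashlib.sha1 as the confirmation; the chunk size and function names are my own choices, not part of the answer:

```python
import hashlib
import zlib

def crc32_of(path, chunk_size=1 << 20):
    # Cheap first-pass checksum, computed in 1 MB chunks.
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

def sha1_of(path, chunk_size=1 << 20):
    # Stronger hash, only needed when the CRCs agree.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def probably_equal(path_a, path_b):
    # Fast fail: different CRCs mean the files definitely differ.
    if crc32_of(path_a) != crc32_of(path_b):
        return False
    # CRC-32 collides easily, so confirm with SHA-1.
    return sha1_of(path_a) == sha1_of(path_b)
```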

Solution 2

Unless you're using a really complicated and/or slow hash, loading the data from the disk is going to take much longer than computing the hash (unless you use RAM disks or top-end SSDs).

So to compare two files, use this algorithm:

  • Compare sizes
  • Compare dates (be careful here: this can give you the wrong answer; you must test whether this is the case for you or not)
  • Compare the hashes

This allows for a fast fail (if the sizes are different, you know that the files are different).

To make things even faster, you can compute the hash once and save it along with the file. Also save the file date and size into this extra file, so you know quickly when you have to recompute the hash or delete the hash file when the main file changes.
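For illustration, a rough sketch of that order in Python, with the hash cached in a sidecar file next to the original; the ".hash" suffix, the JSON layout, and the use of SHA-256 are my own assumptions, not something the answer prescribes:

```python
import hashlib
import json
import os

def file_hash(path, chunk_size=1 << 20):
    # Full-content hash, computed in 1 MB chunks.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def cached_hash(path):
    # Reuse the stored hash only while the file's size and mtime still match.
    st = os.stat(path)
    sidecar = path + ".hash"
    try:
        with open(sidecar) as f:
            cached = json.load(f)
        if cached["size"] == st.st_size and cached["mtime"] == st.st_mtime:
            return cached["hash"]
    except (OSError, ValueError, KeyError):
        pass  # sidecar missing, unreadable, or stale: recompute below
    digest = file_hash(path)
    with open(sidecar, "w") as f:
        json.dump({"size": st.st_size, "mtime": st.st_mtime, "hash": digest}, f)
    return digest

def files_equal(path_a, path_b):
    # Fast fail on size, then compare the (possibly cached) hashes.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return cached_hash(path_a) == cached_hash(path_b)
```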

Solution 3

xxHash presents itself as quite fast and strong, collision-wise:

http://cyan4973.github.io/xxHash/

There is a 64-bit variant that overall runs "even faster" on 64-bit processors than the 32-bit version does, though slower on 32-bit processors (go figure).
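As a sketch only: hashing a file with the 64-bit variant through the third-party xxhash Python binding (pip install xxhash); the package and its xxh64 API are an assumption on my part, the answer itself just points at the library:

```python
import xxhash  # third-party binding, assumed installed via `pip install xxhash`

def xxh64_of(path, chunk_size=1 << 20):
    # Stream the file through the 64-bit xxHash variant in 1 MB chunks.
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.intdigest()  # 64-bit integer digest
```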

http://code.google.com/p/crcutil is also said to be quite fast (it leverages hardware CRC instructions where present, which are probably very fast; without hardware support it isn't as fast). I don't know whether CRC32c is as good a hash (in terms of collisions) as xxHash or not...

https://code.google.com/p/cityhash/ seems similar and related to crcutil [in that it can compile down to use hardware CRC32c instructions if instructed].

If you "just want the fastest raw speed" and don't care as much about quality of random distribution of the hash output (for instance, with small sets, or where speed is paramount), there are some fast algorithms mentioned here: http://www.sanmayce.com/Fastest_Hash/ (these "not quite random" distribution type algorithms are, in some cases, "good enough" and very fast). Apparently FNV1A_Jesteress is the fastest for "long" strings, some others possibly for small strings. http://locklessinc.com/articles/fast_hash/ also seems related. I did not research to see what the collision properties of these are.

The latest hotness seems to be https://github.com/erthink/t1ha and https://github.com/wangyi-fudan/wyhash, and xxHash also has a slightly updated version.

Solution 4

What we are optimizing here is time spent on a task. Unfortunately we do not know enough about the task at hand to know what the optimal solution should be.

Is it a one-time comparison of 2 arbitrary files? Then compare sizes, and after that simply compare the files byte by byte (or MB by MB), if that's better for your IO.
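A minimal sketch of that direct comparison (size check first, then fixed-size chunks so memory use stays bounded; the 1 MB chunk size is just an example):

```python
import os

def same_contents(path_a, path_b, chunk_size=1 << 20):
    # Fast fail on size before touching the file contents.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False
            if not a:  # both files exhausted (sizes already matched)
                return True
```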

If it is for 2 large sets of files, or many sets of files, and it is not a one-time exercise but something that will happen frequently, then one should store a hash for each file. A hash is never unique, but a 32-bit hash (roughly 9 decimal digits) is good for about 4 billion combinations, and a 64-bit number is good enough to distinguish between some 1.8 * 10^19 (18 quintillion) different files.

A decent compromise would be to generate two 32-bit hashes for each file, one for the first 8 KB and another for 1 MB + 8 KB, and slap them together as a single 64-bit number. Cataloging all existing files into a DB should be fairly quick, and looking up a candidate file against this DB should also be very quick. Once there is a match, the only way to determine whether the files are the same is to compare them in full.
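A sketch of that sampled 64-bit key, reading "1 MB + 8 KB" as the 8 KB region starting at the 1 MB offset (that reading, and the use of CRC-32 for the two 32-bit halves, are my assumptions):

```python
import zlib

def sampled_key(path):
    # Two 32-bit CRCs: one over the first 8 KB, one over the 8 KB at the 1 MB mark.
    with open(path, "rb") as f:
        head = f.read(8 * 1024)
        f.seek(1024 * 1024)
        tail = f.read(8 * 1024)  # empty for files shorter than 1 MB, which is fine for a key
    crc_head = zlib.crc32(head) & 0xFFFFFFFF
    crc_tail = zlib.crc32(tail) & 0xFFFFFFFF
    return (crc_head << 32) | crc_tail  # 64-bit catalog key

# On a key match in the DB, fall back to a full byte-for-byte comparison
# to confirm the files really are identical.
```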

I am a believer in giving people what they need, which is not always what they think they need or what they want.

Solution 5

You could try MurmurHash, which was specifically designed to be fast and is pretty simple to code. You might want to add a second, more secure hash if MurmurHash reports a match, though, just to be sure.
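A minimal sketch using the third-party mmh3 MurmurHash3 binding (pip install mmh3); the package choice, reading whole files into memory, and SHA-256 as the "more secure" follow-up are all my own assumptions for illustration:

```python
import hashlib
import mmh3  # third-party MurmurHash3 binding

def murmur_equal(path_a, path_b):
    # Reading whole files at once keeps the sketch short; stream for big files.
    with open(path_a, "rb") as f:
        data_a = f.read()
    with open(path_b, "rb") as f:
        data_b = f.read()
    if mmh3.hash128(data_a) != mmh3.hash128(data_b):
        return False  # fast reject on the MurmurHash3 128-bit digest
    # Optional second, more secure hash on a match, as suggested above.
    return hashlib.sha256(data_a).digest() == hashlib.sha256(data_b).digest()
```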


Comments

  • Ashu
    Ashu almost 2 years

    What is the fastest way to create a hash function which will be used to check if two files are equal?

    Security is not very important.

Edit: I am sending a file over a network connection, and want to be sure that the files on both sides are equal

  • Ashu
    Ashu over 14 years
I am sending a file over a network connection, and want to be sure that the files on both sides are equal.
  • Greg Hewgill
    Greg Hewgill over 14 years
    Oh, well in that case just use a real hash algorithm. I guarantee your network will be slower than the hash.
  • tster
    tster over 14 years
    In such a case, use an already existing hash function. Greg, posted some good examples.
  • Steven Sudit
    Steven Sudit almost 14 years
    I'd say that hashing a file is likely to be I/O-bound anyhow, so you might as well use a hash with good distribution and a large range (certainly any crypto hash qualifies).
  • Steven Sudit
    Steven Sudit almost 14 years
    The OP stated that security was not a consideration here, so I'm not sure why a second hash would help. Instead, I'd suggest using one of the 64-bit variants of Murmur.
  • Steven Sudit
    Steven Sudit almost 14 years
    I would not recommend Adler32 for any purpose. It has terrible characteristics, particularly for short files.
  • Steven Sudit
    Steven Sudit over 13 years
    I'm going to contradict myself here: if there are just two files of equal length, you're not going to get any faster with hashes than by direct comparison. If you have a number of files and want to find candidates for equality, a hash makes sense.
  • Steven Sudit
    Steven Sudit over 13 years
    I'm going to contradict myself by suggesting that the newer 128-bit variant is better, and then contradict myself by adding that, for this use case, I'd stick with a proper crypto hash, such as SHA-256.
  • Steven Sudit
    Steven Sudit over 13 years
    There are faster algorithms that are nonetheless much better. MurmurHash3 comes to mind, but for this use case, I'd suggest that I/O speed is the limit so SHA-256 would be good.
  • Steven Sudit
    Steven Sudit over 13 years
    (Also, please use the comment option instead of editing your remark, else I'll only know about your response if I get lucky.)
  • Steven Sudit
    Steven Sudit over 13 years
    I've implemented a working solution that uses alternate data streams under NTFS to store hashes. One thing I had to do, however, was to timestamp the hash so that I could tell if the file had been modified since it had last been hashed.
  • rogerdpack
    rogerdpack about 12 years
    +1 for CRC, since the OP asked for "fastest". Of course, then he asked for "making sure the files are the same" which contradicts itself LOL.
  • rogerdpack
    rogerdpack about 12 years
apparently Adler32 is "bad for numbers" (strchr.com/hash_functions) but CRC32 is OK, at least distribution-wise.
  • rogerdpack
    rogerdpack about 12 years
cbloomrants.blogspot.com/2010/08/08-21-10-adler32.html and strchr.com/hash_functions seem to imply that MurmurHash is faster, if only slightly, than Adler/CRC32. It may all depend on the implementation; for instance this SSE version says it is a "fast" CRC-like hash: cessu.blogspot.com/2008/11/…
  • Bigue Nique
    Bigue Nique about 10 years
rsync is actually using a "rolling checksum" version of the Adler32 algorithm, according to Wikipedia: en.wikipedia.org/wiki/Adler-32
  • OneOfOne
    OneOfOne almost 10 years
    @rogerdpack crc isn't close to fastest hash, even with asm.
  • rogerdpack
    rogerdpack almost 10 years
@OneOfOne true, I believe I didn't realize that at the time. These days I recommend xxhash or cityhash; see my other answer here: stackoverflow.com/a/11422479/32453 [apparently with crc32c it can compile down to a CPU instruction, which is very fast... though that's not what I was referring to initially here, I don't think, so your comment is right]
  • Flimzy
    Flimzy over 8 years
    If you're comparing files over a network (as the OP is), then reading each file amounts to re-transmitting the file over the network a second time. So using some sort of hashing probably makes sense. But I would agree with using a good hashing algorithm the first time, rather than doing a preliminary CRC32 followed by something else.
  • Abhi Beckert
    Abhi Beckert almost 7 years
    @StevenSudit it's not IO bound on a fast SSD. I've got a test file where md5 takes a minute but my SSD can read the file in just 25 seconds. And my SSD is a few years old, you can get faster ones now.
  • Abhi Beckert
    Abhi Beckert almost 7 years
    Fast disks today can read at 2.5GB per second. Hashes are nowhere near that fast in my experience.
  • Aaron Digulla
    Aaron Digulla almost 7 years
    @AbhiBeckert My argument is: If you have the hashes computed, you don't need to load the whole data set. Also my first sentence is "Unless you're using a really complicated and/or slow hash", isn't it?
  • Abhi Beckert
    Abhi Beckert almost 7 years
    @AaronDigulla in my case, I'm wanting to check if the contents of a large list of files still match their previously calculated hash, so it needs to be re-calculated. Using sha1 and a fast SSD and a large list of files, hash calculation is pinning all my CPU cores at 100% for an hour or two, causing fans to spin up to maximum speed and clock speed to be throttled to prevent overheating and so on and so on. I came here to find a more efficient hash. I don't think sha1 is complicated or slow as far as strong hashes go, although "really" is a relative term. I tried MD5 with similar results.
  • Aaron Digulla
    Aaron Digulla almost 7 years
    @AbhiBeckert I see. SHA and MD were designed with crypto in mind (security is more important than speed). This questions might help: softwareengineering.stackexchange.com/questions/49550/…
  • Abhi Beckert
    Abhi Beckert almost 7 years
    @AaronDigulla I'm aware of that. My point is that you should clarify your answer since it's currently wrong. Almost all hashes are designed for crypto and they can all be significantly slower than just reading the contents from disk and doing a direct comparison of every byte. Instead of "Unless you're using a really complicated/slow hash ..." your answer should say "if you use X, Y or Z fast hash ...". Also, many of those faster hash functions have unacceptably high collision risks for any file large enough to care about speed. Perhaps add a "compare contents" step where the hashes match.
  • Aaron Digulla
    Aaron Digulla almost 7 years
    @AbhiBeckert When I wrote that answer, SSDs were rare and not that fast. Typically, your CPU cache was 100-1000 times faster than main memory and memory was 1000 times faster than loading data from hard disk. This has changed. That said, your use case is different from this question. Sending the data over the network is very, very slow. Even a very slow hash will outrun gigabit ethernet hands down. I suggest to ask a new question.
  • Ben Personick
    Ben Personick about 5 years
    "There is a 64 bit variant that runs "even faster" on 64 bit processors than the 32, overall, though slower on 32-bit processors (go figure)." - okay, I figure the 64bit code is optimized for 64bit processors and is using 64bit long integers for chunking the hashing mechanism.
  • hmijail
    hmijail over 4 years
    Even if just comparing locally, if the only result needed is "equal" / "not equal", it probably still makes sense to hash, because that allows the drive/OS to read the file as fast as possible, instead of alternating chunks between 2 files.
  • warren
    warren over 2 years
    @Flimzy - OP seems to be describing running the hash locally on each side of the connection, which would not amount "to re-transmitting the file over the network a second time" :)
  • warren
    warren over 2 years
@BenPersonick - it would make sense that a 64-bit version would run slower, all other things being equal, on a 32-bit processor than on a 64-bit one... the 32-bit processor will have to bust the 64-bit block size into two pieces instead of running it all at once :)
  • Flimzy
    Flimzy over 2 years
@warren: I believe that 5 years ago when I wrote that, I was responding to the immediately preceding comment, which suggested that the fastest approach is simply to read each file and do a direct comparison. I believe the point I was trying to make was that, due to the cost of network latency, a hash is probably more time efficient.
  • Ben Personick
    Ben Personick over 2 years
@warren Exactly right, that would be the case if it were possible on a 32-bit CPU, but you can't run 64-bit code on a 32-bit CPU. I believe he means that running the 64-bit code on a 64-bit CPU is faster than running a 32-bit version of the program on a 64-bit CPU. That's to be expected, as this is a data-crunching program: using the larger native 64-bit variables allows quicker action by manipulating 64-bit chunks of data instead of double the number of 32-bit chunks. :)
  • warren
    warren over 2 years
@BenPersonick - you can run 256-bit algorithms on a 64-bit processor (e.g. SHA256). It's certainly possible to run 64-bit algorithms on a 32-bit processor (MD5's been around for a lot longer than consumer-grade 64-bit CPUs, and it's a 128-bit algorithm). It makes sense that running a "natively-sized" algorithm is going to be faster than one that's not natively-sized :)
  • Mark
    Mark over 2 years
    xxh128sum is so much faster for file comparison than using a cryptographic hash function like sha1... but I think Aaron implies that in his answer, it's just not explicit. However, for less experienced users, most pre-installed hash command line tools will be optimised for cryptographic security and not speed. xxHash is available in many repositories including Ubuntu (sudo apt install xxhash) and OpenBSD (doas pkg_add -U xxhash)