Can CRC32 be used as a hash function?

48,911

Solution 1

CRC32 works very well as a hash algorithm. The whole point of a CRC is to hash a stream of bytes with as few collisions as possible. That said, there are a few points to consider:

  • CRC's are not secure. For secure hashing you need a much more computationally expensive algorithm.

  • Different CRC flavors exist with different properties. Make sure you use the right algorithm, e.g. with hash polynomial 0x11EDC6F41 (CRC32C) which is the optimal general purpose choice.

  • As a hashing speed/quality trade-off, the x86 CRC32 instruction is tough to beat. However, this instruction doesn't exist in older CPU's so beware of portability problems.

---- EDIT ----

Mark Adler provided a link to a useful article for hash evaluation by Bret Mulvey. Using the source code provided in the article, I ran the "bucket test" for both CRC32C and Jenkins96. These tables show the probability that a truly uniform distribution would be worse than the measured result by chance alone. So, higher numbers are better. The author considered 0.05 or lower to be weak and 0.01 or lower to be very weak. I'm entirely trusting the author on all this and am just reporting results.

I placed an * by all the instances where CRC32C performed better than Jenkins96. By this simple tally, CRC32C was a more uniform hash than Jenkins96 54 of 96 times. Especially if you can use the x86 CRC32 instruction, the speed quality trade-off is excellent.

CRC32C (0x1EDC6F41)

       Uniform keys        Text keys         Sparse keys

Bits  Lower    Upper     Lower    Upper     Lower    Upper
 1    0.671   *0.671    *1.000    0.120    *0.572   *0.572
 2   *0.706   *0.165    *0.729   *0.919     0.277    0.440
 3   *0.878   *0.879    *0.556    0.362    *0.535   *0.542
 4    0.573    0.332     0.433    0.462    *0.855    0.393
 5    0.023   *0.681     0.470    0.907     0.266    0.059
 6   *0.145   *0.523     0.354   *0.172    *0.336    0.588
 7    0.424    0.722     0.172   *0.736     0.184   *0.842
 8   *0.767    0.507    *0.533    0.437     0.337    0.321
 9    0.480    0.725    *0.753   *0.807    *0.618    0.025
10   *0.719    0.161    *0.970   *0.740    *0.789    0.344
11   *0.610    0.225    *0.849   *0.814    *0.854   *0.003
12   *0.979   *0.239    *0.709    0.786     0.171   *0.865
13   *0.515    0.395     0.192    0.600     0.869   *0.238
14    0.089   *0.609     0.055   *0.414    *0.286   *0.398
15   *0.372   *0.719    *0.944    0.100    *0.852   *0.300
16    0.015   *0.946    *0.467    0.459     0.372   *0.793

And for Jenkins96, which the author of article considered to be an excellent hash:

Jenkins96

      Uniform keys         Text keys         Sparse keys

Bits  Lower    Upper     Lower    Upper     Lower    Upper
 1    0.888    0.572     0.090    0.322     0.090    0.203
 2    0.198    0.027     0.505    0.447     0.729    0.825
 3    0.444    0.510     0.360    0.444     0.467    0.540
 4    0.974    0.783     0.724    0.971     0.439    0.902
 5    0.308    0.383     0.686    0.940     0.424    0.119
 6    0.138    0.505     0.907    0.103     0.300    0.891
 7    0.710    0.956     0.202    0.407     0.792    0.506
 8    0.031    0.552     0.229    0.573     0.407    0.688
 9    0.682    0.990     0.276    0.075     0.269    0.543
10    0.382    0.933     0.038    0.559     0.746    0.511
11    0.043    0.918     0.101    0.290     0.584    0.822
12    0.895    0.036     0.207    0.966     0.486    0.533
13    0.290    0.872     0.902    0.934     0.877    0.155
14    0.859    0.568     0.428    0.027     0.136    0.265
15    0.290    0.420     0.915    0.465     0.532    0.059
16    0.155    0.922     0.036    0.577     0.545    0.336

---- EDIT ---- Fixed out-of-date links and minor cleanup.

Solution 2

I don't know why Mark Adler said that "crc32 poorly distributes the input bits to hash". There is no single bit in the crc32 hash that is exactly equal to the input bits. Any bit of the hash is a linear combination of the input bits. Secondly, crc always evenly map the same number of different of input sequences to a given hash value. For example, if you have 1000 bits long message, after crc32, you can always find 2^(1000-32) sequences that produce a given hash value, no more, no less.

If you do not need the security feature, crc can serve as hash perfectly.

Actually, I think other non-secure hash functions may be simpler than crc, if you need to have a longer crc, for example crc-256.

Share:
48,911
Pradyot
Author by

Pradyot

Updated on July 09, 2022

Comments

  • Pradyot
    Pradyot almost 2 years

    Can CRC32 be used as a hash function? Any drawbacks to this approach? Any tradedeoffs?

  • Mark Adler
    Mark Adler about 12 years
    No, CRC does not avoid collisions as well as other algorithms. See home.comcast.net/~bretm/hash .
  • srking
    srking about 12 years
    @Mark, The author did not using the CRC32C polynomial. CRC32C works just fine as a hash for bucketizing strings of bytes in his test program.
  • Mark Adler
    Mark Adler about 12 years
    Good research! +1. However I still don't think that even with a crc32 instruction, it will beat hash algorithms designed for the purpose of (non-cryptographic) hashing. You can find some more advanced hash algorithm development and testing here: code.google.com/p/smhasher .
  • srking
    srking about 12 years
    @Mark, thanks for another good link. I don't know about the hash quality, but the author quotes ~29 clocks per 16 bytes for murmerhash3, vs. CRC32C instruction which is ~6 clocks per 16 bytes. I'm sticking to my story that CRC32C instruction is the sweet spot.
  • Mark Adler
    Mark Adler about 12 years
    That is darned fast, and may work just fine, depending on the application for the hash. All CRCs, regardless of the polynomial, dramatically fail the avalanche hash test. See this page from the first link I provided: home.comcast.net/~bretm/hash/8.html .
  • Nico Erfurth
    Nico Erfurth almost 10 years
    Just as a sidenote, Bret Mulvey moved that site some months ago to: bretmulvey.com/hash
  • RubenLaguna
    RubenLaguna about 9 years
    the Intel CRC32 instruction has a latency of 3 and throughput of 1 and can take 8 bytes at time (if I understood right the documentation). That means that you can get 8 bytes per cycle (if you have sufficiently long input). For smaller outputs 8 bytes = 4 cycles (3 + 1) , 16 bytes = 5 cycles (3 +1), 24 bytes = 6 (3+3), and so on
  • RubenLaguna
    RubenLaguna about 9 years
    after researching a little bit more the 3 cycles latency means that if the next CRC32 instruction depends on the output of the first then it has to wait. That's why you often see that the input is splitted into 3 chunks and all the chunk processed in parallel, so that those 3 instruction don't depend on each other and therefore the pipeline can be kept full.
  • Max Galkin
    Max Galkin about 8 years
    SMHasher now has a fork on Github: github.com/rurban/smhasher Interestingly, CRC is mentioned in the section with problematic hash functions with this note: "insecure, 100% bias, collisions, distrib". I'm not sure what that means :)
  • i336_
    i336_ about 7 years
    @NicoErfurth: Thanks from the future, the Web Archive didn't snag the page before it disappeared.
  • bryc
    bryc over 6 years
    I believe he said that because CRC fails statistical randomness tests - uniformly distributed across the code range, no bias toward certain bits.
  • Mark Adler
    Mark Adler over 5 years
    Still no. Both CRC-32 and CRC-32C fail the avalanche test dramatically.
  • Lorenz
    Lorenz almost 4 years
    The page has moved again and low lives at papa.bretmulvey.com/post/124027987928/hash-functions
  • abel1502
    abel1502 about 3 years
    I can support that crc32 (including c) is rather bad in terms of collisions: I have just done some tests of my hashtable on a 150k word english dictionary and 300k capacity, and crc32 gave on average 1140 collisions per hash, whereas fnv-1a only gave 1.5
  • Gabriel Staples
    Gabriel Staples over 2 years
    @srking, please check my edit and add/update the links.
  • srking
    srking over 2 years
    @GabrielStaples Done, I think.