Why is XOR the default way to combine hashes?

67,322

Solution 1

Assuming uniformly random (1-bit) inputs, the AND function output probability distribution is 75% 0 and 25% 1. Conversely, OR is 25% 0 and 75% 1.

The XOR function is 50% 0 and 50% 1, therefore it is good for combining uniform probability distributions.

This can be seen by writing out truth tables:

 a | b | a AND b
---+---+--------
 0 | 0 |    0
 0 | 1 |    0
 1 | 0 |    0
 1 | 1 |    1

 a | b | a OR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    1

 a | b | a XOR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    0

Exercise: How many logical functions of two 1-bit inputs a and b have this uniform output distribution? Why is XOR the most suitable for the purpose stated in your question?

Solution 2

xor is a dangerous default function to use when hashing. It is better than and and or, but that doesn't say much.

xor is symmetric, so the order of the elements is lost. So "bad" will hash combine the same as "dab".

xor maps pairwise identical values to zero, and you should avoid mapping "common" values to zero:

So (a,a) gets mapped to 0, and (b,b) also gets mapped to 0. As such pairs are almost always more common than randomness might imply, you end up with far to many collisions at zero than you should.

With these two problems, xor ends up being a hash combiner that looks half decent on the surface, but not after further inspection.

On modern hardware, adding usually about as fast as xor (it probably uses more power to pull this off, admittedly). Adding's truth table is similar to xor on the bit in question, but it also sends a bit to the next bit over when both values are 1. This means it erases less information.

So hash(a) + hash(b) is better than hash(a) xor hash(b) in that if a==b, the result is hash(a)<<1 instead of 0.

This remains symmetric; so the "bad" and "dab" getting the same result remains a problem. We can break this symmetry for a modest cost:

hash(a)<<1 + hash(a) + hash(b)

aka hash(a)*3 + hash(b). (calculating hash(a) once and storing is advised if you use the shift solution). Any odd constant instead of 3 will bijectively map a "k-bit" unsigned integer to itself, as map on unsigned integers is math modulo 2^k for some k, and any odd constant is relatively prime to 2^k.

For an even fancier version, we can examine boost::hash_combine, which is effectively:

size_t hash_combine( size_t lhs, size_t rhs ) {
  lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
  return lhs;
}

here we add together some shifted versions of lhs with a constant (which is basically random 0s and 1s – in particular it is the inverse of the golden ratio as a 32 bit fixed point fraction) with some addition and an xor. This breaks symmetry, and introduces some "noise" if the incoming hashed values are poor (ie, imagine every component hashes to 0 – the above handles it well, generating a smear of 1 and 0s after each combine. My naive 3*hash(a)+hash(b) simply outputs a 0 in that case).

(For those not familiar with C/C++, a size_t is an unsigned integer value which is big enough to describe the size of any object in memory. On a 64 bit system, it is usually a 64 bit unsigned integer. On a 32 bit system, a 32 bit unsigned integer.)

Solution 3

In spite of its handy bit-mixing properties, XOR is not a good way to combine hashes due to its commutativity. Consider what would happen if you stored the permutations of {1, 2, …, 10} in a hash table of 10-tuples.

A much better choice is m * H(A) + H(B), where m is a large odd number.

Credit: The above combiner was a tip from Bob Jenkins.

Solution 4

Xor may be the "default" way to combine hashes but Greg Hewgill's answer also shows why it has its pitfalls: The xor of two identical hash values is zero. In real life, there are identical hashes are more common than one might have expected. You might then find that in these (not so infrequent) corner cases, the resulting combined hashes are always the same (zero). Hash collisions would be much, much more frequent than you expect.

In a contrived example, you might be combining hashed passwords of users from different websites you manage. Unfortunately, a large number of users reuse their passwords, and a surprising proportion of the resulting hashes are zero!

Solution 5

There's something I want to explicitly point out for others who find this page. AND and OR restrict output like BlueRaja - Danny Pflughoe is trying to point out, but can be better defined:

First I want to define two simple functions I'll use to explain this: Min() and Max().

Min(A, B) will return the value that is smaller between A and B, for example: Min(1, 5) returns 1.

Max(A, B) will return the value that is larger between A and B, for example: Max(1, 5) returns 5.

If you are given: C = A AND B

Then you can find that C <= Min(A, B) We know this because there is nothing you can AND with the 0 bits of A or B to make them 1s. So every zero bit stays a zero bit and every one bit has a chance to become a zero bit (and thus a smaller value).

With: C = A OR B

The opposite is true: C >= Max(A, B) With this, we see the corollary to the AND function. Any bit that is already a one cannot be ORed into being a zero, so it stays a one, but every zero bit has a chance to become a one, and thus a larger number.

This implies that the state of the input applies restrictions on the output. If you AND anything with 90, you know the output will be equal to or less than 90 regardless what the other value is.

For XOR, there is no implied restriction based on the inputs. There are special cases where you can find that if you XOR a byte with 255 than you get the inverse but any possible byte can be output from that. Every bit has a chance to change state depending on the same bit in the other operand.

Share:
67,322

Related videos on Youtube

Nate Murray
Author by

Nate Murray

Updated on March 27, 2021

Comments

  • Nate Murray
    Nate Murray over 3 years

    Say you have two hashes H(A) and H(B) and you want to combine them. I've read that a good way to combine two hashes is to XOR them, e.g. XOR( H(A), H(B) ).

    The best explanation I've found is touched briefly here on these hash function guidelines:

    XORing two numbers with roughly random distribution results in another number still with roughly random distribution*, but which now depends on the two values.
    ...
    * At each bit of the two numbers to combine, a 0 is output if the two bits are equal, else a 1. In other words, in 50% of the combinations, a 1 will be output. So if the two input bits each have a roughly 50-50 chance of being 0 or 1, then so too will the output bit.

    Can you explain the intuition and/or mathematics behind why XOR should be the default operation for combining hash functions (rather than OR or AND etc.)?

    • Massa
      Massa about 13 years
      I think you just did ;)
    • Thomas Pornin
      Thomas Pornin about 13 years
      note that XOR may or may not be a "good" way to "combine" hashes, depending on what you want in a "combination". XOR is commutative: XOR(H(A),H(B)) is equal to XOR(H(B),H(A)). This means that XOR is not a proper way to create a kind of hash of an ordered sequence of values, since it does not capture the order.
    • Andrei Galatyn
      Andrei Galatyn about 8 years
      Besides the issue with order (comment above), there is problem with equal values. XOR(H(1), H(1))=0 (for any function H), XOR(H(2),H(2))=0 and so on. For any N: XOR(H(N),H(N))=0. Equal values happens quite often in real apps, it means result of XOR will be 0 too often to be considered as good hash.
    • Alexis
      Alexis about 7 years
      What do you use for ordered sequence of values ? Let's say I'd like to create a hash of timestamp or index. (MSB less important than LSB). Sorry if this thread is 1year old.
    • aelveborn
      aelveborn about 7 years
    • Dan Stahlke
      Dan Stahlke over 5 years
      A word of warning: don't use XOR to combine CRC values because CRC is a linear function in the sense that CRC(a) ^ CRC(b) = CRC(a ^ b). Additionally, two equal elements will cancel out. I think summing CRC values (with addition) is okay if you want a hash of an unordered list, but I'm not 100% on that.
  • Massa
    Massa about 13 years
    answering to the exercise: from the 16 possible different a XXX b operations (0, a & b, a > b, a, a < b, b, a % b, a | b, !a & !b, a == b, !b, a >= b, !a, a <= b, !a | !b, 1), the following have 50%-50% distributions of 0s and 1s, assuming a and b have 50%-50% distributions of 0s and 1s: a, b, !a, !b, a % b, a == b, i. e., the opposite of XOR (EQUIV) could have been used as well...
  • Nate Murray
    Nate Murray about 13 years
    Greg, this is an awesome answer. The light bulb went on for me after I saw your original answer and wrote out my own truth tables. I considered @Massa's answer about how there are 6 suitable operations for maintaining the distribution. And while a, b, !a, !b will have the same distribution as their respective inputs, you lose the entropy of the other input. That is, XOR is most suitable for the purpose of combining hashes because we want to capture entropy from both a and b.
  • Paŭlo Ebermann
    Paŭlo Ebermann almost 13 years
    One could say that OR is bitwise max, and AND is bitwise min.
  • Corey Ogburn
    Corey Ogburn almost 13 years
    Very well stated Paulo Ebermann. Nice to see you here as well as Crypto.SE!
  • Paŭlo Ebermann
    Paŭlo Ebermann almost 13 years
    I created a filter which includes me everything tagged cryptography, also changes to old questions. This way I found your answer here.
  • Tamás Szelei
    Tamás Szelei almost 12 years
    Here is a paper that explains that combining hashes securely where each function is called only once is not possible without outputting less bits than the sum of number of bits in each hash value. This suggest that this answer is not correct.
  • Alex S
    Alex S about 11 years
    @fish: That paper describes building secure hashes from a secure/possibly-insecure pair. I saw nothing about combining two secure hashes. In any event, I think this discussion has more to do with the use of hashes in randomised algorithms (where there are numerous good tricks that will do the job) than in cryptography, where a huge amount of care must be taken to thwart cryptanalysis.
  • supercat
    supercat over 10 years
    Sometimes commutativity is a good thing, but xor is a lousy choice even then because all pairs of matching items will get hashed to zero. An arithmetic sum is better; the hash of a pair of matching items will retain only 31 bits of useful data rather than 32, but that's a lot better than retaining zero. Another option may be to compute the arithmetic sum as a long and then munge the upper portion back in with the lower portion.
  • StefanKarpinski
    StefanKarpinski about 10 years
    m = 3 is actually a good choice and very fast on many systems. Note that for any odd m integer multiplication is modulo 2^32 or 2^64 and is therefore invertible so you're not losing any bits.
  • disruptive
    disruptive about 10 years
    What happens when you go beyond MaxInt?
  • Buge
    Buge almost 10 years
    @Massa I've never seen % used for XOR or not equal.
  • TermoTux
    TermoTux almost 10 years
    instead of any odd number one should choose a prime
  • user60561
    user60561 over 9 years
    I hope the contrived example never happens, passwords should be salted.
  • Dave
    Dave over 8 years
    Nice answer Yakk. Does this algorithm work equally well on both 32bit and 64bit systems? Thanks.
  • Yakk - Adam Nevraumont
    Yakk - Adam Nevraumont over 8 years
    @dave add more bits to 0x9e3779b9.
  • Dave
    Dave over 8 years
    @Yakk Thanks. For anyone else listening, I doubled the binary bits of the 32bit case ( 0x9e3779b9 ) for a 64bit value of ( 0x9e3779b99e377800 ) and switch which to use by testing cpp macros i386 (32 bit intel) and x86_64 (64 bit intel)
  • Yakk - Adam Nevraumont
    Yakk - Adam Nevraumont over 8 years
    @dave use a base 2 fractional irrational value for max entropy.
  • Dave
    Dave over 8 years
    @Yakk Oh! Thank you, I'd forgotten the number wasn't just any constant, which you explained so well above. :) Using your inverse of the golden ratio as a 64 bit fixed point number, I come up with this, which I'll use instead for the 64bit case: 0x9e3779b97f492000. Does it matter that this constant is even? Would it be better to add a one to the end of it?
  • Yakk - Adam Nevraumont
    Yakk - Adam Nevraumont over 8 years
    @Dave Not sure; but it ending with 000 is suspicious; that value probably has double bits of precision, not 64.
  • Dave
    Dave over 8 years
    @Yakk I used a couple online converters to come up with the numbers (probably written in javascript), so hmm... you're right, I doubt anyone is trying to be more precise than double. I'll re-examine. Also, oops stack overflow formatting prints the macros wrong in my earlier comment. They are __i386__ and __x86_64__ (with leading and trailing double-underlines)
  • Dave
    Dave over 8 years
    OK, to be complete... here is the full precision 64bit constant (calculated with long doubles, and unsigned long longs): 0x9e3779b97f4a7c16. Interestingly it is still even. Re-doing the same calculation using PI instead of the Golden Ratio produces: 0x517cc1b727220a95 which is odd, instead of even, thus probably "more prime" than the other constant. I used: std::cout << std::hex << (unsigned long long) ((1.0L/3.14159265358979323846264338327950288419716939937510L‌​)*(powl(2.0L,64.0L))‌​) << std::endl; with cout.precision( numeric_limits<long double>::max_digits10 ); Thanks again Yakk.
  • tpk
    tpk almost 8 years
    @GregHewgill, I know this thread is old; trying my luck. will XOR(A,B) will generate a unique bit sequence if A and B are unique and have same length?
  • Greg Hewgill
    Greg Hewgill almost 8 years
    @tpk: No, the result is not unique. There are many different ways to generate a given result R from R = A XOR B. For example, consider 0010 XOR 1100, and 1111 XOR 0001. Both give the result 1110.
  • Drew Noakes
    Drew Noakes over 7 years
    As Yakk points out, XOR can be dangerous as it produces zero for identical values. This means (a,a) and (b,b) both produce zero, which in many (most?) cases greatly increases the likelihood of collisions in hash-based data structures.
  • Drew Noakes
    Drew Noakes over 7 years
    @2943 consider XORing two bytes has 256*256 possible input values, and only 256 output values. It's not possible to come up with a unique output given two inputs, assuming all three values have the same options.
  • Scott Carey
    Scott Carey over 7 years
    @Dave the inverse golden ratio rule for these cases is the first odd number equal to or larger than the calculation you are doing. So just add 1. It is an important number because the sequence of N * the ratio, mod the max size (2^64 here) places the next value in the sequence exactly at that ratio in the middle of the largest 'gap' in numbers. Search the web for "Fibonacci hashing" for more info.
  • Scott Carey
    Scott Carey over 7 years
    XOR is fine, if you are combining two different quality hash functions. For example SHA1(A) XOR SipHash(B) (mashed together at equal length, of course)
  • Scott Carey
    Scott Carey over 7 years
    Java's default "multply by 31 and add / accumulate" hash is loaded with collisions (e.g. any string collides with string + "AA" IIRC) and they long ago wished they had not baked in that algorithm into the spec. That said, using a larger odd number with more bits set, and adding a shifts or rotations fixes that problem. MurmurHash3's 'mix' does this.
  • Alex S
    Alex S over 7 years
    @Infinum that's not necessary when combining hashes.
  • migle
    migle over 6 years
    This is not really a very good answer. It addresses the matter probabilistically, without considering cross probabilities. The question was "Why is XOR the default way to combine hashes", and XOR shouldn't be the default, because there will probably be a relation between the two values (two small integers, two letters, etc). And it gets a lot worse if more than two hashes are being combined.
  • migle
    migle over 6 years
    @Dave the right number would be 0.9E3779B97F4A7C15F39... See link. You're could be suffering from the round-to-even rule (which is good for accountants), or simply, if you start with a literal sqrt(5) constant, when you subtract 1, you remove the high order bit, a bit must have been lost.
  • migle
    migle over 6 years
    Also good, but a lot more expensive, would be hash(hash(a)) + hash(b).
  • Dave
    Dave over 6 years
    But, wait, is it 0x0.9e377... or 0x9e377 ? Sorry getting confused since the 32bit version int the main answer uses 0x9e377...
  • Yakk - Adam Nevraumont
    Yakk - Adam Nevraumont over 6 years
    @Dave The hash constant is a fixed-point hex decimal. The decimal isn't part of the encoding, as it is implicitly before the most significant byte of the value. This is a bit confusing as C++ has (recently?) added hex floating point literals, but prior to that a decimal point and hex values wasn't legal C++. In short, omit the decimal point.
  • Peter Cordes
    Peter Cordes over 5 years
    Another way to think about this: XOR is reversible: it doesn't destroy information. You can XOR the same thing again to flip the bits back to what they were. AND and OR aren't reversible.
  • gigabytes
    gigabytes over 5 years
    @Dave, just reading your comments after some years... I think, instead of testing the macros, it would be better to just have two overloads for uint32_t and uint64_t.
  • Casey Rodarmor
    Casey Rodarmor over 5 years
    Why not do H(H(A) || H(B)) where || is concatenate?
  • Alex S
    Alex S over 5 years
    @CaseyRodarmor you could, but stringifying and concatenating two hashes and then computing a third hash is far more expensive than a multiplication and an addition for no improvement in the quality of the hash.
  • manlio
    manlio over 3 years
    In the last paragraph seed should probably be changed with lhs. Great answer!
  • Wolfgang Brehm
    Wolfgang Brehm almost 3 years
    "this means it erases less information" - no. There is the same amount of information when you add two random numbers and truncate or when you xor them. Both results have maximum entropy. The rest is still true though.
  • Peter Gerdes
    Peter Gerdes about 2 years
    Except, sometimes we want our hash to be order agnostic, e.g., when trying to hash an unordered collection.