Quick and Simple Hash Code Combinations


Solution 1

I would personally avoid XOR - it means that any two equal values will result in 0 - so hash(1, 1) == hash(2, 2) == hash(3, 3) etc. Also hash(5, 0) == hash(0, 5) etc which may come up occasionally. I have deliberately used it for set hashing - if you want to hash a sequence of items and you don't care about the ordering, it's nice.
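That order-insensitive use can be sketched like this (the `SetHash` name is mine, not anything in the framework):

```csharp
using System.Collections.Generic;

static class SetHashing
{
    // Order-insensitive combination via XOR: the same elements in any
    // order give the same result, which is what you want for set-like
    // equality (and exactly what you don't want for ordered tuples).
    public static int SetHash(IEnumerable<int> itemHashes)
    {
        int hash = 0;
        foreach (int h in itemHashes)
            hash ^= h;
        return hash;
    }
}
```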

I usually use:

unchecked
{
    int hash = 17;
    hash = hash * 31 + firstField.GetHashCode();
    hash = hash * 31 + secondField.GetHashCode();
    return hash;
}

That's the form that Josh Bloch suggests in Effective Java. Last time I answered a similar question I managed to find an article where this was discussed in detail - IIRC, no-one really knows why it works well, but it does. It's also easy to remember, easy to implement, and easy to extend to any number of fields.
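Applied to a whole type, the pattern looks like this (a sketch with hypothetical Name and Age fields; the null-conditional guards against a null field):

```csharp
using System;

// Sketch: the 17/31 pattern from above applied to a two-field type.
public sealed class Person : IEquatable<Person>
{
    public string Name { get; }
    public int Age { get; }

    public Person(string name, int age) { Name = name; Age = age; }

    public bool Equals(Person other) =>
        other != null && Name == other.Name && Age == other.Age;

    public override bool Equals(object obj) => Equals(obj as Person);

    public override int GetHashCode()
    {
        unchecked // overflow is expected and harmless here
        {
            int hash = 17;
            hash = hash * 31 + (Name?.GetHashCode() ?? 0);
            hash = hash * 31 + Age.GetHashCode();
            return hash;
        }
    }
}
```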

Solution 2

If you are using .NET Core 2.1 or later or .NET Framework 4.6.1 or later, consider using the System.HashCode struct to help with producing composite hash codes. It has two modes of operation: Add and Combine.

An example using Combine, which is usually simpler and works for up to eight items:

public override int GetHashCode()
{
    return HashCode.Combine(object1, object2);
}

An example of using Add:

public override int GetHashCode()
{
    var hash = new HashCode();
    hash.Add(this.object1);
    hash.Add(this.object2);
    return hash.ToHashCode();
}

Pros:

  • Part of .NET itself, as of .NET Core 2.1/.NET Standard 2.1
    • For .NET Framework 4.6.1 and later, the Microsoft.Bcl.HashCode NuGet package can be used to backport this type.
  • Looks to have good performance and mixing characteristics, based on the work the author and the reviewers did before merging this into the corefx repo
  • Handles nulls automatically
  • Overloads that take IEqualityComparer instances
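For example, the IEqualityComparer overload lets a case-insensitive type keep its hashing consistent with its equality (a sketch; the CaseInsensitiveKey type is hypothetical):

```csharp
using System;

public sealed class CaseInsensitiveKey
{
    public string Value { get; }
    public CaseInsensitiveKey(string value) => Value = value;

    public override bool Equals(object obj) =>
        obj is CaseInsensitiveKey other &&
        StringComparer.OrdinalIgnoreCase.Equals(Value, other.Value);

    public override int GetHashCode()
    {
        var hash = new HashCode();
        // Hash with the same comparer used for equality, so "ABC" and
        // "abc" land in the same hash bucket.
        hash.Add(Value, StringComparer.OrdinalIgnoreCase);
        return hash.ToHashCode();
    }
}
```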

Solution 3

While the template outlined in Jon Skeet's answer works well in general as a hash function family, the choice of constants is important: the seed of 17 and factor of 31 noted in that answer do not work well at all for common use cases. In most use cases, the hashed values are much closer to zero than to int.MaxValue, and the number of items being jointly hashed is a few dozen or less.

For hashing an integer tuple {x, y} where -1000 <= x <= 1000 and -1000 <= y <= 1000, the 17/31 scheme has an abysmal collision rate of almost 98.5%. For example, {1, 0} collides with {0, 31}, {1, 1} with {0, 32}, and so on. If we expand the coverage to also include n-tuples where 3 <= n <= 25, it fares less badly, with a collision rate of about 38%. But we can do much better.

public static int CustomHash(int seed, int factor, params int[] vals)
{
    int hash = seed;
    foreach (int i in vals)
    {
        hash = (hash * factor) + i;
    }
    return hash;
}

I wrote a Monte Carlo sampling search loop that tested the method above with various values for seed and factor over various random n-tuples of random integers i. Allowed ranges were 2 <= n <= 25 (where n was random but biased toward the lower end of the range) and -1000 <= i <= 1000. At least 12 million unique collision tests were performed for each seed and factor pair.
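A greatly reduced version of that measurement can be sketched as follows; this is not the original search harness, just an estimate of the collision rate for a single seed/factor pair over random 2-tuples:

```csharp
using System;
using System.Collections.Generic;

static class CollisionEstimate
{
    // Estimates the fraction of distinct random 2-tuples whose hashes
    // collide, for the given seed and factor.
    public static double CollisionRate(int seed, int factor, int samples)
    {
        var rng = new Random(12345);            // fixed seed for repeatability
        var seen = new HashSet<int>();
        var inputs = new HashSet<(int, int)>(); // skip duplicate input tuples
        int collisions = 0, tested = 0;

        while (tested < samples)
        {
            var t = (rng.Next(-1000, 1001), rng.Next(-1000, 1001));
            if (!inputs.Add(t))
                continue; // already tested this tuple

            // Equivalent to CustomHash(seed, factor, x, y) for a 2-tuple.
            int hash = unchecked((seed * factor + t.Item1) * factor + t.Item2);
            if (!seen.Add(hash))
                collisions++;
            tested++;
        }
        return (double)collisions / tested;
    }
}
```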

After about 7 hours running, the best pair found (where the seed and factor were both limited to 4 digits or less) was: seed = 1009, factor = 9176, with a collision rate of 0.1131%. In the 5- and 6-digit areas, even better options exist. But I selected the top 4-digit performer for brevity, and it performs quite well in all common int and char hashing scenarios. It also seems to work fine with integers of much greater magnitudes.
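Baking those constants in gives a drop-in helper (a sketch; the CustomHashDefault name is mine, echoing a default-parameter suggestion from the comments):

```csharp
static class CustomHashing
{
    // CustomHash specialized to the best 4-digit constants from the
    // search above: seed = 1009, factor = 9176.
    public static int CustomHashDefault(params int[] vals)
    {
        int hash = 1009;
        foreach (int i in vals)
            hash = unchecked(hash * 9176 + i);
        return hash;
    }
}
```

It could then be used as, e.g., `public override int GetHashCode() => CustomHashing.CustomHashDefault(x, y);` (field names hypothetical).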

It is worth noting that "being prime" did not seem to be a general prerequisite for good performance as a seed and/or factor although it likely helps. 1009 noted above is in fact prime, but 9176 is not. I explicitly tested variations on this where I changed factor to various primes near 9176 (while leaving seed = 1009) and they all performed worse than the above solution.

Lastly, I also compared against the generic ReSharper-recommended function family of the form hash = (hash * factor) ^ i;, and the original CustomHash() above seriously outperforms it. The ReSharper XOR style seems to have collision rates in the 20-30% range under common use-case assumptions and, in my opinion, should not be used.

Solution 4

Use the combination logic built into tuples. This example uses C# 7 value tuples.

(field1, field2).GetHashCode();
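Inside a type, that one-liner becomes the whole GetHashCode override (a sketch with a hypothetical Point type; note from the comments that this is ValueTuple, a struct, so nothing is heap-allocated):

```csharp
public sealed class Point
{
    public int X { get; }
    public int Y { get; }

    public Point(int x, int y) { X = x; Y = y; }

    public override bool Equals(object obj) =>
        obj is Point p && X == p.X && Y == p.Y;

    // Delegates the hashing entirely to the value tuple's combiner.
    public override int GetHashCode() => (X, Y).GetHashCode();
}
```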

Solution 5

I presume that the .NET Framework team did a decent job testing their System.String.GetHashCode() implementation, so I would use it:

// System.String.GetHashCode(): http://referencesource.microsoft.com/#mscorlib/system/string.cs,0a17bbac4851d0d4
// System.Web.Util.StringUtil.GetStringHashCode(System.String): http://referencesource.microsoft.com/#System.Web/Util/StringUtil.cs,c97063570b4e791a
public static int CombineHashCodes(IEnumerable<int> hashCodes)
{
    unchecked // overflow is expected; GetHashCode must not throw
    {
        int hash1 = (5381 << 16) + 5381;
        int hash2 = hash1;

        int i = 0;
        foreach (var hashCode in hashCodes)
        {
            // Alternate between the two accumulators, as String.GetHashCode does.
            if (i % 2 == 0)
                hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ hashCode;
            else
                hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ hashCode;

            ++i;
        }

        return hash1 + (hash2 * 1566083941);
    }
}

Another implementation is from the System.Web.Util.HashCodeCombiner.CombineHashCodes(System.Int32, System.Int32) and System.Array.CombineHashCodes(System.Int32, System.Int32) methods. This one is simpler, but probably doesn't have as good a distribution as the method above:

// System.Web.Util.HashCodeCombiner.CombineHashCodes(System.Int32, System.Int32): http://referencesource.microsoft.com/#System.Web/Util/HashCodeCombiner.cs,21fb74ad8bb43f6b
// System.Array.CombineHashCodes(System.Int32, System.Int32): http://referencesource.microsoft.com/#mscorlib/system/array.cs,87d117c8cc772cca
public static int CombineHashCodes(IEnumerable<int> hashCodes)
{
    unchecked // overflow is expected; GetHashCode must not throw
    {
        int hash = 5381;

        foreach (var hashCode in hashCodes)
            hash = ((hash << 5) + hash) ^ hashCode;

        return hash;
    }
}
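A usage sketch for the simpler overload (the Pair type and field names are hypothetical; the combiner is the same djb2-style loop as above, in expression form):

```csharp
using System.Collections.Generic;

public sealed class Pair
{
    public int First { get; }
    public int Second { get; }

    public Pair(int first, int second) { First = first; Second = second; }

    public override int GetHashCode() =>
        CombineHashCodes(new[] { First.GetHashCode(), Second.GetHashCode() });

    // Same djb2-style combiner as the simpler implementation above.
    private static int CombineHashCodes(IEnumerable<int> hashCodes)
    {
        int hash = 5381;
        foreach (var hashCode in hashCodes)
            hash = unchecked(((hash << 5) + hash) ^ hashCode);
        return hash;
    }
}
```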

Updated on July 23, 2022

Comments

  • RobV
    RobV almost 2 years

    Can people recommend quick and simple ways to combine the hash codes of two objects? I am not too worried about collisions, since I have a hash table which will handle that efficiently; I just want something that generates a code as quickly as possible.

    Reading around SO and the web there seem to be a few main candidates:

    1. XORing
    2. XORing with Prime Multiplication
    3. Simple numeric operations like multiplication/division (with overflow checking or wrapping around)
    4. Building a string and then using the String class's GetHashCode method

    What would people recommend and why?

  • RobV
    RobV over 14 years
    How would I tell whether my hash codes are evenly distributed? Is there an easy benchmark for this? I know the collision rate is pretty low, but does that necessarily correspond to an even distribution?
  • RobV
    RobV over 14 years
    Yeah, that was my concern about XORing; in the type of data I'm pairing it's fairly unlikely to be pairing two equal items, but not impossible.
  • ephemient
    ephemient over 14 years
    Looks like Dan Bernstein's (or Chris Torek's) hash, just with different constants. Nobody knows why that works well either.
  • Jon Skeet
    Jon Skeet over 14 years
    @RobV: I prefer not to have to think if I don't have to. I use this hash even when I could get away with XOR, just to avoid having to wonder whether it's safe or not :)
  • Henk Holterman
    Henk Holterman over 14 years
    No, they have a very different purpose and break the rule that GetHashCode should be fast.
  • SuperString
    SuperString over 14 years
    what are the firstField and secondField that you talk about?
  • Jon Skeet
    Jon Skeet over 14 years
    @SuperString: the first and second fields you want to hash. You then keep going for the other fields.
  • Henk Holterman
    Henk Holterman over 14 years
    A word of warning: this is (a variation of) the Bernstein hash, and because nobody knows why it does well in tests it is not advisable when hashing is critical. See eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx. Also, you should wrap this code in an unchecked { } block. GetHashCode() should not throw any exceptions.
  • tofutim
    tofutim over 11 years
    This is so mysterious - why 17? Why 31? Is it important that they are primes?
  • Jon Skeet
    Jon Skeet over 11 years
    @tofutim: The 31 is a good choice as multiplication by 31 can be optimized to a shift and subtract. Whether it is optimized that way depends on the platform. As for why those numbers work well for hashing - as Henk says, it's a bit of a mystery.
  • Patrick M
    Patrick M about 9 years
    Why do you prime (no pun intended) the pump with 17 and 31? Why isn't it just return firstField.GetHashCode() * 31 + secondField.GetHashCode();? Or is that part of what no one knows?
  • Jon Skeet
    Jon Skeet about 9 years
    @PatrickM: I follow Josh Bloch (or some variant), basically. It does mean that it's harder to end up with a hash code of 0...
  • Patrick M
    Patrick M about 9 years
    But is a hash code of 0 any worse than a hash code of 16337? (17 * 31 * 31, which is amusingly close to 3L337...) I suppose having more multiplications means you have bigger hash codes, on the whole. Is there a similar improvement in distribution?
  • Jon Skeet
    Jon Skeet about 9 years
    @PatrickM: I think hash codes of 0 are more likely to occur between different types, raising the chance of a collision. For example, two objects with different numbers of fields, but where all the fields give a hash code of 0, would have different hashes under this scheme. But this is a pretty tenuous argument, I'm willing to admit :)
  • yoyo
    yoyo about 8 years
    The Bernstein hash (DJB, DJB2) uses 33, not 31. And 5381, not 17. But yeah, otherwise very similar. I've used DJB2 for both string hashing and hashcode combination and it's worked fine for me. DJB2 and other algorithms can be found at cse.yorku.ca/~oz/hash.html
  • Tom Leys
    Tom Leys over 7 years
    Wow. I love the work that went into this answer. Impressive, well done!
  • RobV
    RobV over 6 years
    Nice idea though I suspect that this might have issues with GC churn since you are implicitly creating a short lived object
  • rory.ap
    rory.ap over 6 years
    @Jon can you please comment on Special Sauce's answer? I'd like to know what your thoughts are on it.
  • Jon Skeet
    Jon Skeet over 6 years
    @rory.ap: I think it's an excellent piece of work, and I'd be perfectly happy to use those numbers. While I hate to admit to using constants "because someone else said to" that's basically what the 17/31 pair is about.
  • Mike Pedersen
    Mike Pedersen over 6 years
    @RobV Tuples are value types, so they are stack allocated and exert no GC pressure.
  • dynamichael
    dynamichael almost 6 years
    One problem... (0, 1, 2).GetHashCode() and (0, 0, 1, 2).GetHashCode() both yield the same value: 35. Whereas the method in the most upvoted answer yields distinct values: 506480 for (0, 1, 2) and 15699890 for (0, 0, 1, 2).
  • Yepeekai
    Yepeekai almost 6 years
    Hashcodes aren't guaranteed to be unique. You found a case where it isn't... It doesn't make it a bad choice unless there are a lot of collisions(in that case it would be a good idea to submit a bug). I personally prefer to use something from the framework instead of implementing something different.
  • cactuaroid
    cactuaroid almost 6 years
    It's actually the ValueTuple type, a struct (MSDN). Be careful: the Tuple type is a class and does have GC pressure. I like this way. Internally it's similar to @Stipo's post, but very easy to understand and review. It would be a good choice in most cases.
  • M.kazem Akhgary
    M.kazem Akhgary over 5 years
    "Nobody knows" feels like higher beings know it and we can't comprehend it!
  • Cosmin Sontu
    Cosmin Sontu over 5 years
    Starting with .NET Core 2.1 you can use the System.HashCode type's Combine method to do this docs.microsoft.com/en-us/dotnet/api/system.hashcode.combine
  • Eric Ouellet
    Eric Ouellet over 4 years
    Hi, that seems to be a great way to get a good distribution of values. The only thing that bothers me a lot: the process is order-dependent, which means you will get a different hash depending on the order in which you feed the values. That is prone to error. You can add a comparison at the start on the two inner hashes to solve the problem (to ensure the hash is always calculated from the smaller or bigger inner hash first). But keep in mind that this works only for a pair, not more.
  • Eric Ouellet
    Eric Ouellet over 4 years
    Seems to be the best, but there are two remarks. First, the easy one: why not move seed and factor to the end and give them default values (1009 and 9176), which should do the job for most people? Second: like Jon Skeet's algorithm, this is order-dependent, and you can get a different hash if you feed the values in a different order. I wonder whether it would not be safer to sort the array first, to ensure the same final hash regardless of the order in which you feed the algorithm.
  • Jon Skeet
    Jon Skeet over 4 years
    @EricOuellet: I would say that being order-dependent is a good thing. You don't want (1, 0) and (0, 1) to give the same hash code, for example, as that gives more hash collisions.
  • Eric Ouellet
    Eric Ouellet over 4 years
    @JonSkeet, I partly agree. I would say that for some cases it is a good thing, but it really depends on the situation. I think it is important to draw a clear distinction between the two very different ways of combining hash codes (order-dependent versus order-independent); many people seem to underestimate it. It is very case-specific.
  • Jon Skeet
    Jon Skeet over 4 years
    @EricOuellet: It would only be a good thing for it to be order-independent if equality was also defined in an order-independent way, which is rare in my experience.
  • Special Sauce
    Special Sauce over 4 years
    @EricOuellet Because the params int[] vals must come at the end of all function arguments, I was not able to make the seed and factor defaulted parameters. If you don't care about the params syntax convenience, you could remove it and then reorder the parameters to allow the defaults as you suggested.
  • Special Sauce
    Special Sauce over 4 years
    @EricOuellet The default hashing for an array should be to consider permutations (this is the more general case) and thus the hash would be different for different orderings (just as the hash for string "abc" is different than the hash for "acb"). If you specifically wanted a hash function for combinations only, you should probably accept a HashSet<int> argument (assuming no duplicates). Otherwise you could rename the function to CustomHashCombination() to prevent confusion, and do the internal pre-sort as suggested.
  • Gru
    Gru over 4 years
    I like this answer, but I wouldn't use params because it has to allocate an array on every call. So it may be faster computations-wise, but it creates GC pressure for later.
  • ChrisTorng
    ChrisTorng almost 4 years
    You can reference nuget.org/packages/Microsoft.Bcl.HashCode to make it work on .NET Framework 4.6.1 or .NET Standard 2.0, officially.
  • Vivraan
    Vivraan over 3 years
    I'm still unable to find System.HashCode - I get a compiler error when I try to use it.
  • Jon Skeet
    Jon Skeet over 3 years
    @mushi: Which platform are you targeting?
  • Vivraan
    Vivraan over 3 years
    Android/Unity. I later realised github.com/GlitchEnzo/NuGetForUnity exists, so I now have what I need! Except I don't know if it compiles to Android or not.
  • Vivraan
    Vivraan over 3 years
    What were the 5 and 6-digit numbers that got generated? Will 1009 and 9176 work well for hashing in a table of 2^28 (int.MaxValue / 16) values?
  • Mas Biru
    Mas Biru about 3 years
    Best answer; works on both .NET 4.5 and .NET Core. Use the System.ValueTuple NuGet package on .NET 4.5; on Core it's built in.
  • Maxence
    Maxence over 2 years
    System.HashCode uses xxHash32 (github.com/Cyan4973/xxHash)
  • Red Riding Hood
    Red Riding Hood almost 2 years
    Does Bcl.HashCode generate stable hashes for strings?
  • chwarr
    chwarr almost 2 years
    @RedRidingHood, HashCode is not the right type to use for a hash code that gets persisted, which is what I assume you mean by "stable". HashCode values are only stable within the same instance of a process. This has been the documented behavior of .NET GetHashCode and related values for a very, very long time. Though, .NET Core started using a per-process seed to enforce this. If you have more questions, I suggest you post it as a stand-alone question.
  • Red Riding Hood
    Red Riding Hood almost 2 years
    Thanks, I was aware of the .NET Core behavior, I just wasn't sure if the Bcl package was different than what .NET Core is using.