Good GetHashCode() override for List of Foo objects respecting the order


Solution 1

I'd do it the same way I normally combine hash codes - with an addition and a multiplication:

public override int GetHashCode()
{
    unchecked
    {
        int hash = 19;
        foreach (var foo in foos)
        {
            hash = hash * 31 + foo.GetHashCode();
        }
        return hash;
    }
}

(Note that you shouldn't add anything to the list after this has been used for the key in a hash table of any description, as the hash will change. This also assumes that there are no null entries - if there could be, you need to take account of that.)
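If the list may contain nulls, a minimal sketch of the same add-and-multiply scheme that substitutes 0 for null entries (the `FooList` wrapper class and its `foos` field are illustrative, not from the answer above):

```csharp
using System;
using System.Collections.Generic;

// Sketch: order-dependent add-and-multiply hash, tolerant of null entries.
public class FooList
{
    private readonly List<object> foos = new List<object>();

    public void Add(object foo) => foos.Add(foo);

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 19;
            foreach (var foo in foos)
            {
                // A null entry contributes 0, so the hash stays well-defined.
                hash = hash * 31 + (foo?.GetHashCode() ?? 0);
            }
            return hash;
        }
    }
}
```

Because each element is folded in after a multiplication, the same elements in a different order generally produce a different hash.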

Solution 2

Firstly, double-check that you need a hashcode at all. Are you going to be putting these lists into a hash-mapped structure (e.g. dictionary, hashset, etc)? If not, forget about it.

Now, assuming that you mean that EnumerableObject already overrides Equals(object) (and hopefully therefore also implements IEquatable<EnumerableObject>) for some reason, then this is indeed necessary. You want to balance speed versus bit distribution.
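As a sketch of that pairing (the wrapper class and its element type are hypothetical, chosen so the example stands alone): if `Equals` is sequence equality, `GetHashCode` must be order-sensitive so that equal sequences hash equally.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper: equality is sequence equality, so the hash
// must also depend on element order.
public sealed class EnumerableObject : IEnumerable<int>, IEquatable<EnumerableObject>
{
    private readonly List<int> items;

    public EnumerableObject(IEnumerable<int> source) => items = source.ToList();

    public bool Equals(EnumerableObject other) =>
        other != null && items.SequenceEqual(other.items);

    public override bool Equals(object obj) => Equals(obj as EnumerableObject);

    public override int GetHashCode()
    {
        unchecked
        {
            int res = 0x2D2816FE;
            foreach (var item in items)
                res = res * 31 + item.GetHashCode();
            return res;
        }
    }

    public IEnumerator<int> GetEnumerator() => items.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```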

A good starting point is a mult+add or a shift+xor like:

public override int GetHashCode()
{
    unchecked
    {
        int res = 0x2D2816FE;
        foreach (var item in this)
        {
            res = res * 31 + (item == null ? 0 : item.GetHashCode());
        }
        return res;
    }
}

(This assumes that you are using item.Equals() for your sequence-equality comparison; if you're using an IEqualityComparer<T>'s Equals, you'll need to call into its GetHashCode.)
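A sketch of that comparer-based variant (the `SequenceHash` helper name is mine, not from the answer): by the IEqualityComparer<T> contract, sequences that are equal under the comparer must also hash equally.

```csharp
using System;
using System.Collections.Generic;

// Sketch: hashing a sequence through an IEqualityComparer<T>, so that
// sequences equal under that comparer also produce equal hashes.
public static class SequenceHash
{
    public static int Compute<T>(IEnumerable<T> items, IEqualityComparer<T> comparer)
    {
        unchecked
        {
            int res = 0x2D2816FE;
            foreach (var item in items)
            {
                // Null contributes 0; some comparers throw on null input.
                res = res * 31 + (item == null ? 0 : comparer.GetHashCode(item));
            }
            return res;
        }
    }
}
```

For example, with StringComparer.OrdinalIgnoreCase, ["A", "B"] and ["a", "b"] hash identically, matching the comparer's notion of equality.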

From there we can optimise.

If null items are disallowed, remove the null-check (but be careful: this will make the code throw if it ever does find a null).

If very large lists are common, we need to reduce the number of items examined while trying not to cause lots of collisions. Compare the following implementations:

public override int GetHashCode()
{
    unchecked
    {
        int res = 0x2D2816FE;
        int max = Math.Min(Count, 16);
        for (int i = 0; i != max; ++i)
        {
            var item = this[i];
            res = res * 31 + (item == null ? 0 : item.GetHashCode());
        }
        return res;
    }
}

public override int GetHashCode()
{
    unchecked
    {
        int res = 0x2D2816FE;
        int min = Math.Max(-1, Count - 16);
        for (int i = Count - 1; i != min; --i)
        {
            var item = this[i];
            res = res * 31 + (item == null ? 0 : item.GetHashCode());
        }
        return res;
    }
}

public override int GetHashCode()
{
    unchecked
    {
        int res = 0x2D2816FE;
        int step = Count / 16 + 1;
        for (int i = 0; i < Count; i += step)
        {
            var item = this[i];
            res = res * 31 + (item == null ? 0 : item.GetHashCode());
        }
        return res;
    }
}

Each of these restricts the total number of items examined, which speeds execution but risks poorer-quality hashes. Which (if any) is best depends on whether collections with the same start or the same end are more likely.

Changing the number 16 above adjusts the balance: smaller is faster, while larger gives better hash quality with a lower risk of collisions.
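As a quick sanity check of the sampling idea, here is the step variant as a standalone helper (the `SampledHash` name is mine). Note that index 0 is always sampled, so a change at the front of the list is always reflected in the hash:

```csharp
using System;
using System.Collections.Generic;

// Sketch: examines at most ~16 evenly spaced items, so the cost stays
// bounded even for very large lists.
public static class SampledHash
{
    public static int Compute(IReadOnlyList<object> list)
    {
        unchecked
        {
            int res = 0x2D2816FE;
            int step = list.Count / 16 + 1;
            for (int i = 0; i < list.Count; i += step)
            {
                var item = list[i];
                res = res * 31 + (item == null ? 0 : item.GetHashCode());
            }
            return res;
        }
    }
}
```

Changes at unsampled indices are invisible to this hash; that is the collision risk being traded for speed.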

Edit: And now you can use my implementation of SpookyHash v. 2:

public override int GetHashCode()
{
    var hasher = new SpookyHash(); // use the overloads that take seeds if you need to prevent HashDoS
    foreach (var item in this)
    {
        hasher.Update(item.GetHashCode()); // or relevant fields of item, etc.
    }
    return hasher.Final().GetHashCode();
}

This will create a much better distribution than mult+add or shift+xor, while also being particularly fast (especially in 64-bit processes as the algorithm is optimised for that, though it works well on 32-bit too).

Solution 3

For reference types, the default .GetHashCode() returns a hash based on the object reference, not the object's contents. This is because calculating the hash code of every item in an enumerable list can be very time-intensive. Instead of overriding the existing behaviour, I prefer to use an extension method and use it only where the hash code needs to be determined deterministically:

public static class EnumerableExtensions
{
    public static int GetSequenceHashCode<TItem>(this IEnumerable<TItem> list)
    {
        if (list == null) return 0;
        const int seedValue = 0x2D2816FE;
        const int primeNumber = 397;
        return list.Aggregate(seedValue, (current, item) =>
            unchecked(current * primeNumber + (item == null ? 0 : item.GetHashCode())));
    }
}
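A self-contained usage sketch of that extension (the method is reproduced here, with an `unchecked` expression and a plain null check, so the snippet compiles on its own):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerableExtensions
{
    // Order-dependent sequence hash; a null list hashes to 0.
    public static int GetSequenceHashCode<TItem>(this IEnumerable<TItem> list)
    {
        if (list == null) return 0;
        const int seedValue = 0x2D2816FE;
        const int primeNumber = 397;
        return list.Aggregate(seedValue, (current, item) =>
            unchecked(current * primeNumber + (item == null ? 0 : item.GetHashCode())));
    }
}
```

Usage: `new[] { 1, 2 }.GetSequenceHashCode()` and `new[] { 2, 1 }.GetSequenceHashCode()` generally differ, while two calls on the same sequence always agree within a process.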
Author: Ben B.

Updated on January 11, 2020

Comments

  • Ben B.
    Ben B. over 4 years

    EnumerableObject : IEnumerable<Foo>

    wraps a List<Foo>

    If EnumerableObject a.SequenceEquals( EnumerableObject b), then they are equal.

    Therefore, a GetHashCode must be implemented. The problem is that XORing each element in the list will return the same hash code for any list with all and only the same elements, regardless of order. This works, but will result in many collisions, which will slow down retrieval, etc.

    What is a good, fast GetHashCode method for lists of objects that is order dependent?
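    (To illustrate the XOR problem described above; the demo helper is mine, not from the thread.) XOR is commutative, so every permutation of the same elements collides:

```csharp
using System;
using System.Collections.Generic;

// XOR is order-insensitive: any permutation of the same elements
// produces the same hash, which is exactly the collision problem above.
public static class XorHashDemo
{
    public static int XorHash(IEnumerable<int> items)
    {
        int h = 0;
        foreach (var i in items)
            h ^= i.GetHashCode();
        return h;
    }
}
```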

  • Vlad
    Vlad over 12 years
    Strange enough, List<T>'s implementation of GetHashCode is just inherited from object.
  • Ben B.
    Ben B. over 12 years
    Because List<T> equality is not sequence equality.
  • Jon Skeet
    Jon Skeet over 11 years
    @MK_Dev: Well, they're both primes, which generally helps. I'll confess to not understanding the maths behind why this hash generally works well. Multiplication by 31 is easily optimizable to a shift and a subtraction, which makes it attractive. There's a page explaining this kind of hash somewhere around, but I can't remember where, I'm afraid :(
  • Chris Nielsen
    Chris Nielsen over 10 years
    Can you explain where 0x2D2816FE came from? Google suggests this is .Net's hash code for an empty string, but I am not sure why that would be a good starting value. Jon Skeet's answer used 19 in a similar way, but I don't understand that either.
  • Jon Hanna
    Jon Hanna over 10 years
    @ChrisNielsen it's arbitrary, and yes, I just borrowed it from that use. The main thing is that it shouldn't be zero, so as to have a different code for an empty list and for null (most uses will assign zero to null objects), as those are two particularly common cases.
  • nawfal
    nawfal over 10 years
    @JonHanna you would want a unchecked scope for GetHashCode there.
  • David Schwartz
    David Schwartz over 10 years
    @Jon You're probably referring to this page, which you mentioned in this answer.
  • Jon Skeet
    Jon Skeet over 10 years
    @DavidSchwartz: Hmm - I think I've seen a longer page about Bernstein, but that's definitely better than nothing :)
  • nawfal
    nawfal over 10 years
    @JonSkeet a null check in the hash function will add to completeness :)
  • Jon Skeet
    Jon Skeet over 10 years
    @nawfal: I'd rather not distract from the general approach, to be honest. Hopefully anyone with a collection including null references can figure it out...
  • supercat
    supercat over 10 years
    If the list will never be mutated, the time to hash everything once will be a fixed multiple of the time spent building the list. I would favor hashing every item but caching the hash value. If the list will support limited forms of mutation (e.g. Add only), it may be possible to maintain an incrementally-updated hash value as the list grows. Also, in many cases I'd suggest keeping more than 32 bits of state within the hashing loop, and rendering the result to 32 bits at the end. That will help reduce the danger of "systematic" collisions.
  • Jon Hanna
    Jon Hanna over 10 years
    @supercat almost all uses of GetHashCode() will memoise the hash obtained, making the value of caching it within the method limited.
  • supercat
    supercat over 10 years
    @JonHanna: It's true that things like Dictionary will memoise the hash values of objects that are added; since adding an item requires calling GetHashCode() to figure out where it goes, storing the value poses minimal extra overhead. On the other hand, if an object gets stored in or checked against more than one dictionary, or if a dictionary is asked repeatedly for an item it does not contain, each such action will result in another GetHashCode call. Unless one knows the list will be small and repeated calls to nested items' GetHashCode() will be fast, caching would seem wise.
  • Jon Hanna
    Jon Hanna over 10 years
    @supercat that's assuming it is in fact the same instance that is hashed each time, when if anything the point of overriding is that it might not be, and often won't. I'd also consider that if calling GetHashCode() is particularly expensive, the problem is elsewhere, which funnily enough has been a scratch I've been itching of late (I wouldn't suggest the answer above any more, but something based on bitbucket.org/JonHanna/spookilysharp or at least on what it'll soon be after the initial ironing-out-the-wrinkles period).
  • supercat
    supercat over 10 years
    @JonHanna: The point of overriding is that it won't always be the same instance. The point of caching is that GetHashCode() may sometimes end up getting called multiple times on the same instance. That doesn't have to happen very often for caching to be a "win", and in some cases it can make a really huge difference (if the items cache their hash codes, it may not matter so much if the collection does, but if caching isn't considered worthwhile for the collection, why should one expect that it was considered worthwhile for the items)?
  • Rand Random
    Rand Random almost 8 years
    @JonSkeet does that solution produce consistent result with List of string across multiple AppDomains? I have a WCF Server where multiple application from various platforms (XP, Vista, Win7 or different .Net Frameworks 3.5 and higher) can connect and in all those situations I need a consistent hashcode from a list of strings is that the case? If not how would I achieve that?
  • Jon Skeet
    Jon Skeet almost 8 years
    @RandRandom: You shouldn't be using GetHashCode for that - it's not meant to be consistent across different processes. You should use something like SHA-256 for that.
  • Rand Random
    Rand Random almost 8 years
    @JonSkeet can you share your idea in that question: stackoverflow.com/questions/37880511/…
  • Muhammad Rehan Saeed
    Muhammad Rehan Saeed almost 5 years
    What about null vs empty collections? In this case, both will have hash codes of 19. should they be different? Why would you want that to be so?
  • Jon Skeet
    Jon Skeet almost 5 years
    @MuhammadRehanSaeed: What do you mean by a "null collection"? You can't call GetHashCode on a null collection. If foos is null, the code in the answer will throw an exception. If it can be null, then the method should take that into account. Whether null and empty should be considered equal is a context-specific decision.
  • Muhammad Rehan Saeed
    Muhammad Rehan Saeed almost 5 years
    @JonSkeet I meant if foos could be null. Why would you differentiate null from an empty collection. Under what context is such a differentiation valid? A colleague is proposing to do so, and I have no answer. stackoverflow.com/questions/56557128/…
  • Jon Skeet
    Jon Skeet almost 5 years
    @MuhammadRehanSaeed: You should ask your colleague rather than us. But if both states are valid, it seems perfectly reasonable to differentiate between them. (Someone carrying an empty box isn't the same as someone not carrying a box at all...)
  • JobaDiniz
    JobaDiniz almost 3 years
    why uncheked?
  • Jon Skeet
    Jon Skeet almost 3 years
    @JobaDiniz: Because it's entirely normal for hash code operations to overflow, and you really don't want that to throw an exception, even if the rest of the application is in a checked context.