Good GetHashCode() override for List of Foo objects respecting the order


Solution 1

I'd do it the same way I normally combine hash codes - with an addition and a multiplication:

public override int GetHashCode()
{
    unchecked
    {
        int hash = 19;
        foreach (var foo in foos)
        {
            hash = hash * 31 + foo.GetHashCode();
        }
        return hash;
    }
}

(Note that you shouldn't add anything to the list after this has been used for the key in a hash table of any description, as the hash will change. This also assumes that there are no null entries - if there could be, you need to take account of that.)
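If the list may contain nulls, a minimal sketch of the same add-and-multiply scheme that substitutes 0 for null entries (the `FooList` wrapper class and its `foos` field are illustrative, not from the answer above):

```csharp
using System;
using System.Collections.Generic;

// Sketch: order-dependent add-and-multiply hash, tolerant of null entries.
public class FooList
{
    private readonly List<object> foos = new List<object>();

    public void Add(object foo) => foos.Add(foo);

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 19;
            foreach (var foo in foos)
            {
                // A null entry contributes 0, so the hash stays well-defined.
                hash = hash * 31 + (foo?.GetHashCode() ?? 0);
            }
            return hash;
        }
    }
}
```

Because each element is folded in after a multiplication, the same elements in a different order generally produce a different hash.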

Solution 2

Firstly, double-check that you need a hashcode at all. Are you going to be putting these lists into a hash-mapped structure (e.g. dictionary, hashset, etc)? If not, forget about it.

Now, assuming that you mean that EnumerableObject already overrides Equals(object) (and hopefully therefore also implements IEquatable<EnumerableObject>) for some reason, then this is indeed necessary. You want to balance speed versus bit distribution.
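As a sketch of that pairing (the wrapper class and its element type are hypothetical, chosen so the example stands alone): if `Equals` is sequence equality, `GetHashCode` must be order-sensitive so that equal sequences hash equally.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper: equality is sequence equality, so the hash
// must also depend on element order.
public sealed class EnumerableObject : IEnumerable<int>, IEquatable<EnumerableObject>
{
    private readonly List<int> items;

    public EnumerableObject(IEnumerable<int> source) => items = source.ToList();

    public bool Equals(EnumerableObject other) =>
        other != null && items.SequenceEqual(other.items);

    public override bool Equals(object obj) => Equals(obj as EnumerableObject);

    public override int GetHashCode()
    {
        unchecked
        {
            int res = 0x2D2816FE;
            foreach (var item in items)
                res = res * 31 + item.GetHashCode();
            return res;
        }
    }

    public IEnumerator<int> GetEnumerator() => items.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```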

A good starting point is a mult+add or a shift+xor like:

public override int GetHashCode()
{
    unchecked
    {
        int res = 0x2D2816FE;
        foreach (var item in this)
        {
            res = res * 31 + (item == null ? 0 : item.GetHashCode());
        }
        return res;
    }
}

(This assumes that you are using item.Equals() for your sequence-equality comparison; if you're using an IEqualityComparer<T>'s Equals, you'll need to call into its GetHashCode.)
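A sketch of that comparer-based variant (the `SequenceHash` helper name is mine, not from the answer): by the IEqualityComparer<T> contract, sequences that are equal under the comparer must also hash equally.

```csharp
using System;
using System.Collections.Generic;

// Sketch: hashing a sequence through an IEqualityComparer<T>, so that
// sequences equal under that comparer also produce equal hashes.
public static class SequenceHash
{
    public static int Compute<T>(IEnumerable<T> items, IEqualityComparer<T> comparer)
    {
        unchecked
        {
            int res = 0x2D2816FE;
            foreach (var item in items)
            {
                // Null contributes 0; some comparers throw on null input.
                res = res * 31 + (item == null ? 0 : comparer.GetHashCode(item));
            }
            return res;
        }
    }
}
```

For example, with StringComparer.OrdinalIgnoreCase, ["A", "B"] and ["a", "b"] hash identically, matching the comparer's notion of equality.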

From there we can optimise.

If null items are disallowed, remove the null-check (but be careful: this will make the code throw if it ever does find a null).

If very large lists are common, we need to reduce the number of items examined while trying not to cause lots of collisions. Compare the following implementations:

public override int GetHashCode()
{
    unchecked
    {
        int res = 0x2D2816FE;
        int max = Math.Min(Count, 16);
        for (int i = 0; i != max; ++i)
        {
            var item = this[i];
            res = res * 31 + (item == null ? 0 : item.GetHashCode());
        }
        return res;
    }
}

public override int GetHashCode()
{
    unchecked
    {
        int res = 0x2D2816FE;
        int min = Math.Max(-1, Count - 16);
        for (int i = Count - 1; i != min; --i)
        {
            var item = this[i];
            res = res * 31 + (item == null ? 0 : item.GetHashCode());
        }
        return res;
    }
}

public override int GetHashCode()
{
    unchecked
    {
        int res = 0x2D2816FE;
        int step = Count / 16 + 1;
        for (int i = 0; i < Count; i += step)
        {
            var item = this[i];
            res = res * 31 + (item == null ? 0 : item.GetHashCode());
        }
        return res;
    }
}

Each of these restricts the total number of items examined, which speeds execution but risks poorer-quality hashes. Which (if any) is best depends on whether collections with the same start or the same end are more likely.

Changing the number 16 above adjusts the balance: smaller is faster, while larger gives better hash quality with a lower risk of collisions.
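As a quick sanity check of the sampling idea, here is the step variant as a standalone helper (the `SampledHash` name is mine). Note that index 0 is always sampled, so a change at the front of the list is always reflected in the hash:

```csharp
using System;
using System.Collections.Generic;

// Sketch: examines at most ~16 evenly spaced items, so the cost stays
// bounded even for very large lists.
public static class SampledHash
{
    public static int Compute(IReadOnlyList<object> list)
    {
        unchecked
        {
            int res = 0x2D2816FE;
            int step = list.Count / 16 + 1;
            for (int i = 0; i < list.Count; i += step)
            {
                var item = list[i];
                res = res * 31 + (item == null ? 0 : item.GetHashCode());
            }
            return res;
        }
    }
}
```

Changes at unsampled indices are invisible to this hash; that is the collision risk being traded for speed.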

Edit: And now you can use my implementation of SpookyHash v. 2:

public override int GetHashCode()
{
    var hasher = new SpookyHash(); // use the overloads that take seeds if you need to prevent HashDoS
    foreach (var item in this)
    {
        hasher.Update(item.GetHashCode()); // or relevant fields of item, etc.
    }
    return hasher.Final().GetHashCode();
}

This will create a much better distribution than mult+add or shift+xor, while also being particularly fast (especially in 64-bit processes as the algorithm is optimised for that, though it works well on 32-bit too).

Solution 3

For reference types, the default .GetHashCode() returns a hash based on the object reference, not the object's contents. This is because calculating the hash code of every item in an enumerable list can be very time-intensive. Instead of overriding the existing behaviour, I prefer to use an extension method and use it only where the hash code needs to be determined deterministically:

public static class EnumerableExtensions
{
    public static int GetSequenceHashCode<TItem>(this IEnumerable<TItem> list)
    {
        if (list == null) return 0;
        const int seedValue = 0x2D2816FE;
        const int primeNumber = 397;
        return list.Aggregate(seedValue, (current, item) =>
            unchecked(current * primeNumber + (item == null ? 0 : item.GetHashCode())));
    }
}
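A self-contained usage sketch of that extension (the method is reproduced here, with an `unchecked` expression and a plain null check, so the snippet compiles on its own):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerableExtensions
{
    // Order-dependent sequence hash; a null list hashes to 0.
    public static int GetSequenceHashCode<TItem>(this IEnumerable<TItem> list)
    {
        if (list == null) return 0;
        const int seedValue = 0x2D2816FE;
        const int primeNumber = 397;
        return list.Aggregate(seedValue, (current, item) =>
            unchecked(current * primeNumber + (item == null ? 0 : item.GetHashCode())));
    }
}
```

Usage: `new[] { 1, 2 }.GetSequenceHashCode()` and `new[] { 2, 1 }.GetSequenceHashCode()` generally differ, while two calls on the same sequence always agree within a process.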
Author: Ben B.

Updated on January 11, 2020

Comments

  • Ben B.
    Ben B. over 4 years

    EnumerableObject : IEnumerable<Foo>

    wraps a List<Foo>

    If EnumerableObject a.SequenceEquals( EnumerableObject b), then they are equal.

    Therefore, a GetHashCode must be implemented. The problem is that XORing each element in the list will return the same hash code for any list with all and only the same elements, regardless of order. This works, but will result in many collisions, which will slow down retrieval, etc.

    What is a good, fast GetHashCode method for lists of objects that is order dependent?
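    (To illustrate the XOR problem described above; the demo helper is mine, not from the thread.) XOR is commutative, so every permutation of the same elements collides:

```csharp
using System;
using System.Collections.Generic;

// XOR is order-insensitive: any permutation of the same elements
// produces the same hash, which is exactly the collision problem above.
public static class XorHashDemo
{
    public static int XorHash(IEnumerable<int> items)
    {
        int h = 0;
        foreach (var i in items)
            h ^= i.GetHashCode();
        return h;
    }
}
```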

  • Vlad
    Vlad over 12 years
    Strange enough, List<T>'s implementation of GetHashCode is just inherited from object.
  • Ben B.
    Ben B. over 12 years
    Because List<T> equality is not sequence equality.
  • Jon Skeet
    Jon Skeet over 11 years
    @MK_Dev: Well, they're both primes, which generally helps. I'll confess to not understanding the maths behind why this hash generally works well. Multiplication by 31 is easily optimizable to a shift and a subtraction, which makes it attractive. There's a page explaining this kind of hash somewhere around, but I can't remember where, I'm afraid :(
  • Chris Nielsen
    Chris Nielsen over 10 years
    Can you explain where 0x2D2816FE came from? Google suggests this is .Net's hash code for an empty string, but I am not sure why that would be a good starting value. Jon Skeet's answer used 19 in a similar way, but I don't understand that either.
  • Jon Hanna
    Jon Hanna over 10 years
    @ChrisNielsen it's arbitrary, and yes, I just borrowed it from that use. The main thing is that it shouldn't be zero, so as to have a different code for an empty list and for null (most uses will assign zero to null objects), as those are two particularly common cases.
  • nawfal
    nawfal over 10 years
    @JonHanna you would want a unchecked scope for GetHashCode there.
  • David Schwartz
    David Schwartz over 10 years
    @Jon You're probably referring to this page, which you mentioned in this answer.
  • Jon Skeet
    Jon Skeet over 10 years
    @DavidSchwartz: Hmm - I think I've seen a longer page about Bernstein, but that's definitely better than nothing :)
  • nawfal
    nawfal over 10 years
    @JonSkeet a null check in the hash function will add to completeness :)
  • Jon Skeet
    Jon Skeet over 10 years
    @nawfal: I'd rather not distract from the general approach, to be honest. Hopefully anyone with a collection including null references can figure it out...
  • supercat
    supercat over 10 years
    If the list will never be mutated, the time to hash everything once will be a fixed multiple of the time spent building the list. I would favor hashing every item but caching the hash value. If the list will support limited forms of mutation (e.g. Add only), it may be possible to maintain an incrementally-updated hash value as the list grows. Also, in many cases I'd suggest keeping more than 32 bits of state within the hashing loop, and rendering the result to 32 bits at the end. That will help reduce the danger of "systematic" collisions.
  • Jon Hanna
    Jon Hanna over 10 years
    @supercat almost all uses of GetHashCode() will memoise the hash obtained, making the value of caching it within the method limited.
  • supercat
    supercat over 10 years
    @JonHanna: It's true that things like Dictionary will memoise the hash values of objects that are added; since adding an item requires calling GetHashCode() to figure out where it goes, storing the value poses minimal extra overhead. On the other hand, if an object gets stored in or checked against more than one dictionary, or if a dictionary is asked repeatedly for an item it does not contain, each such action will result in another GetHashCode call. Unless one knows the list will be small and repeated calls to nested items' GetHashCode() will be fast, caching would seem wise.
  • Jon Hanna
    Jon Hanna over 10 years
    @supercat that's assuming it is in fact the same instance that is hashed each time, when if anything the point of overriding is that it might not be, and often won't. I'd also consider that if calling GetHashCode() is particularly expensive, the problem is elsewhere, which funnily enough has been a scratch I've been itching of late (I wouldn't suggest the answer above any more, but something based on bitbucket.org/JonHanna/spookilysharp or at least on what it'll soon be after the initial ironing-out-the-wrinkles period).
  • supercat
    supercat over 10 years
    @JonHanna: The point of overriding is that it won't always be the same instance. The point of caching is that GetHashCode() may sometimes end up getting called multiple times on the same instance. That doesn't have to happen very often for caching to be a "win", and in some cases it can make a really huge difference (if the items cache their hash codes, it may not matter so much if the collection does, but if caching isn't considered worthwhile for the collection, why should one expect that it was considered worthwhile for the items)?
  • Rand Random
    Rand Random almost 8 years
    @JonSkeet does that solution produce consistent result with List of string across multiple AppDomains? I have a WCF Server where multiple application from various platforms (XP, Vista, Win7 or different .Net Frameworks 3.5 and higher) can connect and in all those situations I need a consistent hashcode from a list of strings is that the case? If not how would I achieve that?
  • Jon Skeet
    Jon Skeet almost 8 years
    @RandRandom: You shouldn't be using GetHashCode for that - it's not meant to be consistent across different processes. You should use something like SHA-256 for that.
  • Rand Random
    Rand Random almost 8 years
    @JonSkeet can you share your idea in that question: stackoverflow.com/questions/37880511/…
  • Muhammad Rehan Saeed
    Muhammad Rehan Saeed almost 5 years
    What about null vs empty collections? In this case, both will have hash codes of 19. should they be different? Why would you want that to be so?
  • Jon Skeet
    Jon Skeet almost 5 years
    @MuhammadRehanSaeed: What do you mean by a "null collection"? You can't call GetHashCode on a null collection. If foos is null, the code in the answer will throw an exception. If it can be null, then the method should take that into account. Whether null and empty should be considered equal is a context-specific decision.
  • Muhammad Rehan Saeed
    Muhammad Rehan Saeed almost 5 years
    @JonSkeet I meant if foos could be null. Why would you differentiate null from an empty collection. Under what context is such a differentiation valid? A colleague is proposing to do so, and I have no answer. stackoverflow.com/questions/56557128/…
  • Jon Skeet
    Jon Skeet almost 5 years
    @MuhammadRehanSaeed: You should ask your colleague rather than us. But if both states are valid, it seems perfectly reasonable to differentiate between them. (Someone carrying an empty box isn't the same as someone not carrying a box at all...)
  • JobaDiniz
    JobaDiniz almost 3 years
    why uncheked?
  • Jon Skeet
    Jon Skeet almost 3 years
    @JobaDiniz: Because it's entirely normal for hash code operations to overflow, and you really don't want that to throw an exception, even if the rest of the application is in a checked context.