Why is HashSet<Point> so much slower than HashSet<string>?

Solution 1

There are two perf problems induced by the Point struct, something you can see when you add Console.WriteLine(GC.CollectionCount(0)); to the test code. You'll see that the Point test requires ~3720 collections while the string test only needs ~18. That doesn't come for free. When you see a value type induce so many collections, you need to conclude "uh-oh, too much boxing".
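
You can reproduce the measurement with a sketch like this (the loop is the Point test from the question):

using System;
using System.Collections.Generic;
using System.Drawing;

class BoxingDemo {
    static void Main() {
        var points = new HashSet<Point>();
        int gen0 = GC.CollectionCount(0);
        for (int x = 0; x < 1000; x++)
            for (int y = 0; y < 1000; y++)
                points.Add(new Point(x, y));
        // Prints roughly 3720; the HashSet<string> version stays around 18.
        Console.WriteLine(GC.CollectionCount(0) - gen0);
    }
}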

At issue is that HashSet<T> needs an IEqualityComparer<T> to get its job done. Since you did not provide one, it falls back to the one returned by EqualityComparer<T>.Default. That comparer does a good job for string, which implements IEquatable<string>. But not for Point, a type that dates back to .NET 1.0 and never got the generics love. All it can do is use the Object methods, and calling Object.Equals(object) on a struct boxes the argument.
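
You can check that fallback yourself; a quick sketch (the False result applies to .NET Framework, where Point does not implement IEquatable<Point>):

Console.WriteLine(typeof(IEquatable<string>).IsAssignableFrom(typeof(string))); // True
Console.WriteLine(typeof(IEquatable<Point>).IsAssignableFrom(typeof(Point)));   // False on .NET Framework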

The other issue is that Point.GetHashCode() does not do a stellar job in this test: too many collisions, so it hammers Object.Equals() pretty heavily. String has an excellent GetHashCode implementation.

You can solve both problems by providing the HashSet with a good comparer. Like this one:

using System.Collections.Generic;
using System.Drawing;

class PointComparer : IEqualityComparer<Point> {
    public bool Equals(Point x, Point y) {
        return x.X == y.X && x.Y == y.Y;
    }

    public int GetHashCode(Point obj) {
        // Perfect hash for practical bitmaps, their width/height is never >= 65536
        return (obj.Y << 16) ^ obj.X;
    }
}

And use it:

HashSet<Point> list = new HashSet<Point>(new PointComparer());

And it is now about 150 times faster, easily beating the string test.

Solution 2

The main reason for the performance drop is all the boxing going on (as already explained in Hans Passant's answer).

Apart from that, the hash code algorithm worsens the problem because it causes more calls to Equals(object obj), thus increasing the number of boxing conversions.

Also note that the hash code of Point is computed by x ^ y. This produces very little dispersion in your data range, and therefore the buckets of the HashSet are overpopulated — something that doesn't happen with string, where the dispersion of the hashes is much larger.
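
You can measure that dispersion directly; for the 1000x1000 range in the question, x ^ y can only ever produce a 10-bit value (a sketch):

var hashes = new HashSet<int>();
for (int x = 0; x < 1000; x++)
    for (int y = 0; y < 1000; y++)
        hashes.Add(x ^ y);
Console.WriteLine(hashes.Count); // 1024 distinct hash codes for 1,000,000 points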

You can solve that problem by implementing your own Point struct (trivial) and using a better hash algorithm for your expected data range, e.g. by shifting the coordinates:

(x << 16) ^ y
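
A minimal sketch of such a struct (the name PixelPoint and its layout are mine, not from the answer; it assumes non-negative coordinates below 65536):

using System;

struct PixelPoint : IEquatable<PixelPoint> {
    public readonly int X;
    public readonly int Y;

    public PixelPoint(int x, int y) { X = x; Y = y; }

    // Implementing IEquatable<T> lets HashSet<T> compare values without boxing.
    public bool Equals(PixelPoint other) => X == other.X && Y == other.Y;
    public override bool Equals(object obj) => obj is PixelPoint p && Equals(p);

    // Collision-free for coordinates in [0, 65535].
    public override int GetHashCode() => (X << 16) ^ Y;
}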

For some good advice when it comes to hash codes, read Eric Lippert's blog post on the subject.

Comments

  • 41686d6564 stands w. Palestine
    41686d6564 stands w. Palestine about 1 year

    I wanted to store some pixel locations without allowing duplicates, so the first thing that comes to mind is HashSet<Point> or similar classes. However, this seems to be very slow compared to something like HashSet<string>.

    For example, this code:

    HashSet<Point> points = new HashSet<Point>();
    using (Bitmap img = new Bitmap(1000, 1000))
    {
        for (int x = 0; x < img.Width; x++)
        {
            for (int y = 0; y < img.Height; y++)
            {
                points.Add(new Point(x, y));
            }
        }
    }
    

    takes about 22.5 seconds.

    While the following code (which is not a good choice for obvious reasons) takes only 1.6 seconds:

    HashSet<string> points = new HashSet<string>();
    using (Bitmap img = new Bitmap(1000, 1000))
    {
        for (int x = 0; x < img.Width; x++)
        {
            for (int y = 0; y < img.Height; y++)
            {
                points.Add(x + "," + y);
            }
        }
    }
    

    So, my questions are:

    • Is there a reason for that? I checked this answer, but 22.5 sec is way more than the numbers shown in that answer.
    • Is there a better way to store points without duplicates?
  • Martin Smith
    Martin Smith about 6 years
    Sounds feasible but how have you come to this conclusion?
  • InBetween
    InBetween about 6 years
    @MartinSmith Because there is no other good reason: if a hash set is slow for one type and not another under the same circumstances, it has to be due to the hash code implementation of the slow type not producing enough dispersion.
  • Gilad Green
    Gilad Green about 6 years
    Looking at the reference source of Point, GetHashCode performs unchecked(x ^ y), while for string it looks much more complicated.
  • 41686d6564 stands w. Palestine
    41686d6564 stands w. Palestine about 6 years
    Hmm.. well, to check if your assumption is correct, I just tried using HashSet<long>() instead, and used list.Add(unchecked(x ^ y)); to add values to the HashSet. This was actually even faster than HashSet<string> (345 ms). Is this somehow different from what you described?
  • Gilad Green
    Gilad Green about 6 years
    @AhmedAbdelhameed - Don't know but I have looked here
  • 41686d6564 stands w. Palestine
    41686d6564 stands w. Palestine about 6 years
    @GiladGreen Me too, and the GetHashCode method does return unchecked(x ^ y), but according to the test I mentioned in my last comment, this doesn't seem to be what slows things down. Or am I missing something?
  • InBetween
    InBetween about 6 years
    @AhmedAbdelhameed That's probably because you are adding far fewer members to your hash set than you realize (again, due to the horrible dispersion of the hash code algorithm). What's the count of list when you've finished populating it?
  • Ofir Winegarten
    Ofir Winegarten about 6 years
    @AhmedAbdelhameed Your test is wrong. You are adding the same longs over and over, so there are actually only a few distinct elements being inserted. When inserting a Point, the HashSet will internally call GetHashCode and then, for each point with the same hash code, call Equals to determine whether it already exists.
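
    (For reference, a sketch of a collision-free way to pack the pair into a single long, reusing the list variable from the earlier test:)

    list.Add(((long)x << 32) | (uint)y); // a unique long per (x, y) pair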
  • 41686d6564 stands w. Palestine
    41686d6564 stands w. Palestine about 6 years
    @OfirWinegarten, oh was I so stupid! Yes, you're absolutely right. I used @InBetween's suggestion, and created my own point struct, overrode GetHashCode with the suggested hash algorithm, and it worked like a charm. Thank you all for your help!
  • 41686d6564 stands w. Palestine
    41686d6564 stands w. Palestine about 6 years
    @InBetween, Yes, I was wrong about that. Sorry for the inconvenience, your answer really helped. Thank you :)
  • Jon Hanna
    Jon Hanna about 6 years
    There's no need to implement Point when you can create a class that implements IEqualityComparer<Point>; you keep compatibility with other things that work with Point while getting the benefit of avoiding both the poor GetHashCode and the boxing in Equals().
  • InBetween
    InBetween about 6 years
    @JonHanna Correct, but Point is simply so hideous that I tend to avoid it whenever possible. From the context I'm assuming the OP only wants an immutable struct that represents coordinates (mutable types and hash sets don't mix well), so he's better off putting together his own.
  • Jon Hanna
    Jon Hanna about 6 years
    @InBetween I wouldn't bother to use Point unless I needed it for point stuff, in which case I need Point. On the plus-side, corefx has a better hash-code and an implementation of IEquatable<Point> which will hopefully move to netfx some day.
  • Akash KC
    Akash KC about 6 years
    +1 for providing the GetHashCode implementation. Just out of curiosity, how did you come up with the particular obj.X << 16 | obj.Y implementation?
  • user1703401
    user1703401 about 6 years
    It was inspired by the way the mouse passes its position in Windows. It is a perfect hash for any bitmap you'd ever want to display.
  • Akash KC
    Akash KC about 6 years
    Good to know. Is there any documentation or guideline on writing hash codes like yours? Actually, I would still like to know whether the above hash code comes from your experience or from a guideline you follow.
  • MSeifert
    MSeifert about 6 years
    @AkashKC I'm not very experienced with C#, but as far as I know integers are generally 32 bits. In this case you want the hash of 2 numbers, and by left-shifting one of them 16 bits you make sure the "lower" 16 bits of each number don't "affect" the other with |. For 3 numbers it could make sense to use 22 and 11 as shifts; for 4 numbers it would be 24, 16, 8. There will still be collisions, but only if the numbers get large. It also crucially depends on the HashSet implementation: if it used open-addressing with "bit truncation" (I don't think it does!), the left-shift approach might be bad.
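
    (A sketch of the shifting pattern just described, for three small non-negative values:)

    int hash = (a << 22) | (b << 11) | c; // assumes a, b, c each fit in 11 bits, i.e. [0, 2047]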
  • InBetween
    InBetween about 6 years
    @BensaysNotoPoliticsonSO First, you are mixing up buckets with collisions. Second, there are way more collisions than two per point: the hash code range is [0, 1023], that's 1024 distinct hashes. The number of points is 1,000,000, so on average you have 976 collisions per point. Third, due to the very low dispersion of the hash code algorithm, it's altogether possible that all 1,000,000 points are stored in the same bucket.
  • Kobi
    Kobi about 6 years
    @InBetween - You should remove the meta comment you've added to your answer. You don't need to apologize for not having the best answer. Your answer is still good.
  • Brian
    Brian about 6 years
    @AkashKC: The official documentation is at msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx, but that's primarily a reference. I recommend reading Eric Lippert's excellent guide at blogs.msdn.microsoft.com/ericlippert/2011/02/28/…
  • Akash KC
    Akash KC about 6 years
    @MSeifert: Thanks for the comment. It makes sense to me now and I understand the logic behind it. With a similar implementation, if we need to hash 6 numbers, how can it be done with this approach?
  • Krazy Glew
    Krazy Glew about 6 years
    @HansPassant: I wonder if using XOR rather than OR in GetHashCode might be slightly better - in the event that point coordinates might exceed 16 bits (perhaps not on common displays, but in the near future). // XOR is usually better in hash functions than OR, since it loses less information, is reversible, etc. // E.g. if negative coordinates are allowed, consider what happens to the X contribution if Y is negative.
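
    (A quick sketch of the negative-coordinate point: with OR, a negative y sets all the high bits and wipes out x's contribution, while XOR keeps it:)

    int x = 5, y = -1;
    Console.WriteLine((x << 16) | y); // always -1, whatever x is
    Console.WriteLine((x << 16) ^ y); // still varies with x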
  • Jim
    Jim about 6 years
    my understanding is that the best hash code shift values (except perhaps in this case) should be prime numbers?
  • Ehsan Sajjad
    Ehsan Sajjad about 6 years
    If Point is a struct, then comparing two objects would involve reflection, which can also be one of the reasons for it to be slow.
  • user1703401
    user1703401 about 6 years
    I don't disagree with Krazy. It doesn't matter for bitmaps but people might be using it for other purposes. Changed.
  • Rich
    Rich about 6 years
    @EhsanSajjad: Only when Equals is not overridden. Which it is.
  • Jean Hominal
    Jean Hominal about 6 years
    @HansPassant: If we want to consider "purposes other than bitmaps" for PointComparer.GetHashCode, should we not consider as a defect the fact that the higher 16 bits of Y are lost? E.g. write (obj.Y << 16) ^ obj.X ^ (obj.Y >> 16)
  • user1703401
    user1703401 about 6 years
    Sigh. It is no more or less of a mistake than the original Point.GetHashCode() implementation. Blindly copying somebody's custom hash function is a mistake, but it is bound to happen anyway because hashing is a black art to many programmers; all I can do is limit the damage when it happens. If you think you found a better algorithm, then just test it to make sure it is. But do keep in mind that you can never beat a perfect hash. Like this one, which is perfect because it works for bitmaps.
  • Brian
    Brian about 6 years
    @JeanHominal: The nice thing about going from OR to XOR is that it improves "purposes other than bitmaps" without hurting the performance for bitmaps. Further, preferring XOR over OR is generally good practice when implementing GetHashCode. On the other hand, your proposal has an extra operation, which negligibly reduces readability/maintainability and probably hurts performance. Given that a substantial portion of SO users will be using Point for bitmaps, Hans' choice to optimize for bitmaps first is a reasonable decision... especially since the OP never contradicted this decision.
  • ToolmakerSteve
    ToolmakerSteve over 5 years
    @HansPassant - minor nit, that probably is not significant in OP's case: for small bitmaps, using obj.X without any multiplier means poor distribution over the 32-bit hash space. Maybe that doesn't matter, given .NET's prime-length hash tables? I'm still learning; I thought it worth questioning what you did. I'd use something like return ((long)x * 805306457 + (long)y * 189783887).GetHashCode();. [CAVEAT: I haven't tested to confirm best two primes] On the other hand, your formula is the simplest, fastest hash code I've ever seen, so maybe perfect in this specific use. :)
  • ripvlan
    ripvlan over 4 years
    Excellent. Very helpful. While the MSDN HashSet documentation states that performance is dependent upon the quality of the hash algorithm (see the Remarks note in the link), it doesn't really say how to determine its quality or that you're using one of poor quality. msdn.microsoft.com/en-us/library/xfhwa508(v=vs.110).aspx