Why is HashSet<Point> so much slower than HashSet<string>?
Solution 1
There are two perf problems induced by the Point struct, something you can see by adding Console.WriteLine(GC.CollectionCount(0)); to the test code. You'll see that the Point test requires ~3720 collections while the string test only needs ~18. That isn't free. When you see a value type induce that many collections, you need to conclude "uh-oh, too much boxing".
At issue is that HashSet<T> needs an IEqualityComparer<T> to get its job done. Since you did not provide one, it falls back to the one returned by EqualityComparer<T>.Default. That comparer does a good job for string, which implements IEquatable<string>. But not for Point: it is a type that dates back to .NET 1.0 and never got the generics love. All the default comparer can do is use the Object methods, which boxes the struct on every call.
The other issue is that Point.GetHashCode() does not do a stellar job in this test: too many collisions, so it hammers Object.Equals() pretty heavily. String, by contrast, has an excellent GetHashCode implementation.
You can solve both problems by providing the HashSet with a good comparer. Like this one:
class PointComparer : IEqualityComparer<Point> {
    public bool Equals(Point x, Point y) {
        return x.X == y.X && x.Y == y.Y;
    }

    public int GetHashCode(Point obj) {
        // Perfect hash for practical bitmaps, their width/height is never >= 65536
        return (obj.Y << 16) ^ obj.X;
    }
}
And use it:
HashSet<Point> list = new HashSet<Point>(new PointComparer());
And it is now about 150 times faster, easily beating the string test.
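If you want to verify the numbers on your own machine, a timing harness along these lines can be used. This is only a sketch: the ComparerBenchmark and Fill names are made up, the Bitmap is omitted since only the coordinate loop matters, and exact timings will vary by machine.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Drawing;

// Same comparer as in the answer, repeated here so the sketch compiles on its own.
class PointComparer : IEqualityComparer<Point> {
    public bool Equals(Point x, Point y) => x.X == y.X && x.Y == y.Y;
    public int GetHashCode(Point obj) => (obj.Y << 16) ^ obj.X;
}

class ComparerBenchmark {
    // Fills the set with every point of a 1000x1000 grid and reports elapsed time.
    public static TimeSpan Fill(HashSet<Point> set) {
        var sw = Stopwatch.StartNew();
        for (int x = 0; x < 1000; x++)
            for (int y = 0; y < 1000; y++)
                set.Add(new Point(x, y));
        sw.Stop();
        return sw.Elapsed;
    }

    static void Main() {
        var boxed = new HashSet<Point>();                    // default comparer: boxing + collisions
        var fast  = new HashSet<Point>(new PointComparer()); // custom comparer
        Console.WriteLine($"default: {Fill(boxed)}, custom: {Fill(fast)}");
    }
}
```

Running both fills back to back makes the boxing overhead of the default comparer directly visible.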
Solution 2
The main reason for the performance drop is all the boxing going on (as already explained in Hans Passant's answer).
Apart from that, the hash code algorithm worsens the problem because it causes more calls to Equals(object obj), thus increasing the number of boxing conversions.
Also note that the hash code of Point is computed by x ^ y. This produces very little dispersion in your data range, and therefore the buckets of the HashSet are overpopulated, something that doesn't happen with string, where the dispersion of the hashes is much larger.
You can solve that problem by implementing your own Point struct (trivial) and using a better hash algorithm for your expected data range, e.g. by shifting the coordinates: (x << 16) ^ y
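The difference in dispersion is easy to measure by counting distinct hash values over the 1000x1000 grid from the question: x ^ y can only produce values in [0, 1023], while the shifted form gives every point a unique hash in this range. A quick sketch (the HashDispersion name is made up here):

```csharp
using System;
using System.Collections.Generic;

class HashDispersion {
    // Counts distinct hash values produced by a hash function over a 1000x1000 grid.
    public static int DistinctHashes(Func<int, int, int> hash) {
        var seen = new HashSet<int>();
        for (int x = 0; x < 1000; x++)
            for (int y = 0; y < 1000; y++)
                seen.Add(hash(x, y));
        return seen.Count;
    }

    static void Main() {
        // Point's built-in hash: only 1024 distinct values for 1,000,000 points.
        Console.WriteLine(DistinctHashes((x, y) => x ^ y));
        // Shifted hash: every point gets a unique value in this range.
        Console.WriteLine(DistinctHashes((x, y) => (x << 16) ^ y));
    }
}
```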
For some good advice when it comes to hash codes, read Eric Lippert's blog post on the subject.
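Such a struct could look like the following sketch (the name PixelPoint is made up, and the hash assumes coordinates stay below 65536). Implementing IEquatable<T> is what lets EqualityComparer<T>.Default compare without boxing:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical immutable replacement for System.Drawing.Point.
public readonly struct PixelPoint : IEquatable<PixelPoint> {
    public int X { get; }
    public int Y { get; }

    public PixelPoint(int x, int y) { X = x; Y = y; }

    // Strongly-typed equality: no boxing when called through
    // EqualityComparer<PixelPoint>.Default.
    public bool Equals(PixelPoint other) => X == other.X && Y == other.Y;

    public override bool Equals(object obj) => obj is PixelPoint p && Equals(p);

    // Good dispersion for coordinates below 65536.
    public override int GetHashCode() => (X << 16) ^ Y;
}
```

With this struct, a HashSet<PixelPoint> needs no custom comparer at all.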
41686d6564 stands w. Palestine
Updated on September 07, 2022

Comments
-
41686d6564 stands w. Palestine about 1 year
I wanted to store some pixel locations without allowing duplicates, so the first thing that comes to mind is HashSet<Point> or a similar class. However, this seems to be very slow compared to something like HashSet<string>. For example, this code:
HashSet<Point> points = new HashSet<Point>();
using (Bitmap img = new Bitmap(1000, 1000)) {
    for (int x = 0; x < img.Width; x++) {
        for (int y = 0; y < img.Height; y++) {
            points.Add(new Point(x, y));
        }
    }
}
takes about 22.5 seconds.
While the following code (which is not a good choice for obvious reasons) takes only 1.6 seconds:
HashSet<string> points = new HashSet<string>();
using (Bitmap img = new Bitmap(1000, 1000)) {
    for (int x = 0; x < img.Width; x++) {
        for (int y = 0; y < img.Height; y++) {
            points.Add(x + "," + y);
        }
    }
}
So, my questions are:
- Is there a reason for that? I checked this answer, but 22.5 sec is way more than the numbers shown in that answer.
- Is there a better way to store points without duplicates?
-
Kobi about 6 years Similar question (by me): Why are HashSets of structs with nullable values incredibly slow?
-
Ivan Yurchenko about 4 years What are these "obvious reasons" for not using concatenated strings? What is the better way of doing it if I don't want to implement my own IEqualityComparer?
-
Martin Smith about 6 years Sounds feasible, but how have you come to this conclusion?
-
InBetween about 6 years @MartinSmith Because there is no other good explanation: if a hash set is slow for one type and not another under the same circumstances, then it has to be due to the hash code implementation of the slow type not producing enough dispersion.
-
Gilad Green about 6 years Looking at the reference source of Point, the GetHashCode performs unchecked(x ^ y), while for string it looks much more complicated..
-
41686d6564 stands w. Palestine about 6 years Hmm.. well, to check if your assumption is correct, I just tried using HashSet<long>() instead, and used list.Add(unchecked(x ^ y)); to add values to the HashSet. This was actually even faster than HashSet<string> (345 ms). Is this somehow different from what you described?
-
Gilad Green about 6 years @AhmedAbdelhameed - Don't know, but I have looked here
-
41686d6564 stands w. Palestine about 6 years @GiladGreen Me too, and the GetHashCode method does return unchecked(x ^ y), but according to the test I mentioned in my last comment, this doesn't seem to be what slows things down. Or am I missing something?
-
InBetween about 6 years @AhmedAbdelhameed that's probably because you are adding way fewer members to your hash set than you realize (again due to the horrible dispersion of the hash code algorithm). What's the count of list when you've finished populating it?
-
Ofir Winegarten about 6 years @AhmedAbdelhameed Your test is wrong. You are adding the same longs over and over, so there are actually just a few elements you are inserting. When inserting a Point, the HashSet will internally call GetHashCode and, for each of those points with the same hash code, will call Equals to determine whether it already exists.
-
41686d6564 stands w. Palestine about 6 years @OfirWinegarten, oh, was I so stupid! Yes, you're absolutely right. I used @InBetween's suggestion and created my own point struct, overrode GetHashCode with the suggested hash algorithm, and it worked like a charm. Thank you all for your help!
-
41686d6564 stands w. Palestine about 6 years @InBetween, yes, I was wrong about that. Sorry for the inconvenience; your answer really helped. Thank you :)
-
Jon Hanna about 6 years There's no need to implement Point when you can create a class that implements IEqualityComparer<Point> and keep compatibility with other things that work with Point, while getting the benefit of not having the poor GetHashCode and the need to box in Equals().
-
InBetween about 6 years @JonHanna correct, but Point is simply so hideous that I tend to avoid it whenever possible. From the context, I'm assuming the OP only wants an immutable struct that represents coordinates (mutable types and hash sets don't mix well), so he's better off putting together his own.
-
Jon Hanna about 6 years @InBetween I wouldn't bother to use Point unless I needed it for point stuff, in which case I need Point. On the plus side, corefx has a better hash code and an implementation of IEquatable<Point>, which will hopefully move to netfx some day.
-
Akash KC about 6 years +1 for providing the GetHashCode method implementation. Just out of curiosity, how did you come up with the particular obj.X << 16 | obj.Y; implementation?
-
user1703401 about 6 years It was inspired by the way the mouse passes its position in Windows. It is a perfect hash for any bitmap you'd ever want to display.
-
Akash KC about 6 years Good to know that. Any documentation or guideline for writing hash codes like yours? Actually, I'd still like to know whether the above hash code comes from your experience or from a guideline you follow.
-
MSeifert about 6 years @AkashKC I'm not very experienced with C#, but as far as I know integers are generally 32 bits. In this case you want the hash of 2 numbers, and by left-shifting one 16 bits you make sure the "lower" 16 bits of each number don't "affect" the other with |. For 3 numbers it could make sense to use 22 and 11 as shifts. For 4 numbers it would be 24, 16, 8. However, there will still be collisions, but only if the numbers get large. It also crucially depends on the HashSet implementation. If it uses open-addressing with "bit truncation" (I don't think it does!) the left-shift approach might be bad.
-
InBetween about 6 years @BensaysNotoPoliticsonSO First, you are mixing up buckets with collisions. Second, there are way more collisions than two per point: the hash code range is [0, 1023], that's 1024 distinct hashes. The number of points is 1,000,000, so on average you have 976 collisions per point. Third, due to the very low dispersion of the hash code algorithm, it's altogether possible that all 1,000,000 points are stored in the same bucket.
-
Kobi about 6 years @InBetween - You should remove the meta comment you've added to your answer. You don't need to apologize for not having the best answer. Your answer is still good.
-
Brian about 6 years @AkashKC: The official documentation is at msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx, but that's primarily a reference. I recommend reading Eric Lippert's excellent guide at blogs.msdn.microsoft.com/ericlippert/2011/02/28/… .
-
Akash KC about 6 years @MSeifert: Thanks for the comment. Now it's making sense to me and I understand the logic behind it. With a similar implementation, if we need a hash of 6 numbers, how can it be done with this approach?
-
Krazy Glew about 6 years @HansPassant: I wonder if using XOR rather than OR in GetHashCode might be slightly better - in the event that point coordinates might exceed 16 bits (perhaps not on common displays, but in the near future). // XOR is usually better in hash functions than OR, since it loses less information, is reversible, etc. // e.g. If negative coordinates are allowed, consider what happens to the X contribution if Y is negative.
-
Jim about 6 years My understanding is that the best hash code shift values (except perhaps in this case) should be prime numbers?
-
Ehsan Sajjad about 6 years If Point is a struct, then comparing two objects would involve reflection, which can also be one of the reasons for it to be slow
-
user1703401 about 6 years I don't disagree with Krazy. It doesn't matter for bitmaps, but people might be using it for other purposes. Changed.
-
Rich about 6 years @EhsanSajjad: Only when Equals is not overridden. Which it is.
-
Jean Hominal about 6 years @HansPassant: If we want to consider "purposes other than bitmaps" for PointComparer.GetHashCode, should we not consider it a defect that the higher 16 bits of Y are lost? E.g. write (obj.Y << 16) ^ obj.X ^ (obj.Y >> 16)
-
user1703401 about 6 years Sigh. It is no more or less of a mistake than the original Point.GetHashCode() implementation. Blindly copying somebody's custom hash function is a mistake, but it is bound to happen anyway because hashing is a black art to many programmers; all I can do is limit the damage when it happens. If you think you found a better algorithm, then just test it to make sure it is. But do keep in mind that you can never beat a perfect hash. Like this one, perfect because it works with bitmaps.
-
Brian about 6 years @JeanHominal: The nice thing about going from OR to XOR is that it improves "purposes other than bitmaps" without hurting the performance for bitmaps. Further, preferring XOR over OR is generally good practice when implementing GetHashCode. On the other hand, your proposal has an extra operation. This negligibly reduces readability/maintainability and probably hurts performance. Given that a substantial portion of SO users will be using Point for bitmaps, Hans' choice to optimize for bitmaps first is a reasonable decision... especially since the OP never contradicted this decision.
-
ToolmakerSteve over 5 years @HansPassant - minor nit that probably is not significant in the OP's case: for small bitmaps, using obj.X without any multiplier means poor distribution over the 32-bit hash space. Maybe that doesn't matter, given .NET's prime-length hash tables? I'm still learning; I thought it worth questioning what you did. I'd use something like return ((long)x * 805306457 + (long)y * 189783887).GetHashCode();. [CAVEAT: I haven't tested to confirm the best two primes.] On the other hand, your formula is the simplest, fastest hash code I've ever seen, so maybe perfect in this specific use. :)
-
ripvlan over 4 years Excellent. Very helpful. While the MSDN HashSet documentation states that performance is dependent upon the quality of the hash algorithm (see the Remarks note in the link), it doesn't really say how to determine its quality or whether you're using one of poor quality. msdn.microsoft.com/en-us/library/xfhwa508(v=vs.110).aspx