Why is HashSet<int> much slower than HashSet<String> in dart?
I commented on the issue in the Dart repo. For completeness I will mention the 'answer' part of the comment here.
The implementations of HashSet
and LinkedHashSet
make the assumption that the key.hashCode
values are 'good' hash codes that are reasonably distributed over a range of integers so that the lower N bits do not collide or nearly collide to 'bunch up' in the hash-table. Unfortunately int.hashCode
does not have this property as it is effectively the identity function.
Things go wrong when the lower bits of all the keys are the same (or have only a few of the possible values) so taking the lower N bits gives the same effective hash code value. This is just the power-of-two version of the % 1000
example mentioned by @ch271828n.
@ch271828n mentions using a different hashCode. This is probably the best short-term solution. Use LinkedHashSet(hashCode: dispersedHashCode)
with something like this:
int dispersedHashCode(e) { // untested!
int hash = e.hashCode;
// Odd number with 30%-50% of the bits set in an irregular pattern.
hash *= 0x1736B4D29;
hash += hash >>> 20;
// maybe do it again to let bits higher that 20 influence the low bits.
return hash;
}
Something like this would ideally be built into the core library hashed structures. This might take a long time since, realistically, a performance issue with a simple work-around will be likely be prioritized behind security bugs, incorrect behaviour bugs, performance issues with no work-around, and new features that enable customers to do things that are otherwise impossible to difficult to do.
A completely different approach would be to use an ordered Set like SplayTreeSet
.
Comments
-
WSBT over 1 year
I noticed a
HashSet<int>
performing very slowly when working on a Flutter project. I had about 20,000 integers in aSet
, and checkingset.contains()
took a very long time. But when I usetoString()
to convert all items to string, it performed 1000x faster.I then tried to create a minimal reproducible code with 10 million random integers, but I couldn't reproduce the issue. Turns out, something special about these data caused the extreme slowness. I've attached a test code (and data) at the end of this question.
How to reproduce:
First, click "add int" button to add 14790 integers to a set. Then click "query int" (runs
set.contains(123)
) and "query string" (runsset.contains('123')
). Observe that: 1. both operations are super slow; 2. the int version is slower than the string version. Picture:Then click "clear items", then "add string" to add the
toString()
version of the data. Then click "query int" and "query string" again, notice how much faster it becomes. Picture:Lastly, click both "add int" and "add string" to create a mixed set (with twice the entries). Observe that the querying times dropped in half for the
int
version, as if the faster strings helped "dilute" the problem. Picture:I've had several friends running the same test code on various machines (intel i5, apple M1, snapdragon), timings are different but the conclusions are the same.
What's not the answer here:
Here are some things I considered, but they couldn't explain what's happening with some more tests.
- Maybe
int
needs boxing, whereasstring
is already an object?
That's probably not the issue here. With 1 million randomly generated values, ints performed faster than strings.
-
string
is immutable so their hash value could be cached?
I don't know if they are cached, but this doesn't explain the results observed with 1 million randomly generated values.
-
int
hash resulted in a lot of collisions?
I tried to print out
.hashCode
for all ints and strings in the data set, and verified they are all unique.Test code:
The full test code with data is too long for StackOverflow, I've put it here
https://pastebin.com/raw/4fm2hKQB
instead.So yeah, I'm lost, if anyone could help me understand what's going on that'll be greatly appreciated!
-
jamesdlin about 2 years1. It would have been easier if you made a console Dart program; that would make the code ~20 lines instead of ~100 lines and would be easier to run. 2. FWIW, when I test it (both with the Dart VM and AOT-compiled to executable), the problem seems to be specifically with
LinkedHashSet<int>
.HashSet<int>
andSplayTreeSet<int>
both seem much faster. (HashSet<String>
also is a bit faster thanLinkedHashSet<String>
, and I observe almost no difference betweenHashSet<int>
andHashSet<String>
. 3. You probably should just report an issue on github.com/dart-lang/sdk/issues. -
WSBT about 2 years@jamesdlin Thank you for helping. I didn't create an issue because I wasn't sure if it's my lack of understanding of how hashset works, so I thought to ask on SO first. Great point on LinkedHashSet, I didn't know that's what's behind the
{}
literal in dart. When I don't need items to be ordered, I'll usenew HashSet()
more often in the future. Although in this case, nothing beats the speed of using Strings in LinkedHashSet, hmm.. -
WSBT about 2 years@jamesdlin I've opened an issue github.com/dart-lang/sdk/issues/48641 thanks again! (And thank you for your edits here: stackoverflow.com/a/71595394/1032613 really appreciate it!)
-
ch271828n about 2 yearsFor future readers: github.com/dart-lang/sdk/issues/48641#issuecomment-1076187455 seems to be a great dart console program which is much easier to test and debug.
- Maybe
-
WSBT about 2 yearsThat's a fair point, "no collision in hashcode" does not guarantee "no collision after binning". This suggests
LinkedHashSet
uses a different binning algorithm thanHashSet
, which is a bit unusual to me. I'd love to find out the exact algorithms used. Do you know where we can find the source code for both data structures? -
ch271828n about 2 years@user1032613 What about this: github.com/dart-lang/sdk/search?l=C%2B%2B&q=LinkedHashSet (or this link cs.github.com/dart-lang/… if you have access to code-search)
-
ch271828n about 2 years@user1032613 how is it going? curious about the results
-
WSBT about 2 yearsI'm curious too but don't have the proper tools to look further into it (that's why I asked a question). Hopefully someone will find the relevant source code and give us an explanation soon.
-
ch271828n about 2 years@user1032613 The
LinkedHashSet(hashCode: (e) => some_other_hash_approach(e), equals: ...)
method should be quite easy to experiment without needing any extra tools -
WSBT about 2 yearsThank you for the explanation, I've tested multiple workarounds including
SplayTreeSet
but with my data,toString
in a hash set is the fastest, probably due to a smaller overhead when compared to maintaining a binary tree. You mentionedint.hashCode
is the int itself, may I ask how the String hashCode is computed please? -
jamesdlin about 2 yearsYou also should explain why the hash colliisons are so much worse with
LinkedHashSet
than withHashSet
.