Why is HashSet<int> much slower than HashSet<String> in dart?

256

I commented on the issue in the Dart repo. For completeness I will mention the 'answer' part of the comment here.

The implementations of HashSet and LinkedHashSet make the assumption that the key.hashCode values are 'good' hash codes that are reasonably distributed over a range of integers so that the lower N bits do not collide or nearly collide to 'bunch up' in the hash-table. Unfortunately int.hashCode does not have this property as it is effectively the identity function.

Things go wrong when the lower bits of all the keys are the same (or have only a few of the possible values) so taking the lower N bits gives the same effective hash code value. This is just the power-of-two version of the % 1000 example mentioned by @ch271828n.

@ch271828n mentions using a different hashCode. This is probably the best short-term solution. Use LinkedHashSet(hashCode: dispersedHashCode) with something like this:

int dispersedHashCode(e) { // untested!
  int hash = e.hashCode;
  // Odd number with 30%-50% of the bits set in an irregular pattern.
  hash *= 0x1736B4D29;
  hash += hash >>> 20;
  // maybe do it again to let bits higher that 20 influence the low bits.
  return hash;
}

Something like this would ideally be built into the core library hashed structures. This might take a long time since, realistically, a performance issue with a simple work-around will be likely be prioritized behind security bugs, incorrect behaviour bugs, performance issues with no work-around, and new features that enable customers to do things that are otherwise impossible to difficult to do.

A completely different approach would be to use an ordered Set like SplayTreeSet.

Share:
256
WSBT
Author by

WSBT

A confused human trying to understand computers.

Updated on January 04, 2023

Comments

  • WSBT
    WSBT over 1 year

    I noticed a HashSet<int> performing very slowly when working on a Flutter project. I had about 20,000 integers in a Set, and checking set.contains() took a very long time. But when I use toString() to convert all items to string, it performed 1000x faster.

    I then tried to create a minimal reproducible code with 10 million random integers, but I couldn't reproduce the issue. Turns out, something special about these data caused the extreme slowness. I've attached a test code (and data) at the end of this question.

    How to reproduce:

    First, click "add int" button to add 14790 integers to a set. Then click "query int" (runs set.contains(123)) and "query string" (runs set.contains('123')). Observe that: 1. both operations are super slow; 2. the int version is slower than the string version. Picture:

    int test

    Then click "clear items", then "add string" to add the toString() version of the data. Then click "query int" and "query string" again, notice how much faster it becomes. Picture:

    string test

    Lastly, click both "add int" and "add string" to create a mixed set (with twice the entries). Observe that the querying times dropped in half for the int version, as if the faster strings helped "dilute" the problem. Picture:

    int string mix test

    I've had several friends running the same test code on various machines (intel i5, apple M1, snapdragon), timings are different but the conclusions are the same.

    What's not the answer here:

    Here are some things I considered, but they couldn't explain what's happening with some more tests.

    1. Maybe int needs boxing, whereas string is already an object?

    That's probably not the issue here. With 1 million randomly generated values, ints performed faster than strings.

    1. string is immutable so their hash value could be cached?

    I don't know if they are cached, but this doesn't explain the results observed with 1 million randomly generated values.

    1. int hash resulted in a lot of collisions?

    I tried to print out .hashCode for all ints and strings in the data set, and verified they are all unique.

    Test code:

    The full test code with data is too long for StackOverflow, I've put it here https://pastebin.com/raw/4fm2hKQB instead.

    So yeah, I'm lost, if anyone could help me understand what's going on that'll be greatly appreciated!

    • jamesdlin
      jamesdlin about 2 years
      1. It would have been easier if you made a console Dart program; that would make the code ~20 lines instead of ~100 lines and would be easier to run. 2. FWIW, when I test it (both with the Dart VM and AOT-compiled to executable), the problem seems to be specifically with LinkedHashSet<int>. HashSet<int> and SplayTreeSet<int> both seem much faster. (HashSet<String> also is a bit faster than LinkedHashSet<String>, and I observe almost no difference between HashSet<int> and HashSet<String>. 3. You probably should just report an issue on github.com/dart-lang/sdk/issues.
    • WSBT
      WSBT about 2 years
      @jamesdlin Thank you for helping. I didn't create an issue because I wasn't sure if it's my lack of understanding of how hashset works, so I thought to ask on SO first. Great point on LinkedHashSet, I didn't know that's what's behind the {} literal in dart. When I don't need items to be ordered, I'll use new HashSet() more often in the future. Although in this case, nothing beats the speed of using Strings in LinkedHashSet, hmm..
    • WSBT
      WSBT about 2 years
      @jamesdlin I've opened an issue github.com/dart-lang/sdk/issues/48641 thanks again! (And thank you for your edits here: stackoverflow.com/a/71595394/1032613 really appreciate it!)
    • ch271828n
      ch271828n about 2 years
      For future readers: github.com/dart-lang/sdk/issues/48641#issuecomment-107618745‌​5 seems to be a great dart console program which is much easier to test and debug.
  • WSBT
    WSBT about 2 years
    That's a fair point, "no collision in hashcode" does not guarantee "no collision after binning". This suggests LinkedHashSet uses a different binning algorithm than HashSet, which is a bit unusual to me. I'd love to find out the exact algorithms used. Do you know where we can find the source code for both data structures?
  • ch271828n
    ch271828n about 2 years
    @user1032613 What about this: github.com/dart-lang/sdk/search?l=C%2B%2B&q=LinkedHashSet (or this link cs.github.com/dart-lang/… if you have access to code-search)
  • ch271828n
    ch271828n about 2 years
    @user1032613 how is it going? curious about the results
  • WSBT
    WSBT about 2 years
    I'm curious too but don't have the proper tools to look further into it (that's why I asked a question). Hopefully someone will find the relevant source code and give us an explanation soon.
  • ch271828n
    ch271828n about 2 years
    @user1032613 The LinkedHashSet(hashCode: (e) => some_other_hash_approach(e), equals: ...) method should be quite easy to experiment without needing any extra tools
  • WSBT
    WSBT about 2 years
    Thank you for the explanation, I've tested multiple workarounds including SplayTreeSet but with my data, toString in a hash set is the fastest, probably due to a smaller overhead when compared to maintaining a binary tree. You mentioned int.hashCode is the int itself, may I ask how the String hashCode is computed please?
  • jamesdlin
    jamesdlin about 2 years
    You also should explain why the hash colliisons are so much worse with LinkedHashSet than with HashSet.