Fast hash function for `std::vector`

11,015

Solution 1

NOTE: As per the comments, you get a 25-50x speed-up by compiling with optimizations. Do that, first. Then, if it's still too slow, see below.


I don't think there's much you can do. You have to touch all the elements, and that combination function is about as fast as it gets.

One option may be to parallelize the hash function. If you have 8 cores, you can run 8 threads to each hash 1/8th of the vector, then combine the 8 resulting values at the end. The synchronization overhead may be worth it for very large vectors.

Solution 2

The approach that MSVC's old hashmap used was to sample less often.

This means that isolated changes won't show up in your hash, but the thing you are trying to avoid is reading and processing the entire 80 mb of data in order to hash your vector. Not reading some characters is pretty unavoidable.

The second thing you should do is not specialize std::hash on all vectors, this may make your program ill-formed (as suggested by a defect resolution whose status I do not recall), and at the least is a bad plan (as the std is sure to permit itself to add hash combining and hashing of vectors).

When I write a custom hash, I usually use ADL (Koenig Lookup) to make it easy to extend.

namespace my_utils {
  namespace hash_impl {
    namespace details {
      namespace adl {
        template<class T>
        std::size_t hash(T const& t) {
          return std::hash<T>{}(t);
        }
      }
      template<class T>
      std::size_t hasher(T const& t) {
        using adl::hash;
        return hash(t);
      }
    }
    struct hash_tag {};
    template<class T>
    std::size_t hash(hash_tag, T const& t) {
      return details::hasher(t);
    }
    template<class T>
    std::size_t hash_combine(hash_tag, std::size_t seed, T const& t) {
      seed ^= hash(t) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    template<class Container>
    std::size_t fash_hash_random_container(hash_tag, Container const& c ) {
      std::size_t size = c.size();
      std::size_t stride = 1 + size/10;
      std::size_t r = hash(hash_tag{}, size);
      for(std::size_t i = 0; i < size; i += stride) {
        r = hash_combine(hash_tag{}, r, c.data()[i])
      }
      return r;
    }
    // std specializations go here:
    template<class T, class A>
    std::size_t hash(hash_tag, std::vector<T,A> const& v) {
      return fash_hash_random_container(hash_tag{}, v);
    }
    template<class T, std::size_t N>
    std::size_t hash(hash_tag, std::array<T,N> const& a) {
      return fash_hash_random_container(hash_tag{}, a);
    }
    // etc
  }
  struct my_hasher {
    template<class T>
    std::size_t operator()(T const& t)const {
      return hash_impl::hash(hash_impl::hash_tag{}, t);
    }
  };
}

now my_hasher is a universal hasher. It uses either hashes declared in my_utils::hash_impl (for std types), or free functions called hash that will hash a given type, to hash things. Failing that, it tries to use std::hash<T>. If that fails, you get a compile-time error.

Writing a free hash function in the namespace of the type you want to hash tends to be less annoying than having to go off and open std and specialize std::hash in my experience.

It understands vectors and arrays, recursively. Doing tuples and pairs requires a bit more work.

It samples said vectors and arrays at about 10 times.

(Note: hash_tag is both a bit of a joke, and a way to force ADL and prevent having to forward-declare the hash specializations in the hash_impl namespace, because that requirement sucks.)

The price of sampling is that you could get more collisions.


Another approach if you have a huge amount of data is to hash them once, and keep track of when they are modified. To do this approach, use a copy-on-write monad interface for your type that keeps track of if the hash is up to date. Now a vector gets hashed once; if you modify it, the hash is discarded.

One can go futher and have a random-access hash (where it is easy to predict what happens when you edit a given value hash-wise), and mediate all access to the vector. That is tricky.

You could also multi-thread the hashing, but I would guess that your code is probably memory-bandwidth bound, and multi-threading won't help much there. Worth trying.

You could use a fancier structure than a flat vector (something tree like), where changes to the values bubble-up in a hash-like way to a root hash value. This would add a lg(n) overhead to all element access. Again, you'd have to wrap the raw data up in controls that keep the hashing up to date (or, keep track of what ranges are dirty and needs to be updated).

Finally, because you are working with 10 million elements at a time, consider moving over to a strong large-scale storage solution, like databases or what have you. Using 80 megabyte keys in a map seems strange to me.

Share:
11,015
justHelloWorld
Author by

justHelloWorld

Updated on July 24, 2022

Comments

  • justHelloWorld
    justHelloWorld almost 2 years

    I implemented this solution for getting an hash value from vector<T>:

    namespace std
    {
        template<typename T>
        struct hash<vector<T>>
        {
            typedef vector<T> argument_type;
            typedef std::size_t result_type;
            result_type operator()(argument_type const& in) const
            {
                size_t size = in.size();
                size_t seed = 0;
                for (size_t i = 0; i < size; i++)
                    //Combine the hash of the current vector with the hashes of the previous ones
                    hash_combine(seed, in[i]);
                return seed;
            }
        };
    }
    
    //using boost::hash_combine
    template <class T>
    inline void hash_combine(std::size_t& seed, T const& v)
    {
        seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    

    But this solution doesn't scale at all: with a vector<double> of 10 millions elements it's gonna take more than 2.5 s (according to VS).

    Does exists a fast hash function for this scenario?

    Notice that creating an hash value from the vector reference is not a feasible solution, since the related unordred_map will be used in different runs and in addition two vector<double> with the same content but different addresses will be mapped differently (undesired behavior for this application).

  • justHelloWorld
    justHelloWorld about 8 years
    Ok, now with the release configuration activated it takes "only" 117 ms. But if we have to do a re-hash of millions of keys (for example because of a bad load factor in the unordered_map) this is still an unacceptable solution. So I think that I'll try to improve it even more with your parallel solution! Thanks!