How to implement a good hash function in python

python hash

95,128

Solution 1

__hash__ should return the same value for objects that are equal. It also shouldn't change over the lifetime of the object; generally you only implement it for immutable objects.

A trivial implementation would be to just return 0. This is always correct, but performs badly.

Your solution, returning the hash of a tuple of properties, is good. But note that you don't need to list all properties that you compare in __eq__ in the tuple. If some property usually has the same value for inequal objects, just leave it out. Don't make the hash computation any more expensive than it needs to be.

Edit: I would recommend against using xor to mix hashes in general. When two different properties have the same value, they will have the same hash, and with xor these will cancel eachother out. Tuples use a more complex calculation to mix hashes, see tuplehash in tupleobject.c.

Solution 2

Documentation for object.__hash__(self)

The only required property is that objects which compare equal have the same hash value; it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple. Example

def __hash__(self):
    return hash((self.name, self.nick, self.color))

Solution 3

It's dangerous to write

def __eq__(self, other):
  return other and self.a == other.a and self.b == other.b

because if your rhs (i.e., other) object evaluates to boolean False, it will never compare as equal to anything!

In addition, you might want to double check if other belongs to the class or subclass of AClass. If it doesn't, you'll either get exception AttributeError or a false positive (if the other class happens to have the same-named attributes with matching values). So I would recommend to rewrite __eq__ as:

def __eq__(self, other):
  return isinstance(other, self.__class__) and self.a == other.a and self.b == other.b

If by any chance you want an unusually flexible comparison, which compares across unrelated classes as long as attributes match by name, you'd still want to at least avoid AttributeError and check that other doesn't have any additional attributes. How you do it depends on the situation (since there's no standard way to find all attributes of an object).

95,128

user2756501

Master geek, developer, avid reader and one of the minds behind novlet.com and bitlet.org :P@abahgat

Updated on July 08, 2022

Comments

user2756501 almost 2 years
When implementing a class with multiple properties (like in the toy example below), what is the best way to handle hashing?

I guess that the __eq__ and __hash__ should be consistent, but how to implement a proper hash function that is capable of handling all the properties?
```
class AClass:
  def __init__(self):
      self.a = None
      self.b = None

  def __eq__(self, other):
      return other and self.a == other.a and self.b == other.b

  def __ne__(self, other):
    return not self.__eq__(other)

  def __hash__(self):
      return hash((self.a, self.b))
```
I read on this question that tuples are hashable, so I was wondering if something like the example above was sensible. Is it?
- Feuermurmel about 10 years
  
  Just make sure to use hash() on a tuple with exactly the elements that are compared in __eq__() and friends (exactly as you did) and you're good to go.
Björn Pollex over 13 years

As you said hash functions usually only make sense for immutable objects. Hence it is possible to calculate the hash-value once in __init__.
Scott Griffiths over 13 years

+1 for the return 0 hash function - I've always thought that anything else is premature optimisation :-). (I'm only half kidding).
eigenein about 13 years

It will work, but it's bad that if you exchange self.a and self.b then you'll get the same hash while it will be the other "object".
user1066101 about 13 years

"somehow mix together (e.g. using exclusive or" is a pretty flexible set of requirements. If it actually matters, then (hash(self.a)<<1) ^ hash(self.b) might be better. There's no general answer, just a general guideline that has to be modified based on the specific application.
max about 12 years

@BjörnPollex Rather than doing it in __init__, you can just cache the value in __hash__. That way if __hash__ is never called, you didn't waste either time or memory. I assume checking whether the value has been already cached isn't expensive is it? (Not sure if it's best through exception or explicit if).
Fred Foo almost 12 years

It's unfortunate that Python does not make a combine_hashes function available.
Aaron Barclay almost 11 years

Your first statement seems incorrect. "It also shouldn't change over the lifetime of the object; generally you only implement it for immutable objects." It is implemented in mutable objects as well no?
javawizard almost 11 years

It's not implemented in things like dict or list, the justification being that changing the hash of an object that already belongs to, e.g., a set wreaks havoc on the set's internal data structures.
max about 9 years

@eigenein if many cases, it's an advantage that the hash is unchanged when the order is changed. if you try to hash a dict or set, the hash that depends on the order of iteration is invalid. OTOH, the hash that simply causes extra collisions once in a while is, at worst, inefficient.
nightpool over 8 years

why not just hash a tuple value? hash((self.a, self.b))
Carl Meyer almost 8 years

@AaronBarclay in Python 2, all user-defined classes inherit the default object.__hash__ by default, so they are always hashable, but frequently break the hash-equality invariant if they define a custom __eq__. So in Py2 if you define __eq__, you should really always either set __hash__ = None (if instances are mutable) or define __hash__ (if immutable). In Py3 defining __eq__ automatically sets __hash__ to None, so just avoid defining __hash__ unless instances are immutable. See asmeurer.github.io/blog/posts/…
PM 2Ring over 7 years

Note that (fortunately) the suggestion to use xor is no longer present in the Python 3 or Python 2 docs.
AXO almost 7 years

For those whom are interested, here is the bug that led to removal of XOR recommendation: bugs.python.org/issue28383
fersarr over 6 years

@axo we should update this answer with the hash of tuples
Mad Physicist over 6 years

Useful info, but unrelated to the primary question about hashing.
Leo Ufimtsev about 5 years

Not related, but thank you for posting this never the less. +1.
Zaya almost 5 years

@AXO, thanks for providing the link! Very interesting stuff.
ShadowRanger over 4 years

@FredFoo: But they do. That's what hash called on a tuple does, combine the hashes of the values in the tuple. The cost of making a small tuple is trivial (Python optimizes tuples specially since they're created implicitly in many different ways to support non-tuple-specific functionality), and calling hash on it combines the hashes of its component parts. Any reasonable implementation of a combine_hashes function would do the same thing, constructing a tuple of the arguments (built-ins receive their arguments as a tuple implicitly), then calling hash on it.
ShadowRanger over 4 years

@FredFoo: combine_hashes would actually be more complex than that, because if you're going to expose such a function, you'd need to make an order-insensitive version as well (or allow passing an argument to make it an order-insensitive hash). Python's solution for that case is to just do hash(frozenset({attributes, to, hash})), which gets the canonical order-insensitive hash.
ShadowRanger over 4 years

This is a bad __eq__ implementation, as it does not delegate to the right hand side's __eq__ if the left hand does not know how to do the comparison. If int had a __eq__ like this, 1 == MyNumericType(1) would always be False, even if MyNumericType(1) == 1 would return True. If you don't recognize the type of other, always return NotImplemented, don't just return False.
max over 4 years

@ShadowRanger Agreed.
Chris Graf over 3 years

Could someone elaborate on return 0 being always correct, but performing badly? How can this always be correct? Shouldn't hash values be different for different objects? What about the performance part? Why does it perform badly?