What does hash do in python?

144,362

Solution 1

A hash is an fixed sized integer that identifies a particular value. Each value needs to have its own hash, so for the same value you will get the same hash even if it's not the same object.

>>> hash("Look at me!")
4343814758193556824
>>> f = "Look at me!"
>>> hash(f)
4343814758193556824

Hash values need to be created in such a way that the resulting values are evenly distributed to reduce the number of hash collisions you get. Hash collisions are when two different values have the same hash. Therefore, relatively small changes often result in very different hashes.

>>> hash("Look at me!!")
6941904779894686356

These numbers are very useful, as they enable quick look-up of values in a large collection of values. Two examples of their use are Python's set and dict. In a list, if you want to check if a value is in the list, with if x in values:, Python needs to go through the whole list and compare x with each value in the list values. This can take a long time for a long list. In a set, Python keeps track of each hash, and when you type if x in values:, Python will get the hash-value for x, look that up in an internal structure and then only compare x with the values that have the same hash as x.

The same methodology is used for dictionary lookup. This makes lookup in set and dict very fast, while lookup in list is slow. It also means you can have non-hashable objects in a list, but not in a set or as keys in a dict. The typical example of non-hashable objects is any object that is mutable, meaning that you can change its value. If you have a mutable object it should not be hashable, as its hash then will change over its life-time, which would cause a lot of confusion, as an object could end up under the wrong hash value in a dictionary.

Note that the hash of a value only needs to be the same for one run of Python. In Python 3.3 they will in fact change for every new run of Python:

$ /opt/python33/bin/python3
Python 3.3.2 (default, Jun 17 2013, 17:49:21) 
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> hash("foo")
1849024199686380661
>>> 
$ /opt/python33/bin/python3
Python 3.3.2 (default, Jun 17 2013, 17:49:21) 
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> hash("foo")
-7416743951976404299

This is to make is harder to guess what hash value a certain string will have, which is an important security feature for web applications etc.

Hash values should therefore not be stored permanently. If you need to use hash values in a permanent way you can take a look at the more "serious" types of hashes, cryptographic hash functions, that can be used for making verifiable checksums of files etc.

Solution 2

TL;DR:

Please refer to the glossary: hash() is used as a shortcut to comparing objects, an object is deemed hashable if it can be compared to other objects. that is why we use hash(). It's also used to access dict and set elements which are implemented as resizable hash tables in CPython.

Technical considerations

  • usually comparing objects (which may involve several levels of recursion) is expensive.
  • preferably, the hash() function is an order of magnitude (or several) less expensive.
  • comparing two hashes is easier than comparing two objects, this is where the shortcut is.

If you read about how dictionaries are implemented, they use hash tables, which means deriving a key from an object is a corner stone for retrieving objects in dictionaries in O(1). That's however very dependent on your hash function to be collision-resistant. The worst case for getting an item in a dictionary is actually O(n).

On that note, mutable objects are usually not hashable. The hashable property means you can use an object as a key. If the hash value is used as a key and the contents of that same object change, then what should the hash function return? Is it the same key or a different one? It depends on how you define your hash function.

Learning by example:

Imagine we have this class:

>>> class Person(object):
...     def __init__(self, name, ssn, address):
...         self.name = name
...         self.ssn = ssn
...         self.address = address
...     def __hash__(self):
...         return hash(self.ssn)
...     def __eq__(self, other):
...         return self.ssn == other.ssn
... 

Please note: this is all based on the assumption that the SSN never changes for an individual (don't even know where to actually verify that fact from authoritative source).

And we have Bob:

>>> bob = Person('bob', '1111-222-333', None)

Bob goes to see a judge to change his name:

>>> jim = Person('jim bo', '1111-222-333', 'sf bay area')

This is what we know:

>>> bob == jim
True

But these are two different objects with different memory allocated, just like two different records of the same person:

>>> bob is jim
False

Now comes the part where hash() is handy:

>>> dmv_appointments = {}
>>> dmv_appointments[bob] = 'tomorrow'

Guess what:

>>> dmv_appointments[jim] #?
'tomorrow'

From two different records you are able to access the same information. Now try this:

>>> dmv_appointments[hash(jim)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 9, in __eq__
AttributeError: 'int' object has no attribute 'ssn'
>>> hash(jim) == hash(hash(jim))
True

What just happened? That's a collision. Because hash(jim) == hash(hash(jim)) which are both integers btw, we need to compare the input of __getitem__ with all items that collide. The builtin int does not have an ssn attribute so it trips.

>>> del Person.__eq__
>>> dmv_appointments[bob]
'tomorrow'
>>> dmv_appointments[jim]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: <__main__.Person object at 0x7f611bd37110>

In this last example, I show that even with a collision, the comparison is performed, the objects are no longer equal, which means it successfully raises a KeyError.

Solution 3

The Python docs for hash() state:

Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup.

Python dictionaries are implemented as hash tables. So any time you use a dictionary, hash() is called on the keys that you pass in for assignment, or look-up.

Additionally, the docs for the dict type state:

Values that are not hashable, that is, values containing lists, dictionaries or other mutable types (that are compared by value rather than by object identity) may not be used as keys.

Solution 4

The hash is used by dictionaries and sets to quickly look up the object. A good starting point is Wikipedia's article on hash tables.

Share:
144,362

Related videos on Youtube

Roman
Author by

Roman

Updated on November 06, 2020

Comments

  • Roman
    Roman over 3 years

    I saw an example of code that where hash function is applied to a tuple. As a result it returns a negative integer. I wonder what does this function do? Google does not help. I found a page that explains how hash is calculated but it does not explain why we need this function.

    • TerryA
      TerryA almost 11 years
      Did you look at the docs...
    • tailor_raj
      tailor_raj almost 11 years
      go to this link (official documentation). It specifies everything. go to link!
    • dnozay
      dnozay almost 11 years
      I like that the question is not a repeat of "what is it" but a "why we need it".
    • Rasmi Ranjan Nayak
      Rasmi Ranjan Nayak over 3 years
      official link is very confusing
  • Matthias
    Matthias almost 11 years
    Concerning potential hash collisions: hash(-1) == hash(-2) (runnin Python 2.7)
  • PaulDong
    PaulDong over 7 years
    Really handy explanation. As a novice, this helped me figure out how to create class that can be put into sets and use them as keys for dictionary / hash table. Also if I do collection[hashable_obj] = hashable_obj, I could get a pointer to that instance later on. But do tell me if there's a better way to keep track of such collections.
  • overexchange
    overexchange about 7 years
    @dnozay But, still, output of hash() is a fixed sized integer, that can cause collision
  • Jet Blue
    Jet Blue over 6 years
    Can someone elaborate on the use of __eq__ in the example above. Is it called by the dictionary when it is trying to compare the key it receives with all the keys it has? Such that by del the __eq__ method in the last example, the dictionary has nothing to call to use to determine the equivalency of the key it has received with the keys it has?
  • The_Martian
    The_Martian over 6 years
    I am running Python 3.6.1 and collision exists.
  • WloHu
    WloHu almost 6 years
    @JetBlue The "collosion" explaination is incomplete in the example with key hash(jim). Person.__eq__ is called because existing key has the same hash as hash(jim) to assure that this the right key Person.__eq__ is used. It errs because it assumes that other, which is int, has a ssn attribute. If hash(jim) key didn't exist in the dictionary __eq__ wouldn't be called. This explains when key lookup can be O(n): when all items have the same hash __eq__ must be used on all of them, e.g. in the case when key doesn't exist.
  • Alexis
    Alexis almost 6 years
    Although I understand the pedagogical interest of your example, wouldn't it be simpler to just write dmv_appointments[bob.ssn] = 'tomorrow', removing the need to define a __hash__ method? I understand that adds 4 characters for every appointment you write and read, but it seems clearer to me.
  • Chris Conlan
    Chris Conlan almost 4 years
    hash(-1) == hash(-2) still exists today. Fortunately, it doesn't adversely affect dictionary and set lookups. All other integers i resolve to themselves for hash(i) except -1.