Hashing a dictionary?
Solution 1
If your dictionary is not nested, you could make a frozenset with the dict's items and use hash()
:
hash(frozenset(my_dict.items()))
This is much less computationally intensive than generating the JSON string or representation of the dictionary.
UPDATE: Please see the comments below, why this approach might not produce a stable result.
Solution 2
Using sorted(d.items())
isn't enough to get us a stable repr. Some of the values in d
could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:
json.dumps(d, sort_keys=True)
That said, if the hashes need to be stable across different machines or Python versions, I'm not certain that this is bulletproof. You might want to add the separators
and ensure_ascii
arguments to protect yourself from any changes to the defaults there. I'd appreciate comments.
Solution 3
EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O'Connor's significantly simpler (and faster) solution (which also works for hashing nested dictionaries).
Although an answer has been accepted, the title of the question is "Hashing a python dictionary", and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)
Nested Dictionaries
If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won't work in this case, and you'll have to implement some sort of recursive mechanism to retrieve the hash.
Here is one such mechanism:
import copy
def make_hash(o):
"""
Makes a hash from a dictionary, list, tuple or set to any level, that contains
only other hashable types (including any lists, tuples, sets, and
dictionaries).
"""
if isinstance(o, (set, tuple, list)):
return tuple([make_hash(e) for e in o])
elif not isinstance(o, dict):
return hash(o)
new_o = copy.deepcopy(o)
for k, v in new_o.items():
new_o[k] = make_hash(v)
return hash(tuple(frozenset(sorted(new_o.items()))))
Bonus: Hashing Objects and Classes
The hash()
function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:
class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789
The hash is the same, even after I've altered foo. This is because the identity of foo hasn't changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__
attribute:
class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785
Alas, when you attempt to do the same thing with the class itself:
print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'
The class __dict__
property is not a normal dictionary:
print (type(Foo.__dict__)) # type <'dict_proxy'>
Here is a similar mechanism as previous that will handle classes appropriately:
import copy
DictProxyType = type(object.__dict__)
def make_hash(o):
"""
Makes a hash from a dictionary, list, tuple or set to any level, that
contains only other hashable types (including any lists, tuples, sets, and
dictionaries). In the case where other kinds of objects (like classes) need
to be hashed, pass in a collection of object attributes that are pertinent.
For example, a class can be hashed in this fashion:
make_hash([cls.__dict__, cls.__name__])
A function can be hashed like so:
make_hash([fn.__dict__, fn.__code__])
"""
if type(o) == DictProxyType:
o2 = {}
for k, v in o.items():
if not k.startswith("__"):
o2[k] = v
o = o2
if isinstance(o, (set, tuple, list)):
return tuple([make_hash(e) for e in o])
elif not isinstance(o, dict):
return hash(o)
new_o = copy.deepcopy(o)
for k, v in new_o.items():
new_o[k] = make_hash(v)
return hash(tuple(frozenset(sorted(new_o.items()))))
You can use this to return a hash tuple of however many elements you'd like:
# -7666086133114527897
print (make_hash(func.__code__))
# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))
# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))
NOTE: all of the above code assumes Python 3.x. Did not test in earlier versions, although I assume make_hash()
will work in, say, 2.7.2. As far as making the examples work, I do know that
func.__code__
should be replaced with
func.func_code
Solution 4
The code below avoids using the Python hash() function because it will not provide hashes that are consistent across restarts of Python (see hash function in Python 3.3 returns different results between sessions). make_hashable()
will convert the object into nested tuples and make_hash_sha256()
will also convert the repr()
to a base64 encoded SHA256 hash.
import hashlib
import base64
def make_hash_sha256(o):
hasher = hashlib.sha256()
hasher.update(repr(make_hashable(o)).encode())
return base64.b64encode(hasher.digest()).decode()
def make_hashable(o):
if isinstance(o, (tuple, list)):
return tuple((make_hashable(e) for e in o))
if isinstance(o, dict):
return tuple(sorted((k,make_hashable(v)) for k,v in o.items()))
if isinstance(o, (set, frozenset)):
return tuple(sorted(make_hashable(e) for e in o))
return o
o = dict(x=1,b=2,c=[3,4,5],d={6,7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))
print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=
Solution 5
Here is a clearer solution.
def freeze(o):
if isinstance(o,dict):
return frozenset({ k:freeze(v) for k,v in o.items()}.items())
if isinstance(o,list):
return tuple([freeze(v) for v in o])
return o
def make_hash(o):
"""
makes a hash out of anything that contains only list,dict and hashable types including string and numeric types
"""
return hash(freeze(o))
ThiefMaster
🐈 Python/Web developer at CERN 😻 Gamer 😻 Hobby photographer cosplay/concerts
Updated on April 07, 2022Comments
-
ThiefMaster about 2 years
For caching purposes I need to generate a cache key from GET arguments which are present in a dict.
Currently I'm using
sha1(repr(sorted(my_dict.items())))
(sha1()
is a convenience method that uses hashlib internally) but I'm curious if there's a better way.-
Andrey Fedorov about 10 yearsthis might not work with nested dict. shortest solution is to use json.dumps(my_dict, sort_keys=True) instead, which will recurse into dict values.
-
Matthew Cornell over 9 yearsFYI re: dumps, stackoverflow.com/a/12739361/1082367 says "The output from pickle is not guaranteed to be canonical for similar reasons to dict and set order being non-deterministic. Don't use pickle or pprint or repr for hashing."
-
nyuwec about 8 yearssort the dict keys, not the items, i would also send the keys to the hash function.
-
FluxLemur about 6 yearsInteresting backstory about hashing mutable data structures (like dictionaries): python.org/dev/peps/pep-0351 was proposed to allow arbitrarily freezing objects, but rejected. For rationale, see this thread in python-dev: mail.python.org/pipermail/python-dev/2006-February/060793.html
-
Oliver almost 4 yearsIf your data is json format, and you want semantically invariant hashing, checkout github.com/schollii/sandals/blob/master/json_sem_hash.py. It works on nested structures (of course, since json), and does not depend on internals of dict like preserved order (which has evolved over the lifetime of python), and will give the same hash if two data structures are semantically the same (like
{'a': 1, 'b':2}
is semantically the same as{'b':2, 'a':1}
). I haven't used it on anything too complicated yet so YMMV but feedback welcome. -
shelper almost 3 yearsdoes not work for me with
d={'a': 'a', 'b': 'b'}; hashlib.md5(frozenset(d.items()))
, gives errorTypeError: object supporting the buffer API required
-
ThiefMaster almost 3 years@shelper you forgot the
repr()
(and possibly a.encode()
in python 3)
-
-
spatel over 12 yearsThis didn't work for me with a nested dictionary. I haven't tried the solution below (too complicated). The OP's solution works perfectly fine. I substituted sha1 with hash to save an import.
-
Ceasar Bautista over 12 yearsYou can also just use hash(tuple(my_dict.items())) to save a few characters for non-nested items.
-
Antimony almost 12 years@Ceaser That won't work because tuple implies ordering but dict items are unordered. frozenset is better.
-
Xealot over 11 yearsisinstance takes a sequence for the second argument, so isinstance(o, (set, tuple, list)) would work.
-
Xealot over 11 yearsthanks for making me realize frozenset could consistently hash querystring parameters :)
-
Bas Koopmans over 10 yearsThe items need to be sorted in order to create the same hash if the dict item order is different but the key values aren't -> return hash(tuple(frozenset(sorted(new_o.items()))))
-
Andrey Fedorov about 10 yearsThis seems like the best solution, but could you expound on why you think separators and ensure_ascii might be useful?
-
Jack O'Connor about 10 yearsThis is just being paranoid, but JSON allow most characters to show up in strings without any literal escaping, so the encoder gets to make some choices about whether to escape characters or just pass them through. The risk then is that different versions (or future versions) of the encoder could make different escaping choices by default, and then your program would compute different hash values for the same dictionary in different environments. The
ensure_ascii
argument would protect against this entirely hypothetical problem. -
Jack O'Connor about 10 yearsKind of unrelated, but you also need to worry about your encoder's unicode escaping if you plan on
eval
ing the JSON output as literal JS. Fun times. timelessrepo.com/json-isnt-a-javascript-subset -
glarrain almost 10 yearsWhy do you think it matters that
dict.items
does not return a predictably ordered list?frozenset
takes care of that -
glarrain almost 10 yearsA set, by definition, is unordered. Thus the order in which objects are added is irrelevant. You do have to realize that the built-in function
hash
does not care about how the frozenset contents are printed or something like that. Test it in several machines and python versions and you'll see. -
Tommaso Barbugli almost 10 yearsthats really neat! one line and you get nested dicts sorted too :)
-
charlax almost 10 yearsI tested the performance of this with different dataset, it's much much faster than
make_hash
. gist.github.com/charlax/b8731de51d2ea86c6eb9 -
Sergey Orshanskiy over 9 yearsNice! I also added a call to
hash
around lists and tuples. Otherwise it takes my lists of integers that happen to be values in my dictionary, and returns back lists of hashes, which is not what I want. -
ThiefMaster over 9 yearsWho guarantees that pformat or json always use the same order?
-
B Robster about 9 yearsBeware of the built-in hash if something needs to be consistent across different machines. Implementations of python on cloud platforms like Heroku and GAE will return different values for hash() on different instances making it useless for anything that must be shared between two or more "machines" (dynos in the case of heroku)
-
Hermann Schachner about 9 yearsIt might be interesting the
hash()
function does not produce a stable output. This means that, given the same input, it returns different results with different instances of the same python interpreter. To me, it looks like some sort of seed value is generated every time the interpreter is started. -
Nikokrock about 9 yearsexpected. the seed is introduced for security reason as far as I remember to add some kind of memory randomization. So you cannot expect the hash to be the same between two python processes
-
arthurprs almost 9 years@charlax ujson doesn't guarantee the order of the dict pairs, so it's not safe to do that
-
Daniel777 almost 9 yearsthis solution does not solve the general problem of get a hash of a dictionary
-
Arel over 8 years@ThiefMaster, "Changed in version 2.5: Dictionaries are sorted by key before the display is computed; before 2.5, a dictionary was sorted only if its display required more than one line, although that wasn’t documented."(docs.python.org/2/library/pprint.html)
-
RobM about 8 yearsA frozenset is an UNORDERED collection, so there is nothing to gain by sorting its inputs. Lists and tuples on the other hand are ORDERED collections ("sequences"), and so the hash value should be affected by the order of items within. You shouldn't sort them!
-
jomido about 8 years@RobM: ah! So it should be hash(tuple(sorted(frozenset(new_o.items())))). Comment?
-
kadee about 8 yearsThis solution only works as long as all keys are strings, e.g. json.dumps({'a': {(0, 5): 5, 1: 3}}) fails.
-
kkurian about 8 years@JackO'Connor Thankfully,
ensure_ascii
defaults toTrue
at least as far back as 2.6. -
RuiDo almost 8 yearsWhy do you use the extra hash() call in value = hash('%s::%s' % (value, type(value)))??
-
fiatjaf almost 8 yearsIf your dictionary has just one key,
hash(repr(my_dict))
will work also. -
mhristache over 7 yearsCan someone explain what is so wrong with this method?
-
Vlad Frolov over 7 years@maximi Dictionaries are not stable it terms of order, thus
hash(str({'a': 1, 'b': 2})) != hash(str({'b': 2, 'a': 1}))
(while it might work for some dictionaries, it is not guaranteed to work on all). -
Messa about 7 years@HermannSchachner That's correct - random seed for hash() was introduced in Python 3.3 for security reasons: stackoverflow.com/a/27522708/196206 That renders this answer unusable (except if you need hash stability only during a single interpreter lifetime).
-
David Sanders about 7 yearsThis doesn't seem valid to me. The pprint modules and pformat are understood by the authors to be for display purposes and not serialization. Because of this, you shouldn't feel safe in assuming that pformat will always return a result that happens to work.
-
Lorenzo Belli about 7 yearsSome datatype are not json serializable like datetime.datetime.
-
Izkata about 7 yearsUsing
foo.__dict__
isn't a good idea; that includes everything, including attributes that aren't necessarily relevant to the state (cache, for example).Foo
should define its own__hash__
method, and include only the values relevant for determining if twoFoo
s are the same object or not. -
Poik almost 7 years
make_hash_sha256(((0,1),(2,3)))==make_hash_sha256({0:1,2:3})==make_hash_sha256({2:3,0:1})!=make_hash_sha256(((2,3),(0,1)))
. This isn't quite the solution I'm looking for, but it is a nice intermediate. I'm thinking of addingtype(o).__name__
to the beginning of each of the tuples to force differentiation. -
mlissner about 6 years@LorenzoBelli, you can overcome that by adding
default=str
to thedumps
command. Seems to work nicely. -
James Hirschorn over 5 yearsThis doesn't work, for example, for a
dict
ofDataFrame
objects. -
Eric over 5 years@JamesHirschorn: Updated to fail loudly
-
James Hirschorn over 5 yearsBetter! I added the following
elif
clause to make it work withDataFrame
s:elif isinstance(x, pd.DataFrame): return make_hashable(hash_pandas_object(x).tolist())
I will edit the answer and see if you accept it... -
James Hirschorn over 5 yearsNote that
DataFrame
is an instance ofHashable
since it has a__hash__
function, even though the function returns an error message. -
Eric over 5 yearsI rolled back your change, and provided an alternative that's more extensible
-
James Hirschorn over 5 yearsUnfortunately, I have found that this approach fails. For example,
<frozendict {'a': 2, 'b': 1}>
returns a different hash value every time the python script runs. -
James Hirschorn over 5 yearsOK. I see I was asking for something more than "hashable", which only guarantees that objects that are equal shall have the same hash. I am working on a version that will give the the same value between runs, and independent of python version, etc..
-
Eric over 5 years
hash
randomization is deliberate security feature enabled by default in python 3.7. -
kolypto about 5 yearsIn Python 3.7, this solution peforms worse than
hash(json.dumps(d, sort_keys=True))
by a factor of 18. Proof: pastebin.com/XDZGhNHs -
Suraj almost 5 yearsThe library doesn't sort the list inside a dict. And hence this could produce different hash. There should be an option to sort a list too. A frozenset should help, but I wonder how would you handle the case with a nested dict containing a list of dicts. As dict are unhashable.
-
Suraj almost 5 yearsIf you want to sort the list as well :
tuple(sorted((make_hashable(e) for e in o)))
-
jtlz2 over 4 yearsmake_hash_sha256() - nice!
-
scottclowe over 4 years@Suraj You shouldn't want to sort the list before hashing because lists which have their contents in different orders are definitely not the same thing. If the order of the items does not matter the problem is that you are using the wrong data structure. You should be using a set instead of a list.
-
Suraj over 4 years@scottclowe That is very true. Thanks for adding that point. There are 2 scenarios where you'd still want a list (without specific ordering needs) - 1. List of repeating items. 2. When you've to use a JSON directly. As JSON doesn't support "set" representation.
-
pcworld over 4 yearsNote that Python's hash() is not a cryptographic hash function and thus not collision resistant. When a hash matches, you will still have to compare that the value associated with the hash is the expected one.
-
Pedro Cattori over 4 years@Suraj : it does handle nested structure via
.recurse
. See maps.readthedocs.io/en/latest/api.html#maps.FrozenMap.recurse . Ordering in lists is semantically meaningful, if you want order independence you can convert your lists to sets prior to calling.recurse
. You can also make use of thelist_fn
parameter to.recurse
to use a different hashable data structure thantuple
(.e.gfrozenset
) -
DylanYoung over 4 years@mlissner Then you have two non-equal dicts that generate the same hash (one with integer keys, one with string keys)
-
Peter Schorn about 4 yearsIf you change
if isinstance(o,list):
toif isinstance(obj, (set, tuple, list)):
then this function can work on any object. -
garciparedes almost 4 yearsHi! I've a question, is it really needed to
deepcopy
at each level of recursion? I don't think so. Removing that command the performance would improve a lot. -
imbr over 3 yearsfor my use case that's more than enough
-
user202729 over 3 yearsIt's understandable that this solution is faster with recursive dict/similar because json is implemented is C, but because of the redundant work it can be expected that a recursive-dict-hashing module implemented in C would be faster than this.
-
Ham almost 3 yearsNote that
json.dumps()
requires the input dictionary (nested or not) to be serializable , and if the key is not string (e.g. it is a number like integer or float) ,json.dumps()
automatically converts the number to string . Also the key cannot be a tuple as shown in @kadee comment -
MestreLion over 2 yearsAmazing answer and nice ideas! But, about hashing
foo.__dict__
, this is only possible for classes that don't define__slots__
. And many classes used in performance-sensitive scenarios (i.e., the classes you care about hashing performance) do define it. -
marco over 2 yearsSet PYTHONHASHSEED=0 then, if you have access to the environment at all.
-
Tom Wojcik over 2 years@marco well, that doesn't sound like something anyone wants. > Specifying the value 0 will disable hash randomization.
-
marco over 2 yearsIt will NOT disable hash randomization, just seed randomization and it's the only way to get randomization with same results on different machines, which you say is what you want to achieve (maybe I am misunderstanding you). In any case, both don't seem to have much to do with the original question.
-
VBobCat about 2 years@DylanYoung maybe
default=repr
would be better as for distinguishing among int and str numbers -
DylanYoung about 2 yearsThat would probably help @VBobCat, but the fundamental issue remains that any dictionary containing the same string as the
default
produces would cause hash collisions. Probably the safest bet would be to prepend some unique key (something very unlikely to ever be valid string key) or to transformv
itself to such a thing (maybe a base64 encoding of therepr
would do for most cases?), or both:default=lambda v: PREFIX + get_unique(v)
.default=lambda v: repr(v) + str(hash(v))
might be good enough for most use-cases, though hash collisions are possible of course. -
timgeb about 2 yearsNote that
{'a': 1} == {'a': 1.0}
->True
but your solution dumps different strings.