Why can a floating point dictionary key overwrite an integer key with the same value?
Solution 1
You should consider that the dict
aims at storing data depending on the logical numeric value, not on how you represented it.
The difference between int
s and float
s is indeed just an implementation detail and not conceptual. Ideally the only number type should be an arbitrary precision number with unbounded accuracy even sub-unity... this is however hard to implement without getting into troubles... but may be that will be the only future numeric type for Python.
So while having different types for technical reasons Python tries to hide these implementation details and int
->float
conversion is automatic.
It would be much more surprising if in a Python program if x == 1: ...
wasn't going to be taken when x
is a float
with value 1.
Note that also with Python 3 the value of 1/2
is 0.5
(the division of two integers) and that the types long
and non-unicode string have been dropped with the same attempt to hide implementation details.
Solution 2
First of all: the behaviour is documented explicitly in the docs for the hash function:
hash(object)
Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for
1
and1.0
).
Secondly, a limitation of hashing is pointed out in the docs for object.__hash__
object.__hash__(self)
Called by built-in function
hash()
and for operations on members of hashed collections includingset
,frozenset
, anddict. __hash__()
should return an integer. The only required property is that objects which compare equal have the same hash value;
This is not unique to python. Java has the same caveat: if you implement hashCode
then, in order for things to work correctly, you must implement it in such a way that: x.equals(y)
implies x.hashCode() == y.hashCode()
.
So, python decided that 1.0 == 1
holds, hence it's forced to provide an implementation for hash
such that hash(1.0) == hash(1)
. The side effect is that 1.0
and 1
act exactly in the same way as dict
keys, hence the behaviour.
In other words the behaviour in itself doesn't have to be used or useful in any way. It is necessary. Without that behaviour there would be cases where you could accidentally overwrite a different key.
If we had 1.0 == 1
but hash(1.0) != hash(1)
we could still have a collision. And if 1.0
and 1
collide, the dict
will use equality to be sure whether they are the same key or not and kaboom the value gets overwritten even if you intended them to be different.
The only way to avoid this would be to have 1.0 != 1
, so that the dict
is able to distinguish between them even in case of collision. But it was deemed more important to have 1.0 == 1
than to avoid the behaviour you are seeing, since you practically never use float
s and int
s as dictionary keys anyway.
Since python tries to hide the distinction between numbers by automatically converting them when needed (e.g. 1/2 -> 0.5
) it makes sense that this behaviour is reflected even in such circumstances. It's more consistent with the rest of python.
This behaviour would appear in any implementation where the matching of the keys is at least partially (as in a hash map) based on comparisons.
For example if a dict
was implemented using a red-black tree or an other kind of balanced BST, when the key 1.0
is looked up the comparisons with other keys would return the same results as for 1
and so they would still act in the same way.
Hash maps require even more care because of the fact that it's the value of the hash that is used to find the entry of the key and comparisons are done only afterwards. So breaking the rule presented above means you'd introduce a bug that's quite hard to spot because at times the dict
may seem to work as you'd expect it, and at other times, when the size changes, it would start to behave incorrectly.
Note that there would be a way to fix this: have a separate hash map/BST for each type inserted in the dictionary. In this way there couldn't be any collisions between objects of different type and how ==
compares wouldn't matter when the arguments have different types.
However this would complicate the implementation, it would probably be inefficient since hash maps have to keep quite a few free locations in order to have O(1) access times. If they become too full the performances decrease. Having multiple hash maps means wasting more space and also you'd need to first choose which hash map to look at before even starting the actual lookup of the key.
If you used BSTs you'd first have to lookup the type and the perform a second lookup. So if you are going to use many types you'd end up with twice the work (and the lookup would take O(log n) instead of O(1)).
Solution 3
In python:
1==1.0
True
This is because of implicit casting
However:
1 is 1.0
False
I can see why automatic casting between float
and int
is handy, It is relatively safe to cast int
into float
, and yet there are other languages (e.g. go) that stay away from implicit casting.
It is actually a language design decision and a matter of taste more than different functionalities
Solution 4
Dictionaries are implemented with a hash table. To look up something in a hash table, you start at the position indicated by the hash value, then search different locations until you find a key value that's equal or an empty bucket.
If you have two key values that compare equal but have different hashes, you may get inconsistent results depending on whether the other key value was in the searched locations or not. For example this would be more likely as the table gets full. This is something you want to avoid. It appears that the Python developers had this in mind, since the built-in hash
function returns the same hash for equivalent numeric values, no matter if those values are int
or float
. Note that this extends to other numeric types, False
is equal to 0
and True
is equal to 1
. Even fractions.Fraction
and decimal.Decimal
uphold this property.
The requirement that if a == b
then hash(a) == hash(b)
is documented in the definition of object.__hash__()
:
Called by built-in function
hash()
and for operations on members of hashed collections includingset
,frozenset
, anddict
.__hash__()
should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
TL;DR: a dictionary would break if keys that compared equal did not map to the same value.
Solution 5
Frankly, the opposite is dangerous! 1 == 1.0
, so it's not improbable to imagine that if you had them point to different keys and tried to access them based on an evaluated number then you'd likely run into trouble with it because the ambiguity is hard to figure out.
Dynamic typing means that the value is more important than what the technical type of something is, since the type is malleable (which is a very useful feature) and so distinguishing both ints
and floats
of the same value as distinct is unnecessary semantics that will only lead to confusion.
sjdenny
Updated on August 12, 2020Comments
-
sjdenny over 3 years
I'm working through http://www.mypythonquiz.com, and question #45 asks for the output of the following code:
confusion = {} confusion[1] = 1 confusion['1'] = 2 confusion[1.0] = 4 sum = 0 for k in confusion: sum += confusion[k] print sum
The output is
6
, since the key1.0
replaces1
. This feels a bit dangerous to me, is this ever a useful language feature?-
John Coleman over 8 yearsusing the built-in
sum
as a variable is also (slightly) confusing. I would distrust code that tries to use floats as int keys even if the language permits it, even though I am sure that there are some cases where it might be useful. -
MrAlexBailey over 8 yearsThis would only hurt you in situations where you would consider
1.0
and1
to be different. Given the (seemingly) rarity of these situations it's understandable that the default behavior is to treat them as equal. -
Mark Ransom over 8 years@JohnColeman I would be suspicious of any dictionary that contains disjoint types as keys.
-
John Coleman over 8 years@MarkRansom Good point. 99% of the times I just use strings or ints (or various tuples built up from them) anyway. I don't really trust float keys.
-
sjdenny over 8 years@Bakuriu Your title edit has changed the spirit of the question, my interest was principally whether this behaviour is useful/how it fits in with the rest of the Python language, rather than just 'why' this behaviour occurs. Still, many good answers have arisen regarding the original question.
-
Bakuriu over 8 yearsTitle and questions are two different things. The question body should not depend on the title. The title should be an English sentence summarizing the main point of the question (see here). A title like "Tag2 in python Tag3 vs Tag4" isn't a good title, even if it may sound catchy. If you think the current title doesn't express what you originally wanted to say change it, but it should be a sentence. More importantly, if you think the question has gotten a "wrong turn" in its answers, you should clarify it in its body with an explicit statement.
-
-
Mark Ransom over 8 years
int->float
promotion should be automatic in a context where it's necessary, but in this case I'd argue it isn't. What happens when you try to put anint
into a dictionary that is outside the bounds of afloat
, or that can't round-trip? -
6502 over 8 years
is
with numbers is not a good idea... for example afterx=1000000
the expressionx is 1000000
isFalse
. -
6502 over 8 years@MarkRansom: IMO the dictionary should store the values depending on the logical numeric value. Some numeric values can be represented with both
int
andfloats
and some only with one or the other. -
Mark Ransom over 8 yearsI've changed my mind and left my reasoning as an answer.
-
supercat over 8 yearsGiven x=1.0, and y=1, would x*123456789*123456789*123456789 and y*123456789*123456789*123456789 yield the same values or different values? While it makes sense to have a "mathematical" equality operator report that 1.0 and 1 represent the same value, a good language should also have a means of comparison which would recognize them as distinct.
-
SuperBiasedMan over 8 years@supercat Python does have the means to differentiate between the two of them, it just doesn't employ it when determining hash value.
-
Jeremy over 8 years"The difference between ints and floats is indeed just an implementation detail and not conceptual." I disagree (though it's true for ints and longs). Ints and floats have explicitly different behaviour for division (as you've noted) and only ints provide the
.bit_length()
method. Floats are also not allowed to be used as array indices -- if they were meant to be they should implement__index__
and only raise an error for non-integer values. These are definitely conceptual differences, not just implementation differences. -
justhalf over 8 yearsI think this explains the thing going on better
-
6502 over 8 years@JeremyBanks: the difference for division was a "design bug" in Python 2.x that could not be fixed for backward compatibility and it has been fixed asap (i.e. in Python 3.x). I'd say that also the fact that you cannot use
3.0
as an array index is a bug and not a feature but may be Guido disagrees on this. Of course there are differences (e.g.type(3)
andtype(3.0)
are not the same)... the point is if they are incidental differences (that we'd love to get rid of) or if they are desired differences... -
leftaroundabout over 8 years@6502: would you also like to be able to use
3.000000000000001
as an array index? Or2.999999999999999
, or3.141592653589793
? If no, I don't think you should be happy with3.0
either. -
6502 over 8 years@leftaroundabout: if you like
int
andfloat
to be different types then you shouldn't be happy with3 == 3.0
either; but this is IMO highly annoying (even if OCaml guys think differently). If3 == 3.0
thenx[3]
should be the same asx[3.0]
too.3.0000000001
on the other hand is something different and having it raising an error could help debugging problems. BTW note that double-precision numbers can represent exactly all integers with an absolute value less than 2^53... i.e. 9,007,199,254,740,992 (we're not going to have arrays that big around for quite a while). -
6502 over 8 yearsI find the discussion on hashing quite irrelevant. That is an implementation detail and an hash function that returns
42
for every value would be a valid albeit inefficient hash (thus you can never decide anything based on the hash value). The key point is that in Python3 == 3.0
and that dictionary works on equality. -
Bakuriu over 8 years@6502 I've added a few paragraphs at the end. See if they satisfy you. In any case, since the OP is asking about python's
dict
and not a generic concept of mapping I think hashing is indeed relevant. -
user253751 over 8 yearsA dict is a key/value store which cannot contain two entries with equal keys. Therefore, since 1 and 1.0 are equal, it cannot contain both. There, explained it without explaining how hashtables work...
-
Mark Ransom over 8 years@6502 but it's an important implementation detail, one that has huge practical consequences. A dictionary that sometimes treats
3==3.0
and sometimes doesn't would be a bad thing, and preventing it forces you to give them the same hash. -
6502 over 8 years@MarkRansom: sorry but probably I think my English is better than what it actually is. The hashing detail is irrelevant... what is relevant is equality comparison and that
3 == 3.0
. A dict in python uses equality comparison to match keys (this is the key point) and that's whyx[3]
andx[3.0]
are the same for a dict. It would be the same even without hash and just with linear scanning. -
Bakuriu over 8 years@6502 Except that a
dict
is not documented as a simple mapping but as an hash map, hence how hash map behaves is relevant independently on how other structures would behave. Besides I've already added what you call your main point, so I'm not going to remove useful information from the answer. -
leftaroundabout over 8 years@6502: I actually computed that number as
6 * cos(pi/4)**2
. That's mathematically exactly three, but floating point can not represent this fact. Contrived example, sure, but if you don't need any such “inexact” calculations then why do you want floating-point in the first place? It's a bit of a philosophical question; my stance is that floats never represent any numbers exactly, rather they always represent a tiny interval of numbers. Casting an exact type likeint
to float means you're choosing an interval which contains that int, but the reverse is not (uniquely) possible. -
Mark Ransom over 8 years@6502 the equality comparison will never be performed if you're looking at the wrong items. If the hashes don't match, you'll look in the wrong place and most of the time you won't get the item you wish to find. Certainly a linear scan would work, but it would be far too slow to be practical, that's why a hash map is used.
-
6502 over 8 years@MarkRansom: an hash function that returns different outputs for inputs that compare equal is invalid (like also it would be invalid one that returns time.time() or random.random() or that calls sys.exit(1)). As perfectly explained by @immibis there is no real relation between hash and the fact that a
dict
cannot contain both 3 and 3.0 in different slots... the issue is in__eq__
, not in__hash__
. -
Bruno Feroleto over 6 yearsAgreed, @6502. The original question is indeed readily answered through the offficial semantics of Python dictionaries, which says that two equal numeric keys yield the same dictionary value (docs.python.org/3.6/reference/…). Mentioning hashes only delves into technicalities related to implementation choices. That's interesting but not fundamental to answering the question.
-
Kim over 6 years@6502 The hashing is in fact the only reason for this behaviour. Limiting the answer to the equality only would keep the implementational truth secret; that dictionaries use the key's hash values to retrieve the values stored in them.