Why can a floating point dictionary key overwrite an integer key with the same value?

20,546

Solution 1

You should consider that the dict aims at storing data depending on the logical numeric value, not on how you represented it.

The difference between ints and floats is indeed just an implementation detail and not conceptual. Ideally the only number type should be an arbitrary precision number with unbounded accuracy even sub-unity... this is however hard to implement without getting into troubles... but may be that will be the only future numeric type for Python.

So while having different types for technical reasons Python tries to hide these implementation details and int->float conversion is automatic.

It would be much more surprising if in a Python program if x == 1: ... wasn't going to be taken when x is a float with value 1.

Note that also with Python 3 the value of 1/2 is 0.5 (the division of two integers) and that the types long and non-unicode string have been dropped with the same attempt to hide implementation details.

Solution 2

First of all: the behaviour is documented explicitly in the docs for the hash function:

hash(object)

Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup. Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).

Secondly, a limitation of hashing is pointed out in the docs for object.__hash__

object.__hash__(self)

Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value;

This is not unique to python. Java has the same caveat: if you implement hashCode then, in order for things to work correctly, you must implement it in such a way that: x.equals(y) implies x.hashCode() == y.hashCode().

So, python decided that 1.0 == 1 holds, hence it's forced to provide an implementation for hash such that hash(1.0) == hash(1). The side effect is that 1.0 and 1 act exactly in the same way as dict keys, hence the behaviour.

In other words the behaviour in itself doesn't have to be used or useful in any way. It is necessary. Without that behaviour there would be cases where you could accidentally overwrite a different key.

If we had 1.0 == 1 but hash(1.0) != hash(1) we could still have a collision. And if 1.0 and 1 collide, the dict will use equality to be sure whether they are the same key or not and kaboom the value gets overwritten even if you intended them to be different.

The only way to avoid this would be to have 1.0 != 1, so that the dict is able to distinguish between them even in case of collision. But it was deemed more important to have 1.0 == 1 than to avoid the behaviour you are seeing, since you practically never use floats and ints as dictionary keys anyway.

Since python tries to hide the distinction between numbers by automatically converting them when needed (e.g. 1/2 -> 0.5) it makes sense that this behaviour is reflected even in such circumstances. It's more consistent with the rest of python.


This behaviour would appear in any implementation where the matching of the keys is at least partially (as in a hash map) based on comparisons.

For example if a dict was implemented using a red-black tree or an other kind of balanced BST, when the key 1.0 is looked up the comparisons with other keys would return the same results as for 1 and so they would still act in the same way.

Hash maps require even more care because of the fact that it's the value of the hash that is used to find the entry of the key and comparisons are done only afterwards. So breaking the rule presented above means you'd introduce a bug that's quite hard to spot because at times the dict may seem to work as you'd expect it, and at other times, when the size changes, it would start to behave incorrectly.


Note that there would be a way to fix this: have a separate hash map/BST for each type inserted in the dictionary. In this way there couldn't be any collisions between objects of different type and how == compares wouldn't matter when the arguments have different types.

However this would complicate the implementation, it would probably be inefficient since hash maps have to keep quite a few free locations in order to have O(1) access times. If they become too full the performances decrease. Having multiple hash maps means wasting more space and also you'd need to first choose which hash map to look at before even starting the actual lookup of the key.

If you used BSTs you'd first have to lookup the type and the perform a second lookup. So if you are going to use many types you'd end up with twice the work (and the lookup would take O(log n) instead of O(1)).

Solution 3

In python:

1==1.0
True

This is because of implicit casting

However:

1 is 1.0
False

I can see why automatic casting between float and int is handy, It is relatively safe to cast int into float, and yet there are other languages (e.g. go) that stay away from implicit casting.

It is actually a language design decision and a matter of taste more than different functionalities

Solution 4

Dictionaries are implemented with a hash table. To look up something in a hash table, you start at the position indicated by the hash value, then search different locations until you find a key value that's equal or an empty bucket.

If you have two key values that compare equal but have different hashes, you may get inconsistent results depending on whether the other key value was in the searched locations or not. For example this would be more likely as the table gets full. This is something you want to avoid. It appears that the Python developers had this in mind, since the built-in hash function returns the same hash for equivalent numeric values, no matter if those values are int or float. Note that this extends to other numeric types, False is equal to 0 and True is equal to 1. Even fractions.Fraction and decimal.Decimal uphold this property.

The requirement that if a == b then hash(a) == hash(b) is documented in the definition of object.__hash__():

Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. __hash__() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.

TL;DR: a dictionary would break if keys that compared equal did not map to the same value.

Solution 5

Frankly, the opposite is dangerous! 1 == 1.0, so it's not improbable to imagine that if you had them point to different keys and tried to access them based on an evaluated number then you'd likely run into trouble with it because the ambiguity is hard to figure out.

Dynamic typing means that the value is more important than what the technical type of something is, since the type is malleable (which is a very useful feature) and so distinguishing both ints and floats of the same value as distinct is unnecessary semantics that will only lead to confusion.

Share:
20,546
sjdenny
Author by

sjdenny

Updated on August 12, 2020

Comments

  • sjdenny
    sjdenny over 3 years

    I'm working through http://www.mypythonquiz.com, and question #45 asks for the output of the following code:

    confusion = {}
    confusion[1] = 1
    confusion['1'] = 2
    confusion[1.0] = 4
    
    sum = 0
    for k in confusion:
        sum += confusion[k]
    
    print sum
    

    The output is 6, since the key 1.0 replaces 1. This feels a bit dangerous to me, is this ever a useful language feature?

    • John Coleman
      John Coleman over 8 years
      using the built-in sum as a variable is also (slightly) confusing. I would distrust code that tries to use floats as int keys even if the language permits it, even though I am sure that there are some cases where it might be useful.
    • MrAlexBailey
      MrAlexBailey over 8 years
      This would only hurt you in situations where you would consider 1.0 and 1 to be different. Given the (seemingly) rarity of these situations it's understandable that the default behavior is to treat them as equal.
    • Mark Ransom
      Mark Ransom over 8 years
      @JohnColeman I would be suspicious of any dictionary that contains disjoint types as keys.
    • John Coleman
      John Coleman over 8 years
      @MarkRansom Good point. 99% of the times I just use strings or ints (or various tuples built up from them) anyway. I don't really trust float keys.
    • sjdenny
      sjdenny over 8 years
      @Bakuriu Your title edit has changed the spirit of the question, my interest was principally whether this behaviour is useful/how it fits in with the rest of the Python language, rather than just 'why' this behaviour occurs. Still, many good answers have arisen regarding the original question.
    • Bakuriu
      Bakuriu over 8 years
      Title and questions are two different things. The question body should not depend on the title. The title should be an English sentence summarizing the main point of the question (see here). A title like "Tag2 in python Tag3 vs Tag4" isn't a good title, even if it may sound catchy. If you think the current title doesn't express what you originally wanted to say change it, but it should be a sentence. More importantly, if you think the question has gotten a "wrong turn" in its answers, you should clarify it in its body with an explicit statement.
  • Mark Ransom
    Mark Ransom over 8 years
    int->float promotion should be automatic in a context where it's necessary, but in this case I'd argue it isn't. What happens when you try to put an int into a dictionary that is outside the bounds of a float, or that can't round-trip?
  • 6502
    6502 over 8 years
    is with numbers is not a good idea... for example after x=1000000 the expression x is 1000000 is False.
  • 6502
    6502 over 8 years
    @MarkRansom: IMO the dictionary should store the values depending on the logical numeric value. Some numeric values can be represented with both int and floats and some only with one or the other.
  • Mark Ransom
    Mark Ransom over 8 years
    I've changed my mind and left my reasoning as an answer.
  • supercat
    supercat over 8 years
    Given x=1.0, and y=1, would x*123456789*123456789*123456789 and y*123456789*123456789*123456789 yield the same values or different values? While it makes sense to have a "mathematical" equality operator report that 1.0 and 1 represent the same value, a good language should also have a means of comparison which would recognize them as distinct.
  • SuperBiasedMan
    SuperBiasedMan over 8 years
    @supercat Python does have the means to differentiate between the two of them, it just doesn't employ it when determining hash value.
  • Jeremy
    Jeremy over 8 years
    "The difference between ints and floats is indeed just an implementation detail and not conceptual." I disagree (though it's true for ints and longs). Ints and floats have explicitly different behaviour for division (as you've noted) and only ints provide the .bit_length() method. Floats are also not allowed to be used as array indices -- if they were meant to be they should implement __index__ and only raise an error for non-integer values. These are definitely conceptual differences, not just implementation differences.
  • justhalf
    justhalf over 8 years
    I think this explains the thing going on better
  • 6502
    6502 over 8 years
    @JeremyBanks: the difference for division was a "design bug" in Python 2.x that could not be fixed for backward compatibility and it has been fixed asap (i.e. in Python 3.x). I'd say that also the fact that you cannot use 3.0 as an array index is a bug and not a feature but may be Guido disagrees on this. Of course there are differences (e.g. type(3) and type(3.0) are not the same)... the point is if they are incidental differences (that we'd love to get rid of) or if they are desired differences...
  • leftaroundabout
    leftaroundabout over 8 years
    @6502: would you also like to be able to use 3.000000000000001 as an array index? Or 2.999999999999999, or 3.141592653589793? If no, I don't think you should be happy with 3.0 either.
  • 6502
    6502 over 8 years
    @leftaroundabout: if you like int and float to be different types then you shouldn't be happy with 3 == 3.0 either; but this is IMO highly annoying (even if OCaml guys think differently). If 3 == 3.0 then x[3] should be the same as x[3.0] too. 3.0000000001 on the other hand is something different and having it raising an error could help debugging problems. BTW note that double-precision numbers can represent exactly all integers with an absolute value less than 2^53... i.e. 9,007,199,254,740,992 (we're not going to have arrays that big around for quite a while).
  • 6502
    6502 over 8 years
    I find the discussion on hashing quite irrelevant. That is an implementation detail and an hash function that returns 42 for every value would be a valid albeit inefficient hash (thus you can never decide anything based on the hash value). The key point is that in Python 3 == 3.0 and that dictionary works on equality.
  • Bakuriu
    Bakuriu over 8 years
    @6502 I've added a few paragraphs at the end. See if they satisfy you. In any case, since the OP is asking about python's dict and not a generic concept of mapping I think hashing is indeed relevant.
  • user253751
    user253751 over 8 years
    A dict is a key/value store which cannot contain two entries with equal keys. Therefore, since 1 and 1.0 are equal, it cannot contain both. There, explained it without explaining how hashtables work...
  • Mark Ransom
    Mark Ransom over 8 years
    @6502 but it's an important implementation detail, one that has huge practical consequences. A dictionary that sometimes treats 3==3.0 and sometimes doesn't would be a bad thing, and preventing it forces you to give them the same hash.
  • 6502
    6502 over 8 years
    @MarkRansom: sorry but probably I think my English is better than what it actually is. The hashing detail is irrelevant... what is relevant is equality comparison and that 3 == 3.0. A dict in python uses equality comparison to match keys (this is the key point) and that's why x[3] and x[3.0] are the same for a dict. It would be the same even without hash and just with linear scanning.
  • Bakuriu
    Bakuriu over 8 years
    @6502 Except that a dict is not documented as a simple mapping but as an hash map, hence how hash map behaves is relevant independently on how other structures would behave. Besides I've already added what you call your main point, so I'm not going to remove useful information from the answer.
  • leftaroundabout
    leftaroundabout over 8 years
    @6502: I actually computed that number as 6 * cos(pi/4)**2. That's mathematically exactly three, but floating point can not represent this fact. Contrived example, sure, but if you don't need any such “inexact” calculations then why do you want floating-point in the first place? It's a bit of a philosophical question; my stance is that floats never represent any numbers exactly, rather they always represent a tiny interval of numbers. Casting an exact type like int to float means you're choosing an interval which contains that int, but the reverse is not (uniquely) possible.
  • Mark Ransom
    Mark Ransom over 8 years
    @6502 the equality comparison will never be performed if you're looking at the wrong items. If the hashes don't match, you'll look in the wrong place and most of the time you won't get the item you wish to find. Certainly a linear scan would work, but it would be far too slow to be practical, that's why a hash map is used.
  • 6502
    6502 over 8 years
    @MarkRansom: an hash function that returns different outputs for inputs that compare equal is invalid (like also it would be invalid one that returns time.time() or random.random() or that calls sys.exit(1)). As perfectly explained by @immibis there is no real relation between hash and the fact that a dict cannot contain both 3 and 3.0 in different slots... the issue is in __eq__, not in __hash__.
  • Bruno Feroleto
    Bruno Feroleto over 6 years
    Agreed, @6502. The original question is indeed readily answered through the offficial semantics of Python dictionaries, which says that two equal numeric keys yield the same dictionary value (docs.python.org/3.6/reference/…). Mentioning hashes only delves into technicalities related to implementation choices. That's interesting but not fundamental to answering the question.
  • Kim
    Kim over 6 years
    @6502 The hashing is in fact the only reason for this behaviour. Limiting the answer to the equality only would keep the implementational truth secret; that dictionaries use the key's hash values to retrieve the values stored in them.