The most effective way to assign unique integer id to a string?

13,712

Solution 1

For comparison purposes, you can intern the strings and then compare them with is instead of ==, which does a simple pointer comparison and should be as fast as (or faster than) comparing two integers:

>>> 'foo' * 100 is 'foo' * 100
False
>>> intern('foo' * 100) is intern('foo' * 100)
True

intern guarantees that id(intern(A)) == id(intern(B)) iff A == B. Be sure to intern any string as soon as it is input. Note that intern is called sys.intern in Python 3.x.

But when you have to pass these strings to other processes, your dict solution seems best. What I usually do in such situations is

str_to_id = {}
for s in strings:
    str_to_id.setdefault(s, len(str_to_id))

so the integer capacity is more than enough

Python integers are bigints, so that should never be a problem.

Solution 2

How about the hash function?

In [130]: hash
Out[130]: <function hash>

In [131]: hash('foo')
Out[131]: -740391237

There is no need to store hashes (unless you want to): the point is that they are equal for objects that are value-equal (although the reverse may not be true - there are no doubt unequal strings or other objects that hash to the same value; such is the nature of hashing).

If you know the range of your keys (and you probably do), you could also use a perfect hash function generator. This is apparently one for python: http://ilan.schnell-web.net/prog/perfect-hash/

Perfect hashes guarantee that keys within the specified range have a bijective relationship with their hash value.

Solution 3

You could use one of the hashlib algorithms to create a cryptographically sound digest of the long message, and then use this as dictionary keys. Example using SHA-256:

import hashlib
...
key = hashlib.sha256(longMessage).digest()

The chance of collisions is much smaller this way than by using hash(longMessage).

However, this could introduce a potentially big overhead. Unless memory usage is a big concern I would simply use the original strings as keys instead.

Solution 4

I've used the following for this purpose:

>>> from collections import defaultdict
>>> d = defaultdict(lambda: len(d))
>>> d["cats"]
0
>>> d["cars"]
1
>>> d["cats"]
0

Solution 5

If they are stored in memory, and you're comparing each string as an object rather than as text I would suggest using id(string) to get a unique integer. Alternatively, if you're storing them in a dict you could use a defaultdict with a set of matches and hash them:

>>> strings = 'a whole lot of strings which may share a hash'.split()
>>> storage = defaultdict(set)
>>> for s in strings:
...     storage[hash(s)].add(s)
>>> storage[hash('a')]
{'a', 'a'}

Exactly how you would implement this depends on how you're using them, but the basic idea should work. If you could post a specific example of what you're trying to do it might be easier to give a more detailed answer.

Share:
13,712
lithuak
Author by

lithuak

Updated on June 09, 2022

Comments

  • lithuak
    lithuak about 2 years

    The program that I write processes a large number of objects, each with its own unique id, which itself is a string of complicated structure (dozen of unique fields of the object joined by some separator) and big length.

    Since I have to process a lot of these objects fast and I need to reffer to them by id while processing and I have no power to change their format (I retrieve them externally, by network), I want to map their complicated string id to my own internal integer id and further use it for comparison, for transfering them further to other processes, etc.

    What I'm going to do is to use a simple dict with keys as string id of the object and integer values as my internal integer id of it.

    My question is: is there a better way in Python to do this? May be there is a way to calculate some hash manually, whatever? May be the dict is not the best solution?

    As for numbers: there are about 100K of such unique objects in the system at a time, so the integer capacity is more than enough.

  • Fred Foo
    Fred Foo over 12 years
    id only works for this purpose after interning. Comparing the id of two objects is the same as comparing with is.
  • aquavitae
    aquavitae over 12 years
    True. That's why I said compare as objects rather than text. intern would certainly be necessary, then id would give a unique integer.
  • lithuak
    lithuak over 12 years
    Thanks for great link and proposition, Marcin! Collision are what I'm concerned about: I would like to use my own hash function, exploting my knoweledge of key structure, but to handle collisions I would have to implement the whole slots mechanism, while dict() gives it to me for free, even though its internal hashing would probably be more general and thus less effective than I could do for my specific strings.
  • Dhruv Ghulati
    Dhruv Ghulati almost 8 years
    This didn't work for me, when I actually print the intern for 2 strings: featureVal is cents Encoded feature value is 4571865856 featureVal is number_slot Encoded feature value is 4571867200 featureVal is bn Encoded feature value is 4537878648 featureVal is saudi Encoded feature value is 4571865856 - note that the word cents is exactly the same feature value as saudi. The line is print "featureVal is ",featureVal print "Encoded feature value is", id(intern(str(featureVal.encode("utf-8"))))