Python shortest unique id from strings

python md5 hashlib

21,533

Solution 1

One crude way can be, you could do md5 and then pick first 16 characters from it, instead of all 32. Collisions still won't be that high, and you'll have reasonable uniqueness guarantee.

Solution 2

You can save it as integer:

id_ = int(hashlib.md5(your_str).hexdigest(), 16)

Or as binary string:

id_ = hashlib.md5(your_str).digest()

Solution 3

The simplest solutions is to convert hexadecimal data (yor digests have base of 16) to something else, eg. with base 64.

If you agree on some level of higher risk, you can use only eg first ten digits (hexadecimal) of the digest. It will give you 16**10 (more than 10**12) possibilities instead of 16**32 (more than 10**38), but it is still huge and is commonly used technique (Git and Github usually use 7 digits for identifying commits, afair).

21,533

Author by

Blemone

Updated on September 11, 2020

Comments

Blemone over 3 years
I have more than 100 million unique strings (VARCHAR(100) UNIQUE in MySQL database). Now I use the code below to create unique hash from them (VARCHAR(32) UNIQUE) in order to reduct index size of the InnoDB table (a unique index on varchar(100) is roughly 3 times larger than on varchar(32) field).
```
id = hashlib.md5(str).hexdigest()
```
Is there any other method to create shorter ids from those strings and make reasonable uniqueness guarantees?