Short Python alphanumeric hash with minimal collisions
Solution 1
Why don't you just truncate SHA1 or MD5? You'll have more collisions then if you didn't truncate, but it's still better than designing your own. Note that you can base64-encode the truncated hash, rather than using hexadecimal. E.g.
import base64
import hashlib
hasher = hashlib.sha1("The quick brown fox")
base64.urlsafe_b64encode(hasher.digest()[:10])
You can truncate as little (including not at all) or as much as you want, as long as you understand the tradeoffs.
EDIT: Since you mentioned URL-safe, you can use urlsafe_b64encode and urlsafe_b64decode, which uses -
and _
rather than +
and /
.
Solution 2
The smallest builtin hash I am aware of is md5
>>> import hashlib, base64
>>> d=hashlib.md5(b"hello worlds").digest(); d=base64.b64encode(d);
>>> print(d)
b'S27ylES0wiLdFAGdUpFgCQ=='
Low collision and short are somewhat mutually exclusive due to the birthday paradox
To make it urlsafe you need to use the function from the base64 module
>>> import base64
>>> base64.urlsafe_b64encode(hashlib.md5("hello world").digest())
'XrY7u-Ae7tCTyyK7j1rNww=='
However there should be no problem storing the 16 byte md5 digest in the database in binary form.
>>> md5bytes=hashlib.md5("hello world").digest()
>>> len(md5bytes)
16
>>> urllib.quote_plus(md5bytes)
'%5E%B6%3B%BB%E0%1E%EE%D0%93%CB%22%BB%8FZ%CD%C3'
Python 2
>>> base64.urlsafe_b64encode(md5bytes)
'XrY7u-Ae7tCTyyK7j1rNww=='
Python 3
>>> base64.urlsafe_b64encode(md5bytes).decode('ascii')
'XrY7u-Ae7tCTyyK7j1rNww=='
You can choose either the quote_plus
or the urlsafe_b64encode
for your url, then decode with the corresponding function unquote_plus
or urlsafe_b64decode
before you look them up in the database.
Solution 3
Below is a solution that uses alphanumeric characters plus a few punctuation characters. It returns very short strings (around 8 characters).
import binascii, struct
def myhash(s):
return binascii.b2a_base64(struct.pack('i', hash(s)))
Solution 4
Hashids is a library (with Python support) that creates hashes that you can encode/decode very easily.
ensnare
Updated on January 23, 2020Comments
-
ensnare over 4 years
I'd like to set non-integer primary keys for a table using some kind of hash function. md5() seems to be kind of long (32-characters).
What are some alternative hash functions that perhaps use every letter in the alphabet as well as integers that are perhaps shorter in string length and have low collision rates?
Thanks!
-
ensnare over 14 yearsThanks. Is there any low collision alphanumeric hash function, less than say 16 characters, that does not involve truncating? Thank you.
-
ensnare over 14 yearsThanks. How can I make this urlsafe?
-
Matthew Flaschen over 14 yearsWhy don't you want to truncate?
-
John La Rooy over 14 years
hash(s)
gives a different result for 32/64 bit platforms -
Daniel Stutzbach over 14 years@gnibbler The question doesn't list consistency across platforms as a requirement.
-
speedplane almost 7 yearsYou may also want to remove all
=
characters added to the end. They don't substantially reduce collision rate, but they add two characters. So maybe something like:base64.urlsafe_b64encode(hasher.digest()[0:10]).replace('=', '')
-
bcattle about 6 years@speedplane Minor thing but
.rstrip('=')
makes the intent a little clearer -
wordsforthewise almost 6 yearshash() also gives different results each time you run a python instance
-
rana88 about 5 yearsUse
hashlib.hexdigest()
to get a url safe string without conversion -
user2471801 about 4 yearsI don't think you need to use
b64encode
the.hexdigest()
method of the hash object returns a string containing only hexadecimal digits. -
Abramodj about 4 yearsUse
.encode('utf-8')
if you get complains about encoding the initial string