Short Python alphanumeric hash with minimal collisions

33,207

Solution 1

Why don't you just truncate SHA1 or MD5? You'll have more collisions then if you didn't truncate, but it's still better than designing your own. Note that you can base64-encode the truncated hash, rather than using hexadecimal. E.g.

import base64
import hashlib
hasher = hashlib.sha1("The quick brown fox")
base64.urlsafe_b64encode(hasher.digest()[:10])

You can truncate as little (including not at all) or as much as you want, as long as you understand the tradeoffs.

EDIT: Since you mentioned URL-safe, you can use urlsafe_b64encode and urlsafe_b64decode, which uses - and _ rather than + and /.

Solution 2

The smallest builtin hash I am aware of is md5

>>> import hashlib, base64
>>> d=hashlib.md5(b"hello worlds").digest(); d=base64.b64encode(d); 
>>> print(d)

b'S27ylES0wiLdFAGdUpFgCQ=='

Low collision and short are somewhat mutually exclusive due to the birthday paradox

To make it urlsafe you need to use the function from the base64 module

>>> import base64
>>> base64.urlsafe_b64encode(hashlib.md5("hello world").digest())
'XrY7u-Ae7tCTyyK7j1rNww=='

However there should be no problem storing the 16 byte md5 digest in the database in binary form.

>>> md5bytes=hashlib.md5("hello world").digest()
>>> len(md5bytes)
16
>>> urllib.quote_plus(md5bytes)
'%5E%B6%3B%BB%E0%1E%EE%D0%93%CB%22%BB%8FZ%CD%C3'

Python 2

>>> base64.urlsafe_b64encode(md5bytes)
'XrY7u-Ae7tCTyyK7j1rNww=='

Python 3

>>> base64.urlsafe_b64encode(md5bytes).decode('ascii')
'XrY7u-Ae7tCTyyK7j1rNww=='

You can choose either the quote_plus or the urlsafe_b64encode for your url, then decode with the corresponding function unquote_plus or urlsafe_b64decode before you look them up in the database.

Solution 3

Below is a solution that uses alphanumeric characters plus a few punctuation characters. It returns very short strings (around 8 characters).

import binascii, struct

def myhash(s):
    return binascii.b2a_base64(struct.pack('i', hash(s)))

Solution 4

Hashids is a library (with Python support) that creates hashes that you can encode/decode very easily.

http://hashids.org/python/

Share:
33,207
ensnare
Author by

ensnare

Updated on January 23, 2020

Comments

  • ensnare
    ensnare over 4 years

    I'd like to set non-integer primary keys for a table using some kind of hash function. md5() seems to be kind of long (32-characters).

    What are some alternative hash functions that perhaps use every letter in the alphabet as well as integers that are perhaps shorter in string length and have low collision rates?

    Thanks!

  • ensnare
    ensnare over 14 years
    Thanks. Is there any low collision alphanumeric hash function, less than say 16 characters, that does not involve truncating? Thank you.
  • ensnare
    ensnare over 14 years
    Thanks. How can I make this urlsafe?
  • Matthew Flaschen
    Matthew Flaschen over 14 years
    Why don't you want to truncate?
  • John La Rooy
    John La Rooy over 14 years
    hash(s) gives a different result for 32/64 bit platforms
  • Daniel Stutzbach
    Daniel Stutzbach over 14 years
    @gnibbler The question doesn't list consistency across platforms as a requirement.
  • speedplane
    speedplane almost 7 years
    You may also want to remove all = characters added to the end. They don't substantially reduce collision rate, but they add two characters. So maybe something like: base64.urlsafe_b64encode(hasher.digest()[0:10]).replace('=', '')
  • bcattle
    bcattle about 6 years
    @speedplane Minor thing but .rstrip('=') makes the intent a little clearer
  • wordsforthewise
    wordsforthewise almost 6 years
    hash() also gives different results each time you run a python instance
  • rana88
    rana88 about 5 years
    Use hashlib.hexdigest() to get a url safe string without conversion
  • user2471801
    user2471801 about 4 years
    I don't think you need to use b64encode the .hexdigest() method of the hash object returns a string containing only hexadecimal digits.
  • Abramodj
    Abramodj about 4 years
    Use .encode('utf-8') if you get complains about encoding the initial string