How to hash a string into 8 digits?

175,838

Solution 1

Yes, you can use the built-in hashlib module or the built-in hash function. Then, chop-off the last eight digits using modulo operations or string slicing operations on the integer form of the hash:

>>> s = 'she sells sea shells by the sea shore'

>>> # Use hashlib
>>> import hashlib
>>> int(hashlib.sha1(s.encode("utf-8")).hexdigest(), 16) % (10 ** 8)
58097614L

>>> # Use hash()
>>> abs(hash(s)) % (10 ** 8)
82148974

Solution 2

Raymond's answer is great for python2 (though, you don't need the abs() nor the parens around 10 ** 8). However, for python3, there are important caveats. First, you'll need to make sure you are passing an encoded string. These days, in most circumstances, it's probably also better to shy away from sha-1 and use something like sha-256, instead. So, the hashlib approach would be:

>>> import hashlib
>>> s = 'your string'
>>> int(hashlib.sha256(s.encode('utf-8')).hexdigest(), 16) % 10**8
80262417

If you want to use the hash() function instead, the important caveat is that, unlike in Python 2.x, in Python 3.x, the result of hash() will only be consistent within a process, not across python invocations. See here:

$ python -V
Python 2.7.5
$ python -c 'print(hash("foo"))'
-4177197833195190597
$ python -c 'print(hash("foo"))'
-4177197833195190597

$ python3 -V
Python 3.4.2
$ python3 -c 'print(hash("foo"))'
5790391865899772265
$ python3 -c 'print(hash("foo"))'
-8152690834165248934

This means the hash()-based solution suggested, which can be shortened to just:

hash(s) % 10**8

will only return the same value within a given script run:

#Python 2:
$ python2 -c 's="your string"; print(hash(s) % 10**8)'
52304543
$ python2 -c 's="your string"; print(hash(s) % 10**8)'
52304543

#Python 3:
$ python3 -c 's="your string"; print(hash(s) % 10**8)'
12954124
$ python3 -c 's="your string"; print(hash(s) % 10**8)'
32065451

So, depending on if this matters in your application (it did in mine), you'll probably want to stick to the hashlib-based approach.

Solution 3

Just to complete JJC answer, in python 3.5.3 the behavior is correct if you use hashlib this way:

$ python3 -c '
import hashlib
hash_object = hashlib.sha256(b"Caroline")
hex_dig = hash_object.hexdigest()
print(hex_dig)
'
739061d73d65dcdeb755aa28da4fea16a02b9c99b4c2735f2ebfa016f3e7fded
$ python3 -c '
import hashlib
hash_object = hashlib.sha256(b"Caroline")
hex_dig = hash_object.hexdigest()
print(hex_dig)
'
739061d73d65dcdeb755aa28da4fea16a02b9c99b4c2735f2ebfa016f3e7fded

$ python3 -V
Python 3.5.3
Share:
175,838
Bob Fang
Author by

Bob Fang

@Bytedance

Updated on December 01, 2020

Comments

  • Bob Fang
    Bob Fang over 3 years

    Is there anyway that I can hash a random string into a 8 digit number without implementing any algorithms myself?

    • Theran
      Theran about 11 years
      hash("your string") % 100000000
    • DhruvPathak
      DhruvPathak about 11 years
      8 digit seems to small, and may result in collisions of hashes if you have large number of records. stackoverflow.com/questions/1303021/…
    • architectonic
      architectonic over 7 years
      Use hashlib since hash has another purpose!
    • Alex North-Keys
      Alex North-Keys about 7 years
      Any finite number of digits will result in collisions for sufficiently large numbers of hash items, that's why you shouldn't treat them as unique keys - it tends to turn into the birthday problem.
    • tryptofame
      tryptofame almost 7 years
      I've chosen "CityHash" to hash strings to 19 digit long integers (64bit integers), hoping this will lead to less potential collisions than Raymond's suggestion below. en.wikipedia.org/wiki/List_of_hash_functions
  • twneale
    twneale almost 9 years
    public service announcement...this technique doesn't actually result in a unique hash value for the string; it computes a hash and then munges into a non-guaranteed-unique value
  • Raymond Hettinger
    Raymond Hettinger almost 9 years
    public service announcement...except for the special case of perfect hashes over limited set of input values, hash functions aren't supposed to generate guaranteed unique values.
  • twneale
    twneale almost 9 years
    Probably true, but virtually all of their practical utility derives from their their good-enough tendency to produce unique values. The probability of a 'hash' collision using this trick is probably 10 or 11 orders of magnitude higher than md5
  • Raymond Hettinger
    Raymond Hettinger almost 9 years
    Did you read the OP's question? He (or she) wanted (or needed) 8 decimal places. Also, the way hash tables work is to hash into a small search space (the sparse table). You seem to not know want hash functions are commonly used for and to not care about the actual question that was asked.
  • twneale
    twneale almost 9 years
    I read the question. I'm simply observing that over the same input space as SHA-1, your answer is astronomically more likely to produce a collision than not. At least some degree of uniqueness is implicitly required by the question, but your answer is a hash function in the same spirit as one that simply returns 12345678 for every input. I was able to experimentally generate a collision with as few as 1000 inputs using this method. To preserve the same collision probability as SHA-1, you would have to map un-truncated SHA-1's to 8-digit integers. I think that's worthy of a PSA
  • Mr. Napik
    Mr. Napik over 8 years
    Careful, hash(s) is not guarateed to give same results across platforms and runs.
  • Doug
    Doug about 8 years
    Is abs needed? Modulo should return a positive int.
  • Raymond Hettinger
    Raymond Hettinger about 8 years
    In Python2, hash('agir') is -2835743962885600615.
  • JJC
    JJC over 7 years
    Right, but what I think Doug meant is that even a negative number mod something will always produce a positive number, so it seems you can drop the abs(). Also, I think the relative operator precedence of exponentiation means we don't even need the second parens. Thanks for the answer, though! >>> hash(s) % 10**8 produces 57227199
  • JJC
    JJC over 7 years
    An important caveat is that, unlike with Python 2.x, hash(x) returns a different value on each Python 3.x interpreter invocation (it is consistent within a single process). So, if the OP is depending on the hash to be the same for a given string across script runs, the latter will not work in Python 3.x. This just bit me. I will add an answer to reflect these two comments (not yet sure about etiquette of editing).
  • Wolph
    Wolph over 6 years
    It should be noted that this answer has a very important caveat since Python 3.3, to protect against tar-pitting Python 3.3 and above use a random hash seed upon startup.
  • silgon
    silgon over 5 years
    Should use 1e8 instead of 10**8 you're performing an extra computation when there is absolutely no need. Also, nice answer, it's exactly what I was looking for.
  • Raymond Hettinger
    Raymond Hettinger over 5 years
    @silgon Python's peephole optimizer does constant folding, so the computation is only done once. That is easy to verify. Run dis(compile('10 ** 8', '', 'eval')) and look for the fragment LOAD_CONST 0 (100000000). Alternatively, run def f(): return 10**8 and observe that f.__code__.co_consts returns (None, 100000000). Notes that 10E8 isn't a valid substitute because that is a float rather than an int.
  • silgon
    silgon over 5 years
    Wow... I just checked what you said, you're right, and it's really interesting, I thought that the power operation ** would always run an operation, however as you said, it's LOAD_CONST. Thanks for the interesting reply. Also, you're right, the scientific notation 1e8 gives a float.
  • lony
    lony over 5 years
    If digits are not your main requirement you could also use hashlib.sha256("hello world".encode('utf-8')).hexdigest()[:8] witch still will have collisions
  • user1783732
    user1783732 almost 5 years
    Some of the comments regarding 'unique value' are confusing. I am trying to do same thing, and tested in Python 3.7.4 and 3.5.3 on two different machines. For the same input string, the result are the same. Is it true that the same input string always results in the same output for hashlib.sha1 ?
  • Tomasz
    Tomasz over 4 years
    They should put that on the box!
  • Harabeck
    Harabeck over 4 years
    You're sharing a nodejs solution in a question about python?
  • user 923227
    user 923227 over 4 years
    Yes, when we were building the system - the backend processed this using python while the frontend used node.js. Needed to make sure both work seamlessly.
  • ingo
    ingo about 4 years
    If your extracted 8 digits start with a 0, you'll end up with a 7 digit number.
  • kaptan
    kaptan almost 3 years
    I think it is worth mentioning that if you want a stable hash you you should use the hashlib solution.