Hashing (hiding) strings in Python

Solution 1

First off, let me say that you can't guarantee unique results. If you want unique results for all the strings in the universe, you're better off storing the string itself (or a compressed version).

More on that in a second. Let's get some hashes first.

hashlib way

You can use any of the main cryptographic hashes to hash a string with a few steps:

>>> import hashlib
>>> # Python 3 hashes bytes, so encode the string first
>>> sha = hashlib.sha1("I am a cat".encode("utf-8"))
>>> sha.hexdigest()
'576f38148ae68c924070538b45a8ef0f73ed8710'

As far as the always-available built-ins are concerned, you have a choice between MD5, SHA1, SHA224, SHA256, SHA384, and SHA512 (Python 3.6 and later also ship the SHA-3 and BLAKE2 families).

What's the difference between those hash algorithms?

A hash function works by taking data of variable length and turning it into data of fixed length.

The fixed length, in the case of each of the SHA algorithms built into hashlib, is the number of bits specified in the name (with the exception of sha1 which is 160 bits). If you want better certainty that two strings won't end up in the same bucket (same hash value), pick a hash with a bigger digest (the fixed length).

In sorted order, these are the digest sizes you have to work with:

Algorithm  Digest Size (in bits)
md5        128
sha1       160
sha224     224
sha256     256
sha384     384
sha512     512
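
The sizes in the table can be checked against hashlib itself. This is a minimal sketch, assuming Python 3 and that all six algorithms are enabled in your build; `digest_size` is reported in bytes, so it is multiplied by 8. The helper name `digest_bits` is only illustrative.

```python
import hashlib

# Algorithms from the table above; hashlib.new() looks each one up by name.
ALGORITHMS = ["md5", "sha1", "sha224", "sha256", "sha384", "sha512"]

def digest_bits(name):
    """Return the digest size of the named algorithm in bits."""
    return hashlib.new(name).digest_size * 8

sizes = {name: digest_bits(name) for name in ALGORITHMS}
for name, bits in sizes.items():
    print(f"{name:<10} {bits}")
```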

The bigger the digest the less likely you'll have a collision, provided your hash function is worth its salt.
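
To see why a fixed-size digest cannot avoid collisions, the sketch below shrinks the digest to a single byte. That is an artificial choice, made purely for illustration: with only 256 possible values, any 257 distinct strings must contain at least one collision. `tiny_hash` and `find_collision` are hypothetical helper names, not part of hashlib.

```python
import hashlib

def tiny_hash(text):
    """Illustrative 8-bit hash: the first byte of the SHA-256 digest."""
    return hashlib.sha256(text.encode("utf-8")).digest()[0]

def find_collision(candidates):
    """Return the first pair of distinct strings sharing a tiny_hash value."""
    seen = {}  # hash value -> first string that produced it
    for text in candidates:
        value = tiny_hash(text)
        if value in seen:
            return seen[value], text
        seen[value] = text
    return None

# 257 distinct inputs into 256 buckets: a collision is guaranteed.
pair = find_collision(f"string-{i}" for i in range(257))
print(pair)
```

A real 160-bit or 256-bit digest has the same property in principle, but the number of buckets is so large that you will not run into a collision by accident.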

Wait, what about hash()?

The built-in hash() function returns integers, which could also be convenient for the purpose you outline. There are problems, though.

>>> hash('moo')
6387157653034356308

  1. If your program is going to run on different systems, you can't be sure that hash will return the same thing. In fact, I'm running on a 64-bit box using 64-bit Python. These values are going to be wildly different than for 32-bit Python.

  2. For Python 3.3+, as @gnibbler pointed out, hash() is randomized between runs. It will work for a single run, but almost definitely won't work across runs of your program (pulling from the text file you mentioned).
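
The run-to-run randomization can be observed by starting fresh interpreters. This is a rough sketch, assuming CPython 3.7 or later; it relies on the `PYTHONHASHSEED` environment variable, which fixes the seed that is otherwise chosen at random on every start. The helper name `hash_in_fresh_interpreter` is made up for this example.

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(text, seed):
    """Run hash(text) in a new Python process with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())

# Same seed: same value. Different seeds: the value will almost certainly change.
a = hash_in_fresh_interpreter("moo", seed=1)
b = hash_in_fresh_interpreter("moo", seed=1)
c = hash_in_fresh_interpreter("moo", seed=2)
print(a == b, a == c)
```

Without a fixed seed, every run behaves like the "different seeds" case, which is why a value written to a file by one run is useless to the next.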

Why would hash() be built that way? Well, the built-in hash is there for one specific reason: hash tables (dictionaries and other lookup tables) in memory. It is meant for cheap lookups at runtime, not for cryptographic use.

Don't use hash(), use hashlib.
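
For the use case in the question (hash what the user types and compare it with a hash stored in a text file), the workflow might look like the sketch below. It assumes Python 3, SHA-256, and UTF-8; the file name `secret.txt` and the helpers `hash_phrase` and `matches_stored` are hypothetical.

```python
import hashlib
import hmac
import os
import tempfile

def hash_phrase(phrase):
    """Return the SHA-256 hex digest of the phrase, encoded as UTF-8."""
    return hashlib.sha256(phrase.encode("utf-8")).hexdigest()

def matches_stored(phrase, path):
    """Hash the phrase and compare it with the digest saved in the file."""
    with open(path, encoding="ascii") as handle:
        stored = handle.read().strip()
    return hmac.compare_digest(hash_phrase(phrase), stored)

# Demo with a temporary file standing in for the real text file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "secret.txt")
    with open(path, "w", encoding="ascii") as handle:
        handle.write(hash_phrase("I am a cat") + "\n")
    right = matches_stored("I am a cat", path)
    wrong = matches_stored("I am a dog", path)

print(right, wrong)
```

`hmac.compare_digest` is used instead of `==` only as a habit that avoids timing leaks; for a phrase that does not need to be secure, plain equality would work as well.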

Solution 2

You can simply use the base64 module to achieve your goal:

>>> import base64
>>> a = 'helloworld'
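>>> # Python 3: b64encode works on bytes (Python 2 code used base64.encodestring(a))
>>> encoded_str = base64.b64encode(a.encode('utf-8'))
>>> encoded_str
b'aGVsbG93b3JsZA=='
>>> base64.b64decode(encoded_str).decode('utf-8')
'helloworld'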

Of course, you can also use the hashlib module. It is more secure, because a hashed string cannot be decoded later (or only with great difficulty), but for your question base64 is enough: "It doesn't really have to be secure".
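
The trade-off is that Base64 is an encoding, not a hash: anyone who finds the stored text can reverse it. A small sketch of that round trip, assuming Python 3 and UTF-8 text; `obscure` and `reveal` are illustrative names.

```python
import base64

def obscure(text):
    """Encode text as Base64 so it is not readable at a glance."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def reveal(token):
    """Reverse obscure(): anyone holding the token can do this."""
    return base64.b64decode(token).decode("utf-8")

token = obscure("helloworld")
print(token)
print(reveal(token))
```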

Solution 3

Note that Python's string hash is not "defined" - it can, and does, vary across releases and implementations. So storing a Python string hash will create difficulties. CPython's string hash makes no attempt to be "obscure", either.

A standard approach is to use a hash function designed for this kind of thing. Like this:

>>> import hashlib
>>> encoded = hashlib.sha1(b"abcdef") # "abcdef" is the password; Python 3 needs bytes
>>> encoded.hexdigest()
'1f8ac10f23c5b5bc1167bda84b833e5c057a77d2'

That long string of hexadecimal digits is "the hash". SHA-1 was regarded as a "strong" hash function when this was written; researchers did get famous in 2017 by publishing two files that hash to the same value, so prefer SHA-256 where collisions matter. Either way, given the same input, it will return the same "hexdigest" on all platforms across all releases and implementations of Python.
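
One caveat: the digest is computed over bytes, so "the same input" means the same bytes. A sketch, assuming Python 3, of how the choice of text encoding changes the result; `sha1_hex` is a hypothetical helper, and the sample phrase is arbitrary.

```python
import hashlib

def sha1_hex(text, encoding="utf-8"):
    """Return the SHA-1 hex digest of text under the given encoding."""
    return hashlib.sha1(text.encode(encoding)).hexdigest()

phrase = "na\u00efve caf\u00e9"  # contains non-ASCII characters
utf8_digest = sha1_hex(phrase)
latin1_digest = sha1_hex(phrase, encoding="latin-1")

# Same text, different bytes, different digests.
print(utf8_digest == latin1_digest)
# Repeating the call with the same encoding always gives the same digest.
print(utf8_digest == sha1_hex(phrase))
```

Picking one encoding (UTF-8 is a common choice) and using it both when storing and when checking keeps the comparison reliable.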

Author: Lucas

Updated on June 28, 2022

Comments

  • Lucas about 2 years

    What I need is to hash a string. It doesn't have to be secure because it's just going to be a hidden phrase in the text file (it just has to be unrecognizable to the human eye).

    It should not be just a random string, because when the user types the string I would like to hash it and compare it with an already hashed one (from the text file).

    What would be the best for this purpose? Can it be done with the built-in classes?

  • Lucas over 10 years
    Is this going to generate unique numbers for each string? Can it be decoded (just curious)?
  • Tim Peters over 10 years
    @Lucas, it's impossible for a fixed-size hash function to return a different value for all possible strings. For example, if a hash function returns 2 bits, it has only 4 possible values.
  • roippi over 10 years
    @Lucas no, a hash cannot be 'decoded'. If two objects are equal, their hash is equal; however, many objects can (will, in the case of strings) resolve to the same hash value.
  • Lucas over 10 years
    Does base64 come with Python 2.3 (yeah, I know it's odd) by default?
  • John La Rooy over 10 years
    hash() is randomized between runs since Python 3.3, i.e. you can only rely on it returning the same value within a single run of a program.
  • John La Rooy over 10 years
    Especially since Python 3.3, hash(somestring) is different between runs.
  • John La Rooy over 10 years
    This is a bad idea. See my comments on other answers.
  • John La Rooy over 10 years
    @Lucas. No, there are many more different possible strings than hash values, so you must get collisions sometimes.
  • tinylambda over 10 years
    Yes! It's OK to run the above code in Python 2.3.
  • Kyle Kelley over 10 years
    Great. Thanks for that @gnibbler, I didn't know it wasn't stable between runs.
  • 0 _ over 6 years
    An example of hash collision for strings in Python 2 is described at: stackoverflow.com/q/37127946/1959808