"Alphanumeric" hash - A-Z, 0-9

10,453

How about generating your SHA256 and then Base36 encoding the result? No left over bits, no bias...

That way you have the cryptographic strength of a proven algorithm (remember to salt and use multiple hash iterations) along with the alphanumeric representation that you need.

Share:
10,453
KeithS
Author by

KeithS

Updated on September 15, 2022

Comments

  • KeithS
    KeithS over 1 year

    I'm looking for a function that will generate an "alphanumeric hash". Given a source string, it produces a determinate result string that can contain any letter a-z or digit 0-9, and cannot be reverse-engineered to produce the source. This will be used to generate passwords for a system based on secret data, so strings between 8 and 12 characters are ideal and a secure hash would also be ideal.

    I'm thinking I can use a normal bitwise hash, XOR-fold it to 64 bits (if I use, for instance, SHA256) and then take the result 5 bits at a time (producing a number 0-31) and look up the character code to use from an indexed ordered collection. There are 26 letters and 10 digits meaning I'll have to leave a few out (probably removing characters that could be mistaken for others if handwritten). 64 bits, 5 bits at a time, will produce a 12-character string with 4 bits left over.

    However, I'm worried about two things: first, introducing bias by taking a non-power-of-2 number of bits; and second, what to do with the leftover bits. Do I use them as-is knowing there will only be 16 possibilities, do I leave them off (and lose data possibly introducing bias), or do I incorporate one more bit to make a 13-character string (and where should the last bit come from)?

    EDIT: Here's my current stab at it; it takes an enumerable of bytes (like the byte array produced by most hash algorithms) and returns a string:

        /// <summary>
        /// Converts an IEnumerable of bytes to a string representation which can have any lowercase letter a-z except for l, o, q and z, and any digit 0-9.
        /// Uses 5 bits of the byte array at a time to generate numbers from 0 to 31, which are then translated to letters or numbers.
        /// </summary>
        /// <param name="toConvert">the byte array to convert.</param>
        /// <returns>A string containing the alphanumeric case-insensitive representation of the bytes in the array.</returns>
        public static string ToInsensitiveAlphaNumericString(this IEnumerable<byte> toConvert)
        {
            var chars = new[]
                            {
                                'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'm', 'n', 'p', 'r', 's', 't',
                                'u', 'v', 'w', 'x', 'y', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
                            };
    
            var enumerator = toConvert.GetEnumerator();
            enumerator.MoveNext();
    
            int buffer = enumerator.Current;
            short bufferLength = 8;
            const int valueLength = 5;
    
            var builder = new StringBuilder();
    
            while (true)
            {
                var value = buffer >> (bufferLength - valueLength);
    
                builder.Append(chars[value]);
    
                buffer = buffer - (value << (bufferLength - valueLength));
                bufferLength -= valueLength;
    
                if(bufferLength < valueLength )
                {
                    if (enumerator.MoveNext())
                    {
                        buffer = (buffer << 8) + enumerator.Current;
                        bufferLength += 8;
                    }
                    else
                    {
                        //here's the main question; to include, or not to include?
                        if (bufferLength > 0)
                            builder.Append(chars[buffer]);
                        break;
                    }
                }
            }
    
            return builder.ToString();
        }
    
  • John Mitchell
    John Mitchell almost 12 years
    If your looking to just disguise what system you used for hashing and don't want the result to be easily reversible to the output of the hash function, just do this and apply a Caesar shift to it - simple, old, but can still give things an interesting twist :)
  • David
    David almost 12 years
    This solution will work fine, but it's worth noting that there is still a biased character. If the base-36 representation of 2^{256} has 6 as its first digit (I think it does, but I have only checked sloppily), then every encoded value will have a value between 0 and 6 in that position.
  • Eric J.
    Eric J. almost 12 years
    @David: You will have a bias of the character positions but that does not matter. Base36 is just a human-convenient representation of a perfectly unbiased 256-bit number.
  • David
    David almost 12 years
    Yep, I totally agree. I just mentioned it because the OP mentioned concern about biased characters (when he was already talking about a human-convenient representation of a perfectly unbiased 64-bit number).
  • KeithS
    KeithS almost 12 years
    This is a good idea, and trivial on a 64-bit number. Slightly more complex on a number stored in multiple bytes, but still totally possible.
  • KeithS
    KeithS almost 12 years
    Only one possible issue with this; a hash value can be any integer value that will fit in the hash's generated bits. A UInt64, for instance, can be zero, or it can be 18,446,744,073,709,551,615. I'm only guaranteed to get a 12-char password if the value is greater than 36^12 (about a 75% chance), and there is a not-insignificant chance of 10 or fewer characters. While infinitesimal, there is the possibility that I could end up with a single-character password.
  • Eric J.
    Eric J. almost 12 years
    If you pad with 0's, you'll still end up with suitably long passwords. The character sequence "0000abcd" is just as probable as "0001abcd" so an attacker would not gain any benefit by knowing that certain values are zero-padded. Those values still have to be tried, and are just as probable as any other sequence. Short passwords aren't bad per se, the are bad because humans tend to pick very short passwords creating a strong statistical bias toward short passwords. "a" would be just as good a password as "dut6dja1jdi9" if not for that bias.
  • granadaCoder
    granadaCoder over 2 years
    For future readers. If you can be a little more "flexible".. to all alpha-numerics (including lower-case, which the original question does not desire)....then you can consider Base62, and this SOF answer has some dotnet libraries that exist : stackoverflow.com/questions/17745025/… but those dotnet libraries may help you find java/python/other equivalents
  • granadaCoder
    granadaCoder over 2 years
    And during this process of discovery I learned about base58. See : en.wikipedia.org/wiki/Base62 " The 0OIl characters are not used in the base58 encoding scheme."