hashing function guaranteed to be unique?

19,678

Solution 1

It's logically impossible to get a 32 byte code from a 200 byte source which is unique among all possible 200 byte sources, since you can store more information in 200 bytes than in 32 bytes.

They only exception would be that the information stored in these 200 bytes would also fit into 32 bytes, in which case your source date format would be extremely inefficient and space-wasting.

Solution 2

When hashing (as opposed to encrypting), you're reducing the information space of the data being hashed, so there's always a chance of a collision.

The best you can hope for in a hash function is that all hashes are evenly distributed in the hash space and your hash output is large enough to provide your "once-per-10-ages-of-the-universe sort of deal" as you put it!

So whether a hash is "good enough" for you depends on the consequences of a collision. You could always add a unique id to a checksum/hash to get the best of both worlds.

Solution 3

Why don't you use a unique ID from your database?

Solution 4

The probability of two hashes will likely to collide depends on the hash size. MD5 produces 128-bit hash. So for 2128+1 number of hashes there will be at least one collision.

This number is 2160+1 for SHA1 and 2512+1 for SHA512.

Here this rule applies. The more the output bits the more uniqueness and more computation. So there is a trade off. What you have to do is to choose an optimal one.

Solution 5

Could two ~200 char bytearrays MD5-hash down to the same string?

Considering that there are more 200 byte strings than 32 byte strings (MD5 digests), that is guaranteed to be the case.

All hash functions have that problem, but some are more robust than MD5. Try SHA-1. git is using it for the same purpose.

Share:
19,678
Max Williams
Author by

Max Williams

Updated on July 13, 2022

Comments

  • Max Williams
    Max Williams almost 2 years

    In our app we're going to be handed png images along with a ~200 character byte array. I want to save the image with a filename corresponding to that bytearray, but not the bytearray itself, as i don't want 200 character filenames. So, what i thought was that i would save the bytearray into the database, and then MD5 it to get a short filename. When it comes time to display a particular image, i look up its bytearray, MD5 it, then look for that file.

    So far so good. The problem is that potentially two different bytearrays could hash down to the same MD5. Then, one file would effectively overwrite another. Or could they? I guess my questions are

    • Could two ~200 char bytearrays MD5-hash down to the same string?
    • If they could, is it a once-per-10-ages-of-the-universe sort of deal or something that could conceivably happen in my app?
    • Is there a hashing algorithm that will produce a (say) 32 char string that's guaranteed to be unique?