Pros and cons of using md5 hash of URI as the primary key in a database

database primary-key guid md5 uri

18,024

Solution 1

It's perfectly fine.

Accidental collision of MD5 is impossible in all practical scenarios (to get a 50% chance of collision you'd have to hash 6 billion URLs per second, every second, for 100 years).

It's such an improbable chance that you're trillion times more likely to get your data messed up due to an undetected hardware failure than due to an actual collision.

Even though there is a known collision attack against MD5, intentional malicious collisions are currently impossible against hashed URLs.

The type of collision you'd need to intentionally collide with a hash of another URL is called a pre-image attack. There are no known pre-image attacks against MD5. As of 2017 there's no research that comes even close to feasibility, so even a determined well-funded attacker can't compute a URL that would hash to a hash of any existing URL in your database.
The only known collision attack against MD5 is not useful for attacking URL-like keys. It works by generating a pair of binary blobs that collide only with each other. The blobs will be relatively long, contain NUL and other unprintable bytes, so they're extremely unlikely to resemble anything like a URL.

Solution 2

After browsing stackoverfow a little more I found an earlier question Advantages and disadvantages of GUID / UUID database keys which covers much of this ground.

Solution 3

Multiple strings can produce the same md5 hash. Primary keys must be unique. So using the hash as the primary key is not good. Better is to use the GUID directly.

Is a GUID suitable for use in a URL. Sure. Here's a GUID (actually, a UUID) I jsut created using Java: 1ccb9467-e326-4fed-b9a7-7edcba52be84

The url could be:

http://example.com/view?id=1ccb9467-e326-4fed-b9a7-7edcba52be84

It's longish but perfectly usable and achieves what you describe.

18,024

Author by

rdmpage

Updated on June 05, 2022

Comments

rdmpage almost 2 years
I'm building a database that will store information on a range of objects (such as scientific papers, specimens, DNA sequences, etc.) that all have a presence online and can be identified by a URL, or an identifier such as a DOI. Using these GUIDs as the primary key for the object seems a reasonable idea, and I've followed delicious and Connotea in using the md5 hash of the GUID. You'll see the md5 hash in your browser status bar if you mouse over the edit or delete buttons in a delicious or Connotea book mark. For example, the bookmark for http://stackoverflow/ is
```
http://delicious.com/url/e4a42d992025b928a586b8bdc36ad38d
```
where e4a42d992025b928a586b8bdc36ad38d ais the md5 hash of http://stackoverflow/.

Does anybody have views on the pros and cons of this approach?

For me an advantage of this approach (as opposed to using an auto incrementing primary key generated by the database itself) is that I have to do a lot of links between objects, and by using md5 hashes I can store these links externally in a file (say, as the result of data mining/scraping), then import them in bulk into the database. In the same way, if the database has to be rebuilt from scratch, the URLs to the objects won't change because they use the md5 hash.

I'd welcome any thoughts on whether this sounds sensible, or whether there other (better?) ways of doing this.