Create your own MD5 collisions

48,663

Solution 1

These following two different 128 byte sequences hash to the same:

MD5 Hash: 79054025255fb1a26e4bc422aef54eb4

The differences below are highlighted (bold). Sorry it's kind of hard to see.

d131dd02c5e6eec4693d9a0698aff95c 2fcab58712467eab4004583eb8fb7f89 
55ad340609f4b30283e488832571415a 085125e8f7cdc99fd91dbdf280373c5b 
d8823e3156348f5bae6dacd436c919c6 dd53e2b487da03fd02396306d248cda0 
e99f33420f577ee8ce54b67080a80d1e c69821bcb6a8839396f9652b6ff72a70

and

d131dd02c5e6eec4693d9a0698aff95c 2fcab50712467eab4004583eb8fb7f89 
55ad340609f4b30283e4888325f1415a 085125e8f7cdc99fd91dbd7280373c5b 
d8823e3156348f5bae6dacd436c919c6 dd53e23487da03fd02396306d248cda0 
e99f33420f577ee8ce54b67080280d1e c69821bcb6a8839396f965ab6ff72a70

The visualization of the collision/block1 (Source: Links.Org)

alt text

The visualization of the collision/block2 (Source: Links.Org)

alt text

Solution 2

It's hard to do it with just text files, AFAIK. You can get some collisions, but having them also be from just [a-zA-Z] is not easy (yet).

On the other hand, if you just want two "meaningful"-looking files with the same hash, you can do it with something like, say, PostScript: have different binary blobs causing the collision, and use a conditional expression to display different output accordingly.

See e.g. this problem (the H2 part) and solution. For example, this PS file and this one have the same MD5sum but they are both well-formed PostScript files that have entirely different text in them when you open them.

Solution 3

If you're talking about how likely a straightforward collision is - one where there is no deliberate attempt to cause one - then you're going to be disappointed: You'd need to generate on average 2^64 plaintexts before you can expect to see a collision, and that's substantially more than you're going to be able to do in a reasonable (or really, even an _un_reasonable) time.

If you're looking to demonstrate the difficulty of deliberately crafting a collision, other answers have already demonstrated that. The extra constraint of requiring the strings to be entirely textual makes even those approaches largely impractical, though.

Solution 4

I would take a look at Hashcash. With an effective hash algorithm, like md5, the time to calculate a collision to exponential with the number of bits. What Hashcash does is calculates partial collisions. That is, a match of say the lower 16 bits of the hash. To get the lower 16 bits to match, one would have to try hashing 2^15 different combinations on average. If you know how long it takes to come up with a 16, 24, or 32 bit collision, then you can easily calculate out the time for higher numbers of bits.

Share:
48,663

Related videos on Youtube

russau
Author by

russau

Russ Sayers Twitter: @russaus https://github.com/russau http://tetrisapp.appspot.com/ http://twitterheatmap.com/ http://www.imdb.com/name/nm5363927/

Updated on July 09, 2020

Comments

  • russau
    russau almost 4 years

    I'm doing a presentation on MD5 collisions and I'd like to give people any idea how likely a collision is.

    It would be good to have two blocks of text which hash to the same thing, and explain how many combinations of [a-zA-Z ] were needed before I hit a collision.

    The obvious answer is hash every possible combination until hit two hashes the same. So how would you go about coding this. As a quick experiment I tried hashing every combination of 5 columns of [A-Z], storing this in a .net hashtable and catching the collision exception. Two problems with this - the hashtable eventually times out, and I'm pretty sure I'm going to need A LOT more characters.

    Obviously this data structure is too big to handle in memory, so now I'll have to get a database involved. Also sounds like a good project to test out azure - a bit like these guys.

    Can anyone point me in the direction of an efficient way of doing this?

  • Shalmanese
    Shalmanese almost 15 years
    This is incorrect due to the birthday paradox: en.wikipedia.org/wiki/Birthday_paradox. In particular, see en.wikipedia.org/wiki/…
  • russau
    russau almost 15 years
  • causa prima
    causa prima almost 15 years
    Note that I said BY CHANCE.
  • russau
    russau almost 15 years
    Fair enough - when you say 'by chance' you mean 'brute force'? so my question is really asking for something more efficient that brute forcing it. or running the brute force combinations on a server farm like azure.
  • Nick Johnson
    Nick Johnson almost 15 years
    No - that's why I said 2^64, not 2^128. The birthday 'paradox' predicts a collision (on average) after 2^(numbits/2).
  • causa prima
    causa prima almost 15 years
    He wants to give people an impression of how likely they are--thus by chance, or brute force if you want to call it that, is the right metric.
  • Iceland_jack
    Iceland_jack over 12 years
    Actual code for testing this (in Python, perl).
  • Mathias Bynens
    Mathias Bynens about 11 years
    Actual code for testing this in JavaScript: gist.github.com/mathiasbynens/5525001
  • Doug Richardson
    Doug Richardson over 8 years
    Did you mean HashClash?
  • Julian F. Weinert
    Julian F. Weinert over 7 years
    Got an even nicer example! It has two completely different images that are basically a collision: natmchugh.blogspot.de/2015/02/…
  • Hashmatullah Noorzai
    Hashmatullah Noorzai over 5 years
    @Iceland_jack Python link is not reachable!
  • cowlinator
    cowlinator over 2 years
    For those confused like i was, both the red and green pixels represent diffs. The black and white ones represent sames.