Does an identical cryptographic hash or checksum for two files mean they are identical?

Solution 1

When the hashes are identical, does this mean that the file contents are 1:1 the same?

All files are a collection of bytes (values 0-255). If two files' MD5 hashes match, both those collections of bytes are extremely likely to be exactly the same (same order, same values).

There's a very small chance that two different files can generate the same MD5, which is a 128-bit hash. The probability is:

The probability of just two hashes accidentally colliding is 1/2^128, which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456. (From an answer on StackOverflow.)

Hashes are meant to work in "one direction only" - i.e. you take a collection of bytes and get a hash, but you can't take a hash and get back a collection of bytes.

Cryptography depends on this (it's one way two things can be compared without knowing what those things are.)

Around the year 2005, methods were discovered to create two documents that have the same MD5 hash (a collision attack); see @user2357112's comment below. This means an attacker can create two executables, for example, that have the same MD5, and if you are depending on MD5 to determine which to trust, you'll be fooled.

Thus MD5 should not be used for cryptography or security. Publishing an MD5 hash on a download site to ensure download integrity, for example, is bad practice. Depending on an MD5 hash you did not generate yourself to verify file or data contents is exactly what you want to avoid.

If you generate your own, you know you're not being malicious to yourself (hopefully). So for your use, it's OK, but if you want someone else to be able to reproduce it, and you want to publicly publish the MD5 hash, a better hash should be used.
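
For illustration, here is a minimal Python sketch of generating such a stronger hash (SHA-256 here) for one of the files from the question; the chunked read is just a common pattern for handling large files:

    import hashlib

    def file_sha256(path, chunk_size=65536):
        """Return the SHA-256 hex digest of a file, read in chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    print(file_sha256("fileone.xls"))  # file name taken from the question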


Note that it's possible for two Excel files to contain the same values in the same rows and columns, but for the bytestream of the file to be completely different due to different formatting, styles, settings, etc.

If you want to compare the data in the files, export both to CSV with the same rows and columns first, to strip out all formatting, and then hash or compare the CSVs.
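
A rough sketch of that approach in Python, assuming the spreadsheets have already been exported to the hypothetical files fileone.csv and filetwo.csv:

    from pathlib import Path

    # "fileone.csv" and "filetwo.csv" are hypothetical CSV exports of the two .xls files.
    same = Path("fileone.csv").read_bytes() == Path("filetwo.csv").read_bytes()
    print("data identical" if same else "data differs")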

Solution 2

In practice, yes, an identical cryptographic hash means the files are the same, as long as the files were not crafted by an attacker or other malicious entity. The odds of a random collision with any well-designed cryptographic hash function are so small as to be negligible in practice and in the absence of an active attacker.

In general, however, no, we cannot say that two arbitrary files having the same hash definitely means that they are identical.

The way a cryptographic hash function works is to take an arbitrary-length input and output a fixed-length value computed from the input. Some hash functions have multiple output lengths to choose from, but once chosen, the output is a fixed-length value. This value will be up to a few dozen bytes long; the hash algorithms with the longest output value in common use today have a 512-bit output, and a 512-bit output is 64 bytes.

If an input to a hash function is longer than the output of the hash function, some fidelity must be removed to make the input fit in the output. Consequently, there must exist multiple inputs of lengths greater than the length of the output, which generate the same output.
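
A quick Python illustration of this fixed-length property, using SHA-512 (the 64-byte output mentioned above): inputs of wildly different lengths all produce a digest of the same size.

    import hashlib

    for data in (b"", b"hello", b"x" * 1_000_000):
        digest = hashlib.sha512(data).digest()
        print(f"{len(data):>9} input bytes -> {len(digest)} digest bytes")  # always 64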

Let's take the current workhorse, SHA-256, as an example. It outputs a hash of 256 bits, or 32 bytes. If you have two files which are each exactly 32 bytes long, but different, these should (assuming no flaw in the algorithm) hash to different values, no matter the content of the files; in mathematical terms, the hash is a function mapping a 2^256 input space onto a 2^256 output space, which should be possible to do without collisions. However, if you have two files that are each 33 bytes long, there must exist some combination of inputs that give the same 32-byte output hash value for both files, because we're now mapping a 2^264 input space onto a 2^256 output space; here, we can readily see that there should, on average, exist 2^8 inputs for every single output. Take this further, and with 64-byte files there should exist 2^256 inputs for every single output!

Cryptographic hash functions are designed such that it's computationally difficult to compose an input that gives a particular output, or compose two inputs that give the same output. This is known as preimage attack resistance or collision attack resistance. It's not impossible to find these collisions; it's just intended to be really, really, really, really hard. (A bit of a special case of a collision attack is a birthday attack.)

Some algorithms are better than others at resisting attackers. MD5 is generally considered completely broken these days, but last I looked, it still sported pretty good first-preimage resistance. SHA-1 is likewise effectively broken; collision attacks have been demonstrated, but they still require specific conditions, though there's no reason to believe that will remain the case indefinitely; as the saying goes, attacks always get better, they never get worse. SHA-256/384/512 are currently still believed safe for most purposes. However, if you're just interested in seeing whether two non-maliciously-crafted, valid files are the same, then any of these should be sufficient, because the input space is sufficiently constrained already that you'd mostly be interested in random collisions. If you have any reason to believe that the files were crafted maliciously, then you need at the very least a cryptographic hash function that is currently believed safe, which puts the lower bar at SHA-256.

A first preimage attack is finding an input that yields a specific output hash value; a second preimage attack is finding one input that gives the same output as another, specified input; a collision attack is finding two inputs that yield the same output, without regard to what that output is and sometimes without regard to what the inputs are.

All that said, it's important to keep in mind that the files may have very different data representations and still display exactly the same. So they can appear to be the same even though their cryptographic hashes don't match, but if the hashes match then they are extremely likely to appear the same.

Solution 3

It's a probability game... hashes are able to represent a finite number of values.

If we consider a hypothetical (and very weak) 8-bit hashing algorithm, then this can represent 256 distinct values. As you start to run files through the algorithm, you will start to get hashes out... but before long you will start to see "hash collisions". This means that two different files were fed into the algorithm, and it produced the same hash value as its output. Clearly here, the hash is not strong enough, and we cannot assert that "files with matching hashes have the same content".
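
This is easy to demonstrate. Below is a minimal Python sketch that fakes the hypothetical weak 8-bit hash by keeping only the first byte of SHA-256: with just 256 possible outputs, random inputs collide almost immediately (around 20 inputs on average, by the birthday bound).

    import hashlib
    import os

    def weak_8bit_hash(data: bytes) -> int:
        # Hypothetical weak hash: keep only the first byte of SHA-256 (256 values).
        return hashlib.sha256(data).digest()[0]

    seen = {}  # hash value -> first input that produced it
    count = 0
    while True:
        data = os.urandom(16)  # a random 16-byte "file"
        count += 1
        h = weak_8bit_hash(data)
        if h in seen and seen[h] != data:
            print(f"collision after {count} inputs (hash value {h})")
            break
        seen[h] = data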

Extending the size of the hash, and using stronger cryptographic hashing algorithms can significantly help to reduce collisions, and raise our confidence that two files with the same hash have the same content.

This said, we can never reach 100% certainty - we can never claim for sure that two files with the same hash truly have the same content.

In most / many situations this is fine, and comparing hashes is "good enough", but this depends on your threat model.

Ultimately, if you need to raise the certainty levels, then I would recommend that you do the following:

  1. Use strong hashing algorithms (MD5 is no longer considered adequate if you need to protect against potentially malicious users)
  2. Use multiple hashing algorithms
  3. Compare the size of the files - an extra data point can help to identify potential collisions, but note that the demonstrated MD5 collision did not need to alter the data's length.

If you need to be 100% sure, then by all means start with a hash, but if the hashes match, follow it up with a byte-by-byte comparison of the two files.
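
Putting those recommendations together, here is a minimal Python sketch (the file names are placeholders): check sizes first, then a strong hash, and finally a byte-by-byte comparison for full certainty.

    import filecmp
    import hashlib
    import os

    def sha256_of(path: str) -> bytes:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.digest()

    def files_identical(a: str, b: str) -> bool:
        if os.path.getsize(a) != os.path.getsize(b):
            return False               # cheapest check: different sizes, different files
        if sha256_of(a) != sha256_of(b):
            return False               # strong hash disagrees
        return filecmp.cmp(a, b, shallow=False)  # byte-by-byte, for 100% certainty

    print(files_identical("fileone.xls", "filetwo.xls"))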


Additionally, as pointed out by others... the complexity of documents produced by applications such as Word and Excel means that the text, numbers, and visible layout can be the same, while the data stored in the file is different.

Excel is particularly bad at this - simply opening a spreadsheet and saving it (having changed nothing) can produce a new file with different content.

Solution 4

Short answer: A cryptographic hash is supposed to help you be reasonably confident that files with matching hashes are the same. Unless deliberately crafted, the chances of two slightly different files having similar hash values are ridiculously small. But when it comes to comparing and verifying files that could be deliberately tampered with, MD5 is a poor choice. (Use another hash function like SHA-3 or BLAKE2.)

Long answer: An ideal hash function is one that creates an almost unique cryptographic hash for every unique piece of data. In other words, while we definitely know that there are two files in this universe whose hash values collide, the chance of those two files naturally coming together is ridiculously small.

Ten years ago, I decided I must stay as far as I can from MD5. (Of course, until yesterday, I remembered the wrong reason for doing so; ten years is a long time, you see. I revisited my past memos to remember why and edited this answer.) You see, in 1996, MD5 was found to be susceptible to collision attacks. 9 years later, researchers were able to create pairs of PostScript documents and (ouch!) X.509 certificates with the same hash! MD5 was clearly broken. (Megaupload.com was also using MD5, and there was a lot of hanky-panky around hash collisions that gave me trouble at the time.)

So, I concluded that while MD5 was (and still is) reliable for comparing benign files, one must stop using it altogether. I reasoned that reliance on it risks turning into indulgence and false confidence: once you start comparing files using their MD5 hashes, one day you forget the security fine print and compare two files that are deliberately crafted to have the same hash. In addition, CPUs and cryptoprocessors were unlikely to add support for it.

The original poster, however, has even less reason to use MD5, because:

  1. As long as one is comparing two files only, byte-for-byte comparison is actually faster than generating one's own MD5 hashes, because the comparison can stop at the first difference (see the sketch after this list). For comparing three or more files... well, now you have a legitimate cause.
  2. The OP specified "ways to review this without installing a bunch of plugins". Windows PowerShell's Get-FileHash command can generate SHA1, SHA256, SHA384, SHA512 and MD5 hashes. On modern computers with hardware support for SHA hash functions, generating them is even faster than MD5.
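
To illustrate point 1, here is a minimal Python sketch: a direct comparison can stop at the first differing chunk, whereas hashing must always read both files in full. The chunk size is an arbitrary choice.

    def bytes_equal(path_a: str, path_b: str, chunk_size: int = 65536) -> bool:
        """Compare two files directly, stopping at the first difference."""
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                a, b = fa.read(chunk_size), fb.read(chunk_size)
                if a != b:
                    return False  # bail out early; a hash comparison never can
                if not a:         # both files ended at the same point
                    return True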

Solution 5

If two files have the same MD5 hash, and they haven't both been specially crafted, then they're identical. How hard it is to craft files with the same MD5 hash depends on the file format; I don't know how easy it is with Excel files.

So if you have files of your own that are just lying around and want to find duplicates, MD5 is safe. If you wrote one of the files, and the other file is of dubious origin, MD5 is still safe (the only way to get different files with the same MD5 checksum is to craft both files). If someone you don't trust sends you a budget proposal, and later sends another file which they claim is the same, then MD5 may not be enough.

To avoid any risk, use SHA-256 or SHA-512 instead of MD5. If two files have the same SHA-256 hash, then they're identical. The same goes for SHA-512. (There's a theoretical possibility that they could be different, but the probability of this happening accidentally is so much less than the probability of your computer flipping a bit during the verification that it just isn't relevant. As for someone deliberately crafting two files with the same hash, nobody knows how to do this for SHA-256 or SHA-512.)

If two Excel files have different hashes, then they're different, but there's no way to know by how much they differ. They could have identical data but different formatting, or they could just differ in the properties, or they might have been saved by different versions. In fact if Excel is anything like Word then merely saving a file updates its metadata. If you only want to compare the numerical and text data and ignore formatting and properties, you can export the spreadsheets to CSV to compare them.

If you have Unix/Linux tools available, then you can use cmp to compare two files. To compare two files on the same machine, checksums only make things more complicated.


Comments

  • sam
    sam over 1 year

    I have 2 excel documents and I want to check if they are exactly the same, apart from the file name.

    For example, the files are called fileone.xls and filetwo.xls. Apart from the file names, their contents are presumed to be identical, but this is what I want to check.

    I've been looking for ways to review this without installing a bunch of plugins. There doesn't seem to be a straightforward way.

    I've tried generating MD5 hashes for both files. When the hashes are identical, does this mean that the file contents are 1:1 the same?

    • dave_thompson_085
      dave_thompson_085 almost 6 years
      cryptohashes and sometimes even normal hashes can be useful for comparing files on different systems, or searching among large numbers of files, but if two files are on the same system you can easily just compare them with cmp on Unix or fc (file compare) on Windows.
    • styrofoam fly
      styrofoam fly almost 6 years
      shattered.io - SHA1 is a "stronger" hashing algorithm than md5 and still shattered.io/static/shattered-1.pdf and shattered.io/static/shattered-2.pdf have the same hash value while being completely different.
    • Emilio M Bumachar
      Emilio M Bumachar almost 6 years
      Side note: check their sizes first. If they have different sizes, don't bother opening the files, they're different.
    • Euro Micelli
      Euro Micelli almost 6 years
      Simplistic version: an MD5 hash is good enough to protect against an accident, it is not good enough to protect against maliciousness. Whether that's good enough for you, you have to decide based on your circumstances.
    • Bakuriu
      Bakuriu almost 6 years
      diff -s file1 file2 - if it says they are identical, they are identical (it actually compares the files byte-per-byte, so even hash collisions are excluded). Checksums are used when you only have one hash and an item that is thought to be identical to the originator of that hash.
    • Patrick Mevzek
      Patrick Mevzek almost 6 years
      @EmilioMBumachar depends on the definition of "different". The byte content may be different, but not the semantic content. For example, if you just add whitespace after the final text. Or, in some structured format, if you have padding, which can be any length without any displayed content.
    • technical_difficulty
      technical_difficulty almost 6 years
      Pigeonhole Principle
    • Acccumulation
      Acccumulation almost 6 years
      Comparing two files takes less computation than hashing them. Where hashes are useful is when you have a large number of files and want to check whether any pair are identical.
    • Nonny Moose
      Nonny Moose almost 6 years
      TL;DR: Probably.
    • Konrad Rudolph
      Konrad Rudolph almost 6 years
      @Bakuriu Or cmp -s, which is probably more efficient.
    • David Rice
      David Rice almost 6 years
      What do you mean by their contents being identical? If I have two files, both with identical cell values but the fonts are different, are they identical? If I have two files where every cell value and styling is the same, but the file stores them in different orders, are they the same?
    • b0fh
      b0fh almost 6 years
      Don't forget that some operating systems may store more than one data stream in a file. NTFS has alternate streams, *nix has posix extended user attributes, the old MacOS had the resource fork. So, if you are afraid of someone adding hidden information to a file, it's not enough to hash the main data stream.
    • Eric Duminil
      Eric Duminil almost 6 years
      @Acccumulation comparing two files over a network requires much less bandwidth with a hash, though.
  • Kamil Maciorowski
    Kamil Maciorowski almost 6 years
    You can create your own cryptographic hash function of any length you choose, true; but then it has a fixed length and the pigeonhole principle applies anyway. The general answer is: "by comparing their hashes only, you cannot be sure the two files are identical".
  • Admin
    Admin almost 6 years
    @KamilMaciorowski In theory, yes I can. My custom-made hash function can simply generate a copy of the largest file. But I have no interest in discussing this further; the truth is, you downvoted for a reason that amounts to nitpicking just to prove you are smarter and it backfired on you. Now you can't take the vote back.
  • Attie
    Attie almost 6 years
    I agree with @KamilMaciorowski... It's a probability game... using a single hash, you can be "reasonably confident" that files with matching hashes are the same, but there is no 100% guarantee. Using better algorithms, or using multiple algorithms can improve your confidence - even comparing file sizes can help... but you can never be 100% confident without checking byte-for-byte.
  • Admin
    Admin almost 6 years
    @Attie Huh! That's what I originally meant. Thanks. 🙏 Only I am not familiar with chic phrases like "you can be reasonably confident". Sorry. 😜 Still, that's why we have an edit button. I personally would never trash a good answer just because one word in it is wrong. I edit it.
  • BeowulfNode42
    BeowulfNode42 almost 6 years
    Excel files, and other office documents can also have different hashes because they have been opened and re-saved without changing anything, due to the metadata in the file having a new value stored in there for the last saved datetime.
  • Chris H
    Chris H almost 6 years
    MD5 is no longer considered adequate is very true cryptographically but for uniqueness checking (in the absence of malice, e.g. if you control the input) it's nice and fast (and 128 bits should be plenty)
  • Kamil Maciorowski
    Kamil Maciorowski almost 6 years
    About "trashing a good answer": please note I ensured first it's not a typo and you really mean it; then downvoted and at the same time I gave you feedback, disclosed my reason in a hope your answer will get better. It did, so my downvote is no more. Basically I told you what I think was wrong with your answer, Attie helped to clarify, you improved the answer. From my point of view we all handled this situation properly and the whole story turned out very well. Thank you.
  • Monty Harder
    Monty Harder almost 6 years
    Bonus: if you've exported to CSV, you can use the venerable diff or similar utility to actually confirm the files are byte-for-byte identical, rather than just having the same hash.
  • Paŭlo Ebermann
    Paŭlo Ebermann almost 6 years
    "a hash collision with identical file sizes is less likely." – Actually, the collisions built for MD5 usually have same file size.
  • user2357112
    user2357112 almost 6 years
    Taking a hash and creating data that matches the hash is a preimage attack. I believe MD5 is currently vulnerable to collision attacks, but I don't think preimage or second-preimage attacks are currently viable.
  • TripeHound
    TripeHound almost 6 years
    "follow it up with a byte-by-byte comparison of the two files." If you're going to do a file-comparison, you may as well do it first... no point reading all of each file to compute their hashes only to re-read both files to compare them!
  • Bakuriu
    Bakuriu almost 6 years
    @Tim what are you saying? He said: export them to CSV and use diff -s to check whether the CSV are identical. In fact you can diff -s even the excel files: if diff says they are identical you don't need to go to CSV comparison.
  • Attie
    Attie almost 6 years
    @TripeHound It depends if the files are both local or not... if you already have a hash of one and are introducing a new file to the system, if the new file needs a hash stored in a database anyway, etc... Make the call that suits your situation.
  • Tim
    Tim almost 6 years
    @Bakuriu Clearly my comment was very poorly worded - I meant exporting to CSV will lose lots of information - notably formulae, charts, conditional and standard formatting.
  • Attie
    Attie almost 6 years
    @PaŭloEbermann I'd misremembered - thanks & fixed.
  • Attie
    Attie almost 6 years
    @ChrisH fair point, thanks - I've updated my answer.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 6 years
    If the hashes match then either the files are a result of a deliberate collision, or they aren't and then they are guaranteed to be the same. The probability of an accidental collision is purely theoretic. Saying that “if the hashes match then they are highly likely to appear the same” is misleading: if there's malice afoot and it's a collision situation then they aren't likely to be the same, and otherwise the probability is effectively zero, it isn't some low-probability event that needs to be defended against.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 6 years
    No, it is not a probability game. You're misestimating how unlikely an accidental collision is. It just won't happen. Flipping a bit during the comparison is more likely. On the other hand, in some scenarios, a deliberate collision might happen, and that is not a probability game at all.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 6 years
    The question is about MD5, which has no risk of accidental collisions. It does have a risk of deliberate collisions, but that's not a matter of probabilities.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 6 years
    The probability of accidental collisions is too low to take into account. The risk of a deliberate collision exists for MD5 as well and is worse than for SHA-1 which is not terribly relevant here.
  • mckenzm
    mckenzm almost 6 years
    It's also about Excel spreadsheets with different names; how large can they be that a byte-for-byte comparison isn't an option? Two hashing schemes together would provide certainty.
  • Admin
    Admin almost 6 years
    @KamilMaciorowski Very well; I take that as an act of good faith.😊
  • Damon
    Damon almost 6 years
    @Gilles: On the contrary. Michael's wording is exactly right, and "guaranteed" is misleading (or, well, factually wrong). The likelihood of two files with identical hashes not matching (notwithstanding malicious modification) is extremely low, and can be neglected in practice. It is, however, not zero. There is generally a chance that, for whatever reason, different inputs will produce the same hash, and possibly even with a likelihood much higher than 2^-128 (cryptographic algorithms are black art, the algorithm may be flawed in a subtle, unknown way and we have no way of being 100% sure).
  • user
    user almost 6 years
    I tweaked the answer slightly to try to address the comments above.
  • Attie
    Attie almost 6 years
    @Gilles "effectively zero" is still not zero, which means there is still some (admittedly small) probability that two different sets of data will result in the same hash. You can't argue against that.
  • mbrig
    mbrig almost 6 years
    @Gilles no, as Attie says, it's literally a probability game, based on how many bits are in the hash and how many files you are expecting to work with. A 32-bit hash would probably work just fine (barring maliciousness) for an average desktop user (CRC32 is still popular for some kinds of downloaded video files), but not for a Google-scale big data filesystem.
  • smls
    smls almost 6 years
    "we cannot say that" -- We can say it with more certainty than pretty much every mundane belief that any of us holds true about the world. Saying it is true is a perfectly legitimate use of that word; or else nothing is.
  • supercat
    supercat almost 6 years
    @Attie: The probability of two unrelated files hashing to the same value is so far below the probability of many other things that can go wrong (e.g. random bit errors corrupting files on disk) that it's not worth guarding against coincidental matches. Guarding against deliberately-engineered matches may be worthwhile, but accidental matches are so improbable that any effort spent guarding against them could likely be spent better elsewhere.
  • supercat
    supercat almost 6 years
    @mbrig: A 32-bit hash would have a significant risk of accidental mismatch. Going to 128 or 256 bits, however, makes a huge difference. With 128 bits, a billion monkeys each typing a billion decently-sized genuinely-random documents would have about a 0.3% chance of creating two documents with the same hash. With 256 bits, even if billions monkeys could type a billion decently-sized random documents per second for a billion years, the likelihood of any of those nonillions of documents having coincidentally-matching hash values would be vanishingly small.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 6 years
    @mbrig With a CRC, it's partly a probability game. With a cryptographic hash, even a broken one, probability is irrelevant: you will never win that game. Conversely, no matter what the hash is, you need to figure out whether deliberate collisions are an issue, and that is not about probability at all.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 6 years
    @Damon The likelihood of an accidental MD5 collision is too small to matter. It would be meaningless to take it into account without taking more likely events into account, such as the probability of a RAM error leading to the computer reporting the wrong result. An accidental collision is guaranteed not to happen because it will not happen. If it was false that identical hashes guarantee that there is no accidental collisions, then a counterexample would exist.
  • iheanyi
    iheanyi almost 6 years
    @Gilles wrong. You can't in one breath tell me that there's a chance, however small you rate it, that an accidental collision may occur, then in the very next guarantee that no collision can occur. Saying that is highly misleading, as it implies a property of the hashing algorithm that is already known to be completely false.
  • user541686
    user541686 almost 6 years
    How do you write about the insecurity of MD5 without suggesting a better hash like SHA256...
  • LawrenceC
    LawrenceC almost 6 years
    Honestly, the question is about comparison and not security.
  • Peter - Reinstate Monica
    Peter - Reinstate Monica almost 6 years
    You could emphasize more that it is a logical necessity that several blobs of data share the same hash. That is independent of the hashing algorithm and the difficulty of producing collisions: there are simply many more variations of data (namely, an infinite amount) than of hashes of a given size (e.g. 2^128 for a 128-bit hash). The relation data -> hash is a true function (which is probably, but not necessarily, surjective). The relation hash -> data is not a function.
  • Konrad Rudolph
    Konrad Rudolph almost 6 years
    @MontyHarder No need for CSV export, you can diff the Excel files directly. In fact, I’d strongly recommend doing exactly that (or, rather, using cmp, which is much more efficient in case the files mismatch).
  • Damon
    Damon almost 6 years
    @Gilles: I am well aware that that likelihood is so low it can be ignored. However, "does not matter" and "does not exist" are not the same things. It is possible for a collision to occur, and indeed collisions are guaranteed to happen (with extremely low likelihood). Stating that something is guaranteed not to happen means none less than it is impossible (with absolute certitude). That's really what "guaranteed" means, but it is verifiably not what is the case.
  • Damon
    Damon almost 6 years
    As a not-quite-right analogy (because there are about 10 orders of magnitude in between the two cases), you can state that you are guaranteed not to win the lottery. If that's guaranteed, why do people play the lottery then? Because, you know, every other week or so, some fool does win. Think about it, who is the bigger fool, the fool who got rich by winning what couldn't be won, or the fool who was smart enough not to play? The odds may be ridiculously, unreasonably unlikely -- but as long as there is a way, guaranteed is the wrong word.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 6 years
    @Damon Where that analogy fails is that there are many orders of magnitude between the two cases. Winning the lottery is very unlikely, but it's still more likely as, say, getting hit by a meteorite. And that is still much more likely than finding an MD5 collision (IIRC there exists a human who has been hit by a meteorite, so it can't be that unlikely).
  • Monty Harder
    Monty Harder almost 6 years
    @KonradRudolph but the two files could be identical in all meaningful ways and still have some metadata differences that cause diff to report that they are different binaries. The whole point of exporting to CSV was to remove that possibility.
  • user2357112
    user2357112 almost 6 years
    Finding two inputs that give the same hash is a collision attack, not a second preimage attack. A second preimage attack takes one input and finds a second input that gives the same hash; a collision attack is much easier, because you don't have to match a specific target.
  • JITHIN JOSE
    JITHIN JOSE almost 6 years
    @LawrenceC, Is the probability of two hashes colliding constant? I think it will depend on the data size; the larger the data, the higher the probability, isn't it?
  • user207421
    user207421 almost 6 years
    @Gilles All hashcodes have risk of accidental collisions, by definition. The only way out of that is to use the entire file as the hashcode. Your comment makes no sense.
  • Doktor J
    Doktor J almost 6 years
    @MontyHarder but you can get diff to tell you where the Excel files differ. If it's just some metadata garbage at the beginning or end of the file, you can then safely assume that the spreadsheet contents and formulae are otherwise reasonably identical.
  • David R Tribble
    David R Tribble almost 6 years
    If the file lengths are the same, and the files have the same hashcodes for more than one hash (using different hashing algorithms, of course), then the odds are even better that the files are identical.
  • Lionize
    Lionize almost 6 years
    @Damon: every week or two someone wins the lottery, but it is not the case that every week or two someone randomly generates an MD5 collision. It just depends whether Michael is talking about the abstract mathematical hash function (in which case we can talk about the probability), or actual real hash comparisons on real computers (in which case it doesn't really make sense to insist on a probability of a random hash collision, but not insist on the probability of a false match due to a far more likely hardware error in computing the hashes).
  • Lionize
    Lionize almost 6 years
    That said, I agree that "negligible" is a better way of putting it than what Gilles proposes. If someone is too dumb to know what the words negligible and/or probability mean, or is considering what to actually do in practice, then fall back to telling them that if the hashes match then the two cases to consider are that the files are the same or that it's an intentionally-generated collision. There is no point adding, "or a random MD5 collision; or a cosmic ray hit your RAM; or there was a previously undetected bug in sshlib", even though they technically are all candidates.
  • user
    user almost 6 years
    @user2357112 Fixed.
  • supercat
    supercat almost 6 years
    @Gilles: The only thing that would make it more of a probability game with CRC than some other hash is that CRCs are generally shorter. Otherwise, a CRC would often be less of a probability game because certain factors may create correlations. As a simple example, if Acme Spreadsheet ensures file integrity by storing a 32-bit CRC with every file, other tools to compute a files' CRC might report that all Acme Spreadsheet files have a CRC of zero since the appended CRC would be the pattern of bits that, when appended to the preceding content, would make the CRC of the combined file zero.
  • Ti Strga
    Ti Strga almost 6 years
    If it helps any, Excel (and other Office files whose filename extensions are all 4-character .???x) are just XML trees stored as ZIP format. You can rename myletter.docx or mycharts.xlsx into mywhatever.zip and then expand the file to see how the format is arranged. So if you plan to diff two such files, you should use a skip-past-leading-offset count appropriate to ZIP data, along with all the other caveats that go along with diff'ing ZIP files.
  • Peter Cordes
    Peter Cordes almost 6 years
    Otherwise there would be no point in computing them. If you've broken the laws of mathematics and invented a lossless compression function that can compress random data, violating the pigeonhole principle, it would be very valuable to use it! It would be highly convenient if a 128-bit hash did uniquely represent the whole contents of a file. Even if there was no decompression function to turn the hash back into the file a mathematically-impossible collision-free hash would be nice to have, e.g. to speed up dup-finding in untrusted data like in VM images.
  • supercat
    supercat almost 6 years
    When using two files on one physical drive, using a hash function that can keep up with the I/O speed on each file separately may be slightly faster than comparing the files, since there would be no need to switch between reading the two files. The place hashes really shine, though, is when trying to do comparisons involving many files that are too large to fit in memory. Even if you merely want to find out if they all match, comparing file 1 to file 2, then file 1 to file 3, then file 1 to file 4, etc. may be almost twice as slow as computing all their hashes.
  • Andrew Henle
    Andrew Henle almost 6 years
    @supercat If the files are read in chunks larger than a MB or so, the switching between files won't be noticeable. And if a work flow involves comparing a bunch of files to find duplicates, the hash might as well be computed as each file is written - since doing it then can pretty much be done for free.
  • Andrew Henle
    Andrew Henle almost 6 years
    If two files have the same MD5 hash, and they haven't both been specially crafted, then they're identical. That is incorrect. There are an infinity of possible messages, yet there are only 2^64 possible 64-bit hashes. It's called the "pigeonhole principle": " the pigeonhole principle states that if n items are put into m containers, with n > m, then at least one container must contain more than one item." If you create more than 2^64 messages you will have collisions without any "special crafting". And you might with just 2.
  • Charles Duffy
    Charles Duffy almost 6 years
    @AndrewHenle, MD5 is not 64 bits, it's 128. If generating an accidental collision gets us into heat-death-of-the-universe timescales, it's "possible" only for an extremely academic (hence useless) definition thereof.
  • Andrew Henle
    Andrew Henle almost 6 years
    @CharlesDuffy You're assuming the hash is randomly distributed. It's not.
  • Charles Duffy
    Charles Duffy almost 6 years
    Being effectively equivalent to random distribution is part of the definition of what constitutes a good cryptographic hash -- you have a lot of rounds of mixing for a reason. Certainly, there are weak hash algorithms, but focusing on those weaknesses gets us into the previously-stated caveats around intentional attacks. (Or are you saying that MD5 has been shown to only have 64 bits that are effectively random? I'll admit that I haven't been keeping up, so that's plausible -- link please?)
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 6 years
    @AndrewHenle I do not state that a collision is mathematically impossible, which would be wrong, but not relevant here. I state that it has not happened, which is true. Your comment is incorrect in a way that completely changes the deal. There are 2^128 possible MD5 hashes, not 2^64. This means you would need to generate 2^128 hashes to be certain to generate a collision. Actually, by the birthday paradox, 2^64 would give you a macroscopic chance of a collision between the hashes you generated (not with a previously-generated hash). But this is moot since we know how to craft collision.
  • Andrew Henle
    Andrew Henle almost 6 years
    If two files have the same MD5 hash, and they haven't both been specially crafted, then they're identical. That is a statement claiming the only way to get a collision is to "specially craft" one. That is wrong. Collisions are possible, no matter how unlikely. I state that it has not happened, which is true. Again, you're wrong. Collisions can happen. I've been witness to one.
  • Andrew Henle
    Andrew Henle almost 6 years
    @CharlesDuffy Or are you saying that MD5 has been shown to only have 64 bits that are effectively random? I changed 128 to 64 in order to fit the post limit - it was that tight. The principle remains the same, though.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' almost 6 years
    @AndrewHenle If you've been a witness to an accidental MD5 collision, please publish it. You'd be the first. ”I changed 128 to 64 in order to fit the post limit - it was that tight. The principle remains the same, though.” You what? No, 2^128 is so many orders of magnitude larger than 2^64 that it makes a qualitative difference. It's the difference between ”run your computer for a while“ and “not in your lifetime”.
  • supercat
    supercat almost 6 years
    If one has enough space to buffer large chunks of the files, the switching times need not be a problem, but otherwise they might be. As for computing the hashes when files are written, that may be fine if one could guarantee that files could not be modified without changing or at least invalidating stored hashes. If one is trying to avoid backing up files redundantly, looking only at stored hash values may cause one to back up an accidentally-corrupted file but not bother to back up the non-corrupted files which the corrupted file should match but doesn't.
  • Jason
    Jason almost 6 years
    In industrial applications where we need to be quite certain that data or application files are the same as expected, we compare the hash and the file size with expected values. I don't know whether comparing the file sizes adds a layer of security, but it's standard practice.
  • Thomas Weller
    Thomas Weller almost 6 years
    This won't work due to additional data stored in Office files. You would need to e.g. put the cursor in the same cell before saving, save at the exact same time, etc. But even then, XLSX files are ZIP files internally, so if that algorithm stores the individual files in a different order (for whatever purpose), the content is identical but the hash isn't.
  • Thomas Weller
    Thomas Weller almost 6 years
    "If the hashes are different, it does mean that the contents are different." Not necessarily. XLSX files are ZIP files and it would be possible to have the same content stored in different file order.
  • Thomas Weller
    Thomas Weller almost 6 years
    Quite a link-only answer, but interesting.
  • Thomas Weller
    Thomas Weller almost 6 years
    "Once you find a difference, you know the files aren't identical" - not necessarily. XLSX files are ZIP files which potentially could store the content in different order still havng the same content. But even if you decompress them and compare each individual file, the XLSX file contains XML documents which might have e.g. different line endings without affecting the content.
  • Thomas Weller
    Thomas Weller almost 6 years
    @TiStrga: Correct. And remember that XML files may differ in whitespaces, attribute order, namespace prefixes, encoding etc. but still be the same.
  • mckenzm
    mckenzm over 5 years
    If you wanted to check a range of cells, you could accumulate a tree hash, but you would have to be keen.