How to compare 2 files fast using .NET?
Solution 1
A checksum comparison will most likely be slower than a byte-by-byte comparison.
In order to generate a checksum, you'll need to load and process every byte of the first file, and then do the same for the second file. That processing will almost certainly be slower than the comparison check itself.
As for generating a checksum: You can do this easily with the cryptography classes. Here's a short example of generating an MD5 checksum with C#.
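A minimal sketch of that (the helper name is my own; it returns the hash as a lowercase hex string):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class ChecksumExample
{
    // Compute the MD5 checksum of a file and return it as a hex string.
    public static string GetMD5Checksum(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] hash = md5.ComputeHash(stream);
            // BitConverter produces "5D-41-40-..."; strip the dashes for a plain hex string.
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```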
However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file, and you're checking to see whether a new file is the same as the existing one, pre-computing the checksum of your "existing" file means you only need to do the disk I/O once, on the new file. This would likely be faster than a byte-by-byte comparison.
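A sketch of that pre-computed approach (the method name and calling pattern are my own, for illustration):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class PrecomputedHashCheck
{
    // Compare a candidate file against a checksum computed earlier for the
    // "existing" file, so only the candidate has to be read from disk now.
    public static bool MatchesKnownChecksum(string candidatePath, byte[] knownHash)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(candidatePath))
        {
            return md5.ComputeHash(stream).SequenceEqual(knownHash);
        }
    }
}
```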
Solution 2
The slowest possible method is to compare two files byte by byte. The fastest I've been able to come up with is a similar comparison, but instead of one byte at a time, you use a byte array the size of an Int64 (8 bytes) and compare the resulting numbers.
Here's what I came up with:
const int BYTES_TO_READ = sizeof(Int64);

static bool FilesAreEqual(FileInfo first, FileInfo second)
{
    if (first.Length != second.Length)
        return false;

    if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
        return true;

    int iterations = (int)Math.Ceiling((double)first.Length / BYTES_TO_READ);

    using (FileStream fs1 = first.OpenRead())
    using (FileStream fs2 = second.OpenRead())
    {
        byte[] one = new byte[BYTES_TO_READ];
        byte[] two = new byte[BYTES_TO_READ];

        for (int i = 0; i < iterations; i++)
        {
            // Note: this assumes FileStream.Read fills the buffer except at
            // end-of-file. On the final, partial read the leftover bytes in
            // both buffers come from the previous (equal) iteration, so the
            // comparison still behaves correctly.
            fs1.Read(one, 0, BYTES_TO_READ);
            fs2.Read(two, 0, BYTES_TO_READ);

            if (BitConverter.ToInt64(one, 0) != BitConverter.ToInt64(two, 0))
                return false;
        }
    }

    return true;
}
In my testing, I was able to see this outperform a straightforward ReadByte() scenario by almost 3:1. Averaged over 1000 runs, I got this method at 1063ms, and the method below (straightforward byte by byte comparison) at 3031ms. Hashing always came back sub-second at around an average of 865ms. This testing was with an ~100MB video file.
Here are the ReadByte and hashing methods I used, for comparison purposes:
static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
{
    if (first.Length != second.Length)
        return false;

    if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
        return true;

    using (FileStream fs1 = first.OpenRead())
    using (FileStream fs2 = second.OpenRead())
    {
        for (int i = 0; i < first.Length; i++)
        {
            if (fs1.ReadByte() != fs2.ReadByte())
                return false;
        }
    }

    return true;
}
static bool FilesAreEqual_Hash(FileInfo first, FileInfo second)
{
    // Dispose the streams and hash algorithm so the files are released promptly.
    byte[] firstHash, secondHash;
    using (var md5 = MD5.Create())
    {
        using (var fs = first.OpenRead())
            firstHash = md5.ComputeHash(fs);
        using (var fs = second.OpenRead())
            secondHash = md5.ComputeHash(fs);
    }

    for (int i = 0; i < firstHash.Length; i++)
    {
        if (firstHash[i] != secondHash[i])
            return false;
    }

    return true;
}
Solution 3
If you *do* decide you truly need a full byte-by-byte comparison (see other answers for discussion of hashing), then the easiest solution is:
• for `System.String` path names:
public static bool AreFileContentsEqual(String path1, String path2) =>
    File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));
• for `System.IO.FileInfo` instances:
public static bool AreFileContentsEqual(FileInfo fi1, FileInfo fi2) =>
    fi1.Length == fi2.Length &&
    (fi1.Length == 0L || File.ReadAllBytes(fi1.FullName).SequenceEqual(
                         File.ReadAllBytes(fi2.FullName)));
Unlike some other posted answers, this is conclusively correct for any kind of file: binary, text, media, executable, etc. But as a full binary comparison, files that differ only in "unimportant" ways (such as BOM, line endings, character encoding, media metadata, whitespace, padding, source code comments, etc.; see note 1) will always be considered not-equal.
This code loads both files into memory entirely, so it should not be used for comparing truly gigantic files. Beyond that important caveat, full loading isn't really a penalty given the design of the .NET GC (which is fundamentally optimized to keep small, short-lived allocations extremely cheap), and it could even be optimal when file sizes are expected to be less than 85K, because using a minimum of user code (as shown here) means maximally delegating file performance issues to the CLR, BCL, and JIT, benefiting from (e.g.) the latest design technology, system code, and adaptive runtime optimizations.
Furthermore, for such workaday scenarios, concerns about the performance of byte-by-byte comparison via LINQ enumerators (as shown here) are moot, since hitting the disk *at all* for file I/O will dwarf, by several orders of magnitude, the benefits of the various memory-comparison alternatives. For example, even though `SequenceEqual` does in fact give us the "optimization" of abandoning on the first mismatch, this hardly matters once the files' contents have already been fetched, each fully necessary for any true-positive case.
Note 1: An obscure exception: NTFS alternate data streams are not examined by any of the answers discussed on this page and thus may differ for files otherwise considered the "same."
Solution 4
In addition to Reed Copsey's answer:
The worst case is where the two files are identical. In this case it's best to compare the files byte-by-byte.
If the two files are not identical, you can speed things up a bit by detecting sooner that they're not identical.
For example, if the two files are of different length then you know they cannot be identical, and you don't even have to compare their actual content.
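The cheap rejections described above can be sketched as a guard in front of the expensive comparison (the wrapper name is my own; `contentCompare` stands in for any of the byte-level methods from the other answers):

```csharp
using System;
using System.IO;

static class QuickRejectExample
{
    // Cheap checks first: different lengths can never be equal, and the same
    // path is trivially equal; only then fall back to reading content.
    public static bool FilesAreEqualWithQuickChecks(
        FileInfo first, FileInfo second,
        Func<FileInfo, FileInfo, bool> contentCompare)
    {
        if (first.Length != second.Length)
            return false;   // sizes differ: no need to read any content

        if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
            return true;    // same file on disk

        return contentCompare(first, second);   // only now pay for the I/O
    }
}
```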
Solution 5
It gets even faster if you don't read in small 8-byte chunks, but instead put a loop around it and read a larger chunk. This reduced my average comparison time to about a quarter.
public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
{
    bool result;

    if (fileInfo1.Length != fileInfo2.Length)
    {
        result = false;
    }
    else
    {
        using (var file1 = fileInfo1.OpenRead())
        using (var file2 = fileInfo2.OpenRead())
        {
            result = StreamsContentsAreEqual(file1, file2);
        }
    }

    return result;
}
private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
{
    const int bufferSize = 1024 * sizeof(Int64);
    var buffer1 = new byte[bufferSize];
    var buffer2 = new byte[bufferSize];

    while (true)
    {
        // Caution: Stream.Read may return fewer bytes than requested even
        // before end-of-stream, so comparing the two counts is only reliable
        // for streams (like FileStream) that fill the buffer except at EOF.
        int count1 = stream1.Read(buffer1, 0, bufferSize);
        int count2 = stream2.Read(buffer2, 0, bufferSize);

        if (count1 != count2)
        {
            return false;
        }

        if (count1 == 0)
        {
            return true;
        }

        int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
        for (int i = 0; i < iterations; i++)
        {
            if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) !=
                BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
            {
                return false;
            }
        }
    }
}
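As several comments below point out, `Stream.Read` is allowed to return fewer bytes than requested even before end-of-stream, which makes a raw count comparison unreliable for arbitrary streams. A sketch of a fill-the-buffer helper that avoids the problem (the extension-method name is my own):

```csharp
using System.IO;

static class StreamExtensions
{
    // Keep calling Read until the buffer is full or the stream ends;
    // returns the number of bytes actually placed in the buffer. Calling
    // this instead of a single Read makes the count comparison sound even
    // for streams that return short reads mid-stream.
    public static int ReadFully(this Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
                break;  // end of stream
            total += read;
        }
        return total;
    }
}
```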
Robin Rodricks
Updated on July 08, 2022

Comments
- Robin Rodricks, almost 2 years ago: Typical approaches recommend reading the binary via FileStream and comparing it byte-by-byte.
  - Would a checksum comparison such as CRC be faster?
  - Are there any .NET libraries that can generate a checksum for a file?
- Henk Holterman, almost 15 years ago: To be complete: the other big gain is stopping as soon as the bytes at one position are different.
- Feidex, almost 15 years ago: @Henk: I thought this was too obvious :-)
- Reed Copsey, almost 15 years ago: Good point on adding this. It was obvious to me, so I didn't include it, but it's good to mention.
- RandomInsano, over 12 years ago: Would disk thrashing cause problems here?
- RandomInsano, over 12 years ago: Wouldn't you also need to store both files in memory?
- RandomInsano, over 12 years ago: Array.Equals goes deeper into the system, so it will likely be a lot faster than going byte by byte in C#. I can't speak for Microsoft, but deep down, Mono uses C's memcpy() for array equality. Can't get much faster than that.
- porges, over 12 years ago: In general the check `count1 != count2` isn't correct. `Stream.Read()` can return less than the count you have provided, for various reasons.
- Kim, about 12 years ago: Make sure to take into account where your files are located. If you're comparing local files to a back-up half-way across the world (or over a network with horrible bandwidth), you may be better off hashing first and sending a checksum over the network instead of sending a stream of bytes to compare.
- digEmAll, over 9 years ago: @ReedCopsey: I'm having a similar problem, since I need to store input/output files produced by several elaborations that are supposed to contain a lot of duplications. I thought to use a precomputed hash, but do you think I can reasonably assume that if two (e.g. MD5) hashes are equal, the two files are equal, and avoid further byte-by-byte comparison? As far as I know, MD5/SHA1 etc. collisions are really unlikely...
- Reed Copsey, over 9 years ago: @digEmAll Collision chance is low - you can always do a stronger hash, though - i.e. use SHA256 instead of SHA1, which will reduce the likelihood of collisions further.
- causa prima, over 9 years ago: Use a larger hash and you can get the odds of a false positive well below the odds that the computer erred while doing the test.
- causa prima, over 9 years ago: I disagree about the hash time vs. seek time. You can do a lot of calculations during a single head seek. If the odds are high that the files match, I would use a hash with a lot of bits. If there's a reasonable chance of a match, I would compare them a block at a time, say in 1MB blocks. (Pick a block size that 4k divides evenly, to ensure you never split sectors.)
- anindis, over 9 years ago: You made my life easier. Thank you.
- chsh, over 9 years ago: @anindis: For completeness, you may want to read both @Lars' answer and @RandomInsano's answer. Glad it helped so many years on though! :)
- Doug Clutter, about 9 years ago: Using Array.Equals is a bad idea because it compares the whole array. It is likely that at least one block read will not fill the whole array.
- RandomInsano, about 9 years ago: Why is comparing the whole array a bad idea? Why would a block read not fill the array? There's definitely a good tuning point, but that's why you play with the sizes. Extra points for doing the comparison in a separate thread.
- Doug Clutter, about 9 years ago: When you define a byte array, it will have a fixed length (e.g. var buffer = new byte[4096]). When you read a block from the file, it may or may not return the full 4096 bytes - for instance, if the file is only 3000 bytes long.
- RandomInsano, about 9 years ago: Ah, now I understand! Good news is the read will return the number of bytes loaded into the array, so if the array can't be filled, there will be data. Since we're testing for equality, old buffer data won't matter. Docs: msdn.microsoft.com/en-us/library/9kstw824(v=vs.110).aspx
- RandomInsano, about 9 years ago: Also important, my recommendation to use the Equals() method is a bad idea. In Mono, they do a memory compare since the elements are contiguous in memory. Microsoft, however, doesn't override it, instead doing only a reference comparison, which here would always be false.
- Ian Mercer, about 9 years ago: The `FilesAreEqual_Hash` method should have a `using` on both file streams too, like the `ReadByte` method, otherwise it will hang on to both files.
- BenKoshy, over 8 years ago: Thanks for your answer - I'm just getting into .NET. I'm assuming that if one is using the hashcode/checksum technique, the hashes of the main folder will be stored persistently somewhere? Out of curiosity, how would you store it for a WPF application - what would you do? (I'm currently looking at XML, text files or databases.)
- Palec, over 8 years ago: Note that `FileStream.Read()` may actually read fewer bytes than the requested number. You should use `StreamReader.ReadBlock()` instead.
- Glenn Slayden, about 8 years ago: To explain @Guffa's figure 99.99999998%, it comes from computing `1 - (1 / (2^32))`, which is the probability that any single file will have some given 32-bit hash. The probability of two different files having the same hash is the same, because the first file provides the "given" hash value, and we only need to consider whether or not the other file matches that value. The chances with 64- and 128-bit hashing decrease to 99.999999999999999994% and 99.9999999999999999999999999999999999997% (respectively), as if that matters with such unfathomable numbers.
- TheLegendaryCopyCoder, over 7 years ago: Physical disk drives, yes; SSDs would handle this.
- SQL Police, about 7 years ago: @RandomInsano guess you mean memcmp(), not memcpy()
- crokusek, about 7 years ago: In the Int64 version, when the stream length is not a multiple of Int64, the last iteration compares the unfilled bytes using the previous iteration's fill (which should also be equal, so it's fine). Also, if the stream length is less than sizeof(Int64), then the unfilled bytes are 0, since C# initializes arrays. IMO, the code should probably comment on these oddities.
- Krypto_47, about 7 years ago: This one does not look good for big files - not good for memory usage, since it will read both files up to the end before starting to compare the byte arrays. That is why I would rather go for a stream reader with a buffer.
- Glenn Slayden, about 7 years ago: @Krypto_47 I discussed these factors and the appropriate use in the text of my answer.
- Dan Bechard, about 7 years ago: It's worth noting that, as long as you don't override it in the constructor, the default size of FileStream's internal buffer is already a large block (4096 bytes). While there's a bit of overhead in calling Read(), it's not actually hitting the disk every time. referencesource.microsoft.com/#mscorlib/system/io/…
- Dan Bechard, about 7 years ago: @chsh I modified your code slightly to add a short-circuit case for when the two files being compared are the same file (in case the caller forgets to check). This shouldn't affect the comparative performance stats in any measurable way for a 100MB file.
- Glenn Slayden, over 6 years ago: ...Indeed, the fact that these numbers are harder for most people to grasp than the putatively simple notion, albeit true, of "infinitely many files colliding into the same hash code" may explain why humans are unreasonably suspicious of accepting hash-as-equality.
- DaedalusAlpha, over 6 years ago: Note that File also has the function ReadAllBytes, which can use SequenceEquals as well, so use that instead, as it works on all files. And as @RandomInsano said, this is stored in memory, so while it's perfectly fine to use for small files, I would be careful using it with large files.
- Andrew Arnott, over 6 years ago: I modified the code above to allow for case insensitivity so it doesn't throw off the filename fast path. But there is still a bug, as @Palec called out. @crokusek suggested this is fine, but it really isn't. It's not just when the stream ends that the `Read` method may return fewer bytes than requested. The contract is that `Read` must not return 0 bytes unless the end of the stream is reached, but mid-stream it can also return (0, max] bytes, which buffering streams may actually do. So it really is important to make sure to consider that case.
- IS4, about 6 years ago: @DaedalusAlpha It returns an enumerable, so the lines will be loaded on demand and not stored in memory the whole time. ReadAllBytes, on the other hand, does return the whole file as an array.
- frogpelt, about 5 years ago: You say `if (i >= secondHash.Length ...` - under what circumstances would two MD5 hashes be different lengths?
- Simon, over 4 years ago: Wouldn't the BitConverter bit be better as `for (var i = 0; i < count; i += sizeof(long)) { if (BitConverter.ToInt64(buffer1, i) != BitConverter.ToInt64(buffer2, i)) { return false; } }`?
- ghd, over 2 years ago: As noted by @Palec in a previous comment, the Int64 version may fail to give correct results since `FileStream.Read()` can read less than the requested number of bytes.