How to compare 2 files fast using .NET?


Solution 1

A checksum comparison will most likely be slower than a byte-by-byte comparison.

In order to generate a checksum, you'll need to load each byte of the file, and perform processing on it. You'll then have to do this on the second file. The processing will almost definitely be slower than the comparison check.

As for generating a checksum: You can do this easily with the cryptography classes. Here's a short example of generating an MD5 checksum with C#.
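A minimal sketch of that, using the `MD5` class from `System.Security.Cryptography` (the helper name and hex formatting are illustrative choices, not a fixed API):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static string ComputeMd5Checksum(string path)
    {
        // MD5.Create() returns a disposable HashAlgorithm; ComputeHash reads the
        // stream to the end and returns the 16-byte digest.
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }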

However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file, and you're checking to see if a new file is the same as the existing one, pre-computing the checksum of your "existing" file means you only need to do the disk I/O once, on the new file. That would likely be faster than a byte-by-byte comparison.
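A rough sketch of that pre-computed strategy, reusing the `ComputeMd5Checksum` helper sketched above (the stored checksum is assumed to have been saved somewhere earlier):

    static bool NewFileMatchesBase(string storedBaseChecksum, string newFilePath)
    {
        // Only the new file has to be read from disk; the base file's checksum
        // was computed (and persisted) earlier.
        string newFileChecksum = ComputeMd5Checksum(newFilePath);
        return string.Equals(storedBaseChecksum, newFileChecksum, StringComparison.OrdinalIgnoreCase);
    }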

Solution 2

The slowest possible method is to compare two files byte by byte. The fastest I've been able to come up with is a similar comparison, but instead of reading one byte at a time, you read an array of bytes sized to Int64 and then compare the resulting numbers.

Here's what I came up with:

    const int BYTES_TO_READ = sizeof(Int64);

    static bool FilesAreEqual(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
            return true;

        int iterations = (int)Math.Ceiling((double)first.Length / BYTES_TO_READ);

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            byte[] one = new byte[BYTES_TO_READ];
            byte[] two = new byte[BYTES_TO_READ];

            for (int i = 0; i < iterations; i++)
            {
                fs1.Read(one, 0, BYTES_TO_READ);
                fs2.Read(two, 0, BYTES_TO_READ);

                if (BitConverter.ToInt64(one, 0) != BitConverter.ToInt64(two, 0))
                    return false;
            }
        }

        return true;
    }

In my testing, I was able to see this outperform a straightforward ReadByte() scenario by almost 3:1. Averaged over 1000 runs, I got this method at 1063ms, and the method below (straightforward byte by byte comparison) at 3031ms. Hashing always came back sub-second at around an average of 865ms. This testing was with an ~100MB video file.

Here are the ReadByte and hashing methods I used, for comparison purposes:

    static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
            return true;

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            for (int i = 0; i < first.Length; i++)
            {
                if (fs1.ReadByte() != fs2.ReadByte())
                    return false;
            }
        }

        return true;
    }

    static bool FilesAreEqual_Hash(FileInfo first, FileInfo second)
    {
        byte[] firstHash = MD5.Create().ComputeHash(first.OpenRead());
        byte[] secondHash = MD5.Create().ComputeHash(second.OpenRead());

        for (int i=0; i<firstHash.Length; i++)
        {
            if (firstHash[i] != secondHash[i])
                return false;
        }
        return true;
    }
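
As a comment below points out, FilesAreEqual_Hash never disposes the file streams it opens. A sketch of the same comparison with deterministic disposal (the method name is illustrative; the logic is otherwise unchanged):

    static bool FilesAreEqual_HashDisposed(FileInfo first, FileInfo second)
    {
        using (var md5 = MD5.Create())
        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            // ComputeHash resets the algorithm after each call, so one instance can hash both streams.
            byte[] firstHash = md5.ComputeHash(fs1);
            byte[] secondHash = md5.ComputeHash(fs2);

            for (int i = 0; i < firstHash.Length; i++)
            {
                if (firstHash[i] != secondHash[i])
                    return false;
            }
            return true;
        }
    }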

Solution 3

If you *do* decide you truly need a full byte-by-byte comparison (see other answers for discussion of hashing), then the easiest solution is:


• for `System.String` path names:

    public static bool AreFileContentsEqual(String path1, String path2) =>
        File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));

• for `System.IO.FileInfo` instances:

    public static bool AreFileContentsEqual(FileInfo fi1, FileInfo fi2) =>
        fi1.Length == fi2.Length &&
        (fi1.Length == 0L || File.ReadAllBytes(fi1.FullName).SequenceEqual(
                             File.ReadAllBytes(fi2.FullName)));

Unlike some other posted answers, this is conclusively correct for any kind of file: binary, text, media, executable, etc. But as a full binary comparison, files that differ only in "unimportant" ways (such as BOM, line endings, character encoding, media metadata, whitespace, padding, source code comments, etc.; see note 1) will always be considered not-equal.

This code loads both files into memory entirely, so it should not be used for comparing truly gigantic files. Beyond that important caveat, full loading isn't really a penalty given the design of the .NET GC (which is fundamentally optimized to keep small, short-lived allocations extremely cheap), and it could even be optimal when file sizes are expected to be less than 85K, because using a minimum of user code (as shown here) means maximally delegating the file-performance issues to the CLR, BCL, and JIT, which benefit from (e.g.) the latest design technology, system code, and adaptive runtime optimizations.

Furthermore, for such workaday scenarios, concerns about the performance of byte-by-byte comparison via LINQ enumerators (as shown here) are moot, since hitting the disk *at all* for file I/O will dwarf, by several orders of magnitude, the benefits of the various memory-comparing alternatives. For example, even though `SequenceEqual` does in fact give us the "optimization" of abandoning on the first mismatch, this hardly matters after having already fetched the files' contents, each fully necessary for any true-positive case.



1. An obscure exception: NTFS alternate data streams are not examined by any of the answers discussed on this page and thus may differ for files otherwise considered the "same."

Solution 4

In addition to Reed Copsey's answer:

  • The worst case is where the two files are identical. In this case it's best to compare the files byte-by-byte.

  • If the two files are not identical, you can speed things up a bit by detecting sooner that they're not identical.

For example, if the two files are of different length then you know they cannot be identical, and you don't even have to compare their actual content.
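
A minimal sketch of those early exits wrapped around a content comparison; `ContentsAreEqual` is a placeholder for whichever byte-comparison method from the other answers you prefer:

    static bool FilesAreEqualWithEarlyExit(FileInfo first, FileInfo second)
    {
        // Different lengths: the contents cannot match, so skip reading entirely.
        if (first.Length != second.Length)
            return false;

        // Same path: it's the same file on disk, nothing to compare.
        if (string.Equals(first.FullName, second.FullName, StringComparison.OrdinalIgnoreCase))
            return true;

        // Placeholder for a byte-by-byte (or chunked) comparison such as the ones above.
        return ContentsAreEqual(first, second);
    }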

Solution 5

It gets even faster if you don't read in small 8-byte chunks but instead read a larger chunk and loop over it. This reduced the average comparison time to about 1/4 in my tests.

    public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
    {
        bool result;

        if (fileInfo1.Length != fileInfo2.Length)
        {
            result = false;
        }
        else
        {
            using (var file1 = fileInfo1.OpenRead())
            {
                using (var file2 = fileInfo2.OpenRead())
                {
                    result = StreamsContentsAreEqual(file1, file2);
                }
            }
        }

        return result;
    }

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 1024 * sizeof(Int64);
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            if (count1 != count2)
            {
                return false;
            }

            if (count1 == 0)
            {
                return true;
            }

            int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
            for (int i = 0; i < iterations; i++)
            {
                if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                {
                    return false;
                }
            }
        }
    }
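One caveat, also raised in the comments below: `Stream.Read` is allowed to return fewer bytes than requested even before the end of the stream, so the `count1 != count2` check above is not strictly reliable for every stream type. A sketch of a variant that fills each buffer completely before comparing (helper names are illustrative):

    private static bool StreamsContentsAreEqualSafe(Stream stream1, Stream stream2)
    {
        const int bufferSize = 1024 * sizeof(Int64);
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            // Fill each buffer as far as possible; Read may return short counts mid-stream.
            int count1 = ReadFully(stream1, buffer1);
            int count2 = ReadFully(stream2, buffer2);

            if (count1 != count2)
                return false;   // one stream ended before the other

            if (count1 == 0)
                return true;    // both streams ended at the same point

            for (int i = 0; i < count1; i++)
            {
                if (buffer1[i] != buffer2[i])
                    return false;
            }
        }
    }

    // Keeps calling Read until the buffer is full or the stream is exhausted.
    private static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
                break;
            total += read;
        }
        return total;
    }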
Comments

  • Robin Rodricks
    Robin Rodricks almost 2 years

    Typical approaches recommend reading the binary via FileStream and comparing it byte-by-byte.

    • Would a checksum comparison such as CRC be faster?
    • Are there any .NET libraries that can generate a checksum for a file?
  • Henk Holterman
    Henk Holterman almost 15 years
    To be complete: the other big gain is stopping as soon as the bytes at 1 position are different.
  • Feidex
    Feidex almost 15 years
    @Henk: I thought this was too obvious :-)
  • Reed Copsey
    Reed Copsey almost 15 years
    Good point on adding this. It was obvious to me, so I didn't include it, but it's good to mention.
  • RandomInsano
    RandomInsano over 12 years
    Would disk thrashing cause problems here?
  • RandomInsano
    RandomInsano over 12 years
    Wouldn't you also need to store both files in memory?
  • RandomInsano
    RandomInsano over 12 years
    Array.Equals goes deeper into the system, so it will likely be a lot faster than going byte by byte in C#. I can't speak for Microsoft, but deep down, Mono uses C's memcpy() command for array equality. Can't get much faster than that.
  • porges
    porges over 12 years
    In general the check count1 != count2 isn't correct. Stream.Read() can return less than the count you have provided, for various reasons.
  • Kim
    Kim about 12 years
    Make sure to take into account where your files are located. If you're comparing local files to a back-up half-way across the world (or over a network with horrible bandwidth) you may be better off to hash first and send a checksum over the network instead of sending a stream of bytes to compare.
  • digEmAll
    digEmAll over 9 years
    @ReedCopsey: I'm having a similar problem, since I need to store input/output files produced by several elaborations that are supposed to contain a lot of duplications. I thought to use precomputed hash, but do you think I can reasonably assume that if 2 (e.g. MD5) hash are equal, the 2 files are equal and avoid further byte-2-byte comparison ? As far as I know MD5/SHA1 etc collisions are really unlikely...
  • Reed Copsey
    Reed Copsey over 9 years
    @digEmAll Collision chance is low - you can always do a stronger hash, though - ie: use SHA256 instead of SHA1, which will reduce the likelihood of collisions further.
  • causa prima
    causa prima over 9 years
    Use a larger hash and you can get the odds of a false positive to well below the odds the computer erred while doing the test.
  • causa prima
    causa prima over 9 years
    I disagree about the hash time vs seek time. You can do a lot of calculations during a single head seek. If the odds are high that the files match I would use a hash with a lot of bits. If there's a reasonable chance of a match I would compare them a block at a time, for say 1MB blocks. (Pick a block size that 4k divides evenly to ensure you never split sectors.)
  • anindis
    anindis over 9 years
    You made my life easier. Thank you
  • chsh
    chsh over 9 years
    @anindis: For completeness, you may want to read both @Lars' answer and @RandomInsano's answer. Glad it helped so many years on though! :)
  • Doug Clutter
    Doug Clutter about 9 years
    Using Array.Equals is a bad idea because it compares the whole array. It is likely, at least one block read will not fill the whole array.
  • RandomInsano
    RandomInsano about 9 years
    Why is comparing the whole array a bad idea? Why would a block read not fill the array? There's definitely a good tuning point, but that's why you play with the sizes. Extra points for doing the comparison in a separate thread.
  • Doug Clutter
    Doug Clutter about 9 years
    When you define a byte array, it will have a fixed length. (e.g. - var buffer = new byte[4096]) When you read a block from the file, it may or may not return the full 4096 bytes. For instance, if the file is only 3000 bytes long.
  • RandomInsano
    RandomInsano about 9 years
    Ah, now I understand! Good news is the read will return the number of bytes loaded into the array, so if the array can't be filled, there will be data. Since we're testing for equality, old buffer data won't matter. Docs: msdn.microsoft.com/en-us/library/9kstw824(v=vs.110).aspx
  • RandomInsano
    RandomInsano about 9 years
    Also important, my recommendation to use the Equals() method is a bad idea. In Mono, they do a memory compare since the elements are contiguous in memory. Microsoft however doesn't override it, instead only doing a reference comparison which here would always be false.
  • Ian Mercer
    Ian Mercer about 9 years
    The FilesAreEqual_Hash method should have a using on both file streams too like the ReadByte method otherwise it will hang on to both files.
  • BenKoshy
    BenKoshy over 8 years
    thanks for your answer - i'm just getting into .net. i'm assuming that if one is using the hashcode/check sum technique, then the hashes of the main folder will be stored persistently somewhere? out of curiousity how would you store it for a WPF application - what would you do? (i've currently looking at xml, text files or databases).
  • Palec
    Palec over 8 years
    Note that FileStream.Read() may actually read less bytes than the requested number. You should use StreamReader.ReadBlock() instead.
  • Glenn Slayden
    Glenn Slayden about 8 years
    To explain @Guffa's figure 99.99999998%, it comes from computing 1 - (1 / (2^32)), which is the probability that any single file will have some given 32-bit hash. The probability of two different files having the same hash is the same, because the first file provides the "given" hash value, and we only need to consider whether or not the other file matches that value. The chances with 64- and 128-bit hashing decrease to 99.999999999999999994% and 99.9999999999999999999999999999999999997% (respectively), as if that matters with such unfathomable numbers.
  • TheLegendaryCopyCoder
    TheLegendaryCopyCoder over 7 years
    Physical disk drives yes, SSD's would handle this.
  • SQL Police
    SQL Police about 7 years
    @RandomInsano guess you mean memcmp(), not memcpy()
  • crokusek
    crokusek about 7 years
    In the Int64 version when the stream length is not a multiple of Int64 then the last iteration is comparing the unfilled bytes using previous iteration's fill (which should also be equal so it's fine). Also if the stream length is less than sizeof(Int64) then the unfilled bytes are 0 since C# initializes arrays. IMO, the code should probably comment these oddities.
  • Krypto_47
    Krypto_47 about 7 years
    this one does not look good for big files. not good for memory usage since it will read both files up to the end before starting comparing the byte array. That is why i would rather go for a streamreader with a buffer.
  • Glenn Slayden
    Glenn Slayden about 7 years
    @Krypto_47 I discussed these factors and the appropriate use in the text of my answer.
  • Dan Bechard
    Dan Bechard about 7 years
    It's worth noting that as long as you don't override it in the constructor, the default size of FileStream's internal buffer is already a large block (4096 bytes). While there's a bit of overhead calling Read(), it's not actually hitting the disk every time. referencesource.microsoft.com/#mscorlib/system/io/…
  • Dan Bechard
    Dan Bechard about 7 years
    @chsh I modified your code slightly to add a short-circuit case for when the two files being compared are the same file (in case the caller forgets to check). This shouldn't affect the comparative performance stats in any measurable way for a 100MB file.
  • Glenn Slayden
    Glenn Slayden over 6 years
    ...Indeed, the fact that these numbers are harder for most people to grasp than the putatively simple notion, albeit true, of "infinitely many files colliding into same hash code" may explain why humans are unreasonably suspicious of accepting hash-as-equality.
  • DaedalusAlpha
    DaedalusAlpha over 6 years
    Note that File also has the function ReadAllBytes which can use SequenceEquals as well so use that instead as it would work on all files. And as @RandomInsano said, this is stored in memory so while it's perfectly fine to use for small files I would be careful using it with large files.
  • Andrew Arnott
    Andrew Arnott over 6 years
    I modified the code above to allow for case insensitivity to not throw off the filename fast-path. But there is still a bug as @palec called out. @crokusek suggested this is fine, but it really isn't. It's not just when the stream ends that the Read method may return fewer bytes than requested. The contract is that Read must not return 0 bytes unless the end of the stream is read, but mid-stream, it can also return (0,max] bytes which buffering streams may actually do. So it really is important to make sure to consider that case.
  • IS4
    IS4 about 6 years
    @DaedalusAlpha It returns an enumerable, so the lines will be loaded on-demand and not stored in memory the whole time. ReadAllBytes, on the other hand, does return the whole file as an array.
  • frogpelt
    frogpelt about 5 years
    You say if (i>=secondHash.Length ... Under what circumstances would two MD5 hashes be different lengths?
  • Simon
    Simon over 4 years
    wouldnt the bitconverter bit be better as ``` for (var i = 0; i < count; i+= sizeof(long)) { if (BitConverter.ToInt64(buffer1, i) != BitConverter.ToInt64(buffer2, i)) { return false; } } ```
  • ghd
    ghd over 2 years
    As noted by @Palec in a previous comment, the int64 version may fail to give correct results since FileStream.Read() can read less than the requested number of bytes.