What is the fastest way to create a checksum for large files in C#

Solution 1

The problem here is that SHA256Managed reads 4096 bytes at a time (inherit from FileStream and override Read(byte[], int, int) to see how much it reads from the FileStream), which is too small a buffer for disk IO.

To speed things up (2 minutes for hashing a 2 GB file on my machine with SHA256, 1 minute for MD5), wrap the FileStream in a BufferedStream and set a reasonably large buffer size (I tried a ~1 MB buffer):

// Not sure if BufferedStream should be wrapped in using block
using(var stream = new BufferedStream(File.OpenRead(filePath), 1200000))
{
    // The rest remains the same
}
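
For reference, here is a minimal sketch of the whole method with the wrapper applied (the method name, the 1 MB buffer size and the use of SHA256.Create() are illustrative choices, not part of the original answer):

// Sketch: the question's GetChecksum with a BufferedStream in front of the FileStream.
private static string GetChecksumBuffered(string filePath)
{
    using (var stream = new BufferedStream(File.OpenRead(filePath), 1024 * 1024))
    using (var sha = SHA256.Create())
    {
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}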

Solution 2

Don't checksum the entire file; create a checksum every 100 MB or so, so that each file has a collection of checksums.

Then when comparing checksums, you can stop comparing after the first different checksum, getting out early, and saving you from processing the entire file.

It'll still take the full time for identical files.
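
A possible sketch of this idea (the 100 MB chunk size, the method name and the choice of MD5 are assumptions for illustration, not from the answer):

// Hypothetical sketch: hash a file in fixed-size chunks and collect one checksum per chunk.
// Requires System.Collections.Generic, System.IO and System.Security.Cryptography.
private static List<byte[]> GetChunkChecksums(string filePath, int chunkSize = 100 * 1024 * 1024)
{
    var checksums = new List<byte[]>();
    using (var stream = File.OpenRead(filePath))
    using (var md5 = MD5.Create())
    {
        var buffer = new byte[chunkSize];
        int bytesInChunk;
        do
        {
            // Fill the chunk completely so chunk boundaries are deterministic
            // (Stream.Read may return fewer bytes than requested).
            bytesInChunk = 0;
            int read;
            while (bytesInChunk < chunkSize &&
                   (read = stream.Read(buffer, bytesInChunk, chunkSize - bytesInChunk)) > 0)
            {
                bytesInChunk += read;
            }
            if (bytesInChunk > 0)
                checksums.Add(md5.ComputeHash(buffer, 0, bytesInChunk));
        } while (bytesInChunk == chunkSize);
    }
    return checksums;
}

Two files can then be compared chunk by chunk, and the comparison stopped at the first mismatching checksum.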

Solution 3

As Anton Gogolev noted, FileStream reads 4096 bytes at a time by default, but you can specify any other value using the FileStream constructor:

new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 16 * 1024 * 1024)
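
For example, the larger-buffered stream can be fed straight into ComputeHash (the 16 MB buffer, the file variable and the choice of MD5 here are only illustrative):

// Illustrative: open the file with a 16 MB buffer and hash it directly.
using (var stream = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 16 * 1024 * 1024))
using (var md5 = MD5.Create())
{
    byte[] checksum = md5.ComputeHash(stream);
    // convert to a hex string, compare, etc.
}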

Note that Brad Abrams from Microsoft wrote in 2004:

there is zero benefit from wrapping a BufferedStream around a FileStream. We copied BufferedStream’s buffering logic into FileStream about 4 years ago to encourage better default performance

source

Solution 4

Invoke the Windows port of md5sum.exe. It's about twice as fast as the .NET implementation (at least on my machine, using a 1.2 GB file).

public static string Md5SumByProcess(string file)
{
    using (var p = new Process())
    {
        p.StartInfo.FileName = "md5sum.exe";
        p.StartInfo.Arguments = file;
        p.StartInfo.UseShellExecute = false;
        p.StartInfo.RedirectStandardOutput = true;
        p.Start();
        // Read the output before waiting so the child cannot block on a full output pipe.
        string output = p.StandardOutput.ReadToEnd();
        p.WaitForExit();
        // md5sum prints "<digest> <filename>"; the original code skips the first character
        // of that output, presumably because this particular port prefixes the digest.
        return output.Split(' ')[0].Substring(1).ToUpper();
    }
}
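
Usage would then simply be as below (the path is illustrative; md5sum.exe has to be on the PATH or next to the executable, and a path containing spaces would need quoting in Arguments):

// Example call; returns the upper-case MD5 hex digest of the file.
string hash = Md5SumByProcess(@"C:\data\bigfile.iso");
Console.WriteLine(hash);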

Solution 5

OK - thanks to all of you - let me wrap this up:

  1. Using a "native" exe to do the hashing took the time from 6 minutes down to 10 seconds, which is huge.
  2. Increasing the buffer was even faster - the 1.6 GB file took 5.2 seconds using MD5 in .NET, so I will go with this solution - thanks again!

Comments

  • crono
    crono over 4 years

    I have to sync large files across some machines. The files can be up to 6 GB in size. The sync will be done manually every few weeks. I can't take the filenames into consideration because they can change anytime.

    My plan is to create checksums on the destination PC and on the source PC and then copy all files with a checksum, which are not already in the destination, to the destination. My first attempt was something like this:

    using System;
    using System.IO;
    using System.Security.Cryptography;
    
    private static string GetChecksum(string file)
    {
        using (FileStream stream = File.OpenRead(file))
        {
            SHA256Managed sha = new SHA256Managed();
            byte[] checksum = sha.ComputeHash(stream);
            return BitConverter.ToString(checksum).Replace("-", String.Empty);
        }
    }
    

    The problem was the runtime:
    - with SHA256 and a 1.6 GB file -> 20 minutes
    - with MD5 and a 1.6 GB file -> 6.15 minutes

    Is there a better - faster - way to get the checksum (maybe with a better hash function)?

  • crono
    crono almost 15 years
    WOW - using md5sums.exe from pc-tools.net/win32/md5sums makes it really fast. 1681457152 bytes, 8672 ms = 184.91 MB/sec -> 1.6 GB ~ 9 seconds. This will be fast enough for my purpose.
  • crono
    crono almost 15 years
    Yes - I will try to increase the buffer - like Anton Gogolev suggested. I ran it through a "native" MD5.exe, which took 9 seconds with a 1.6 GB file.
  • crono
    crono almost 15 years
    I like the idea, but it will not work in my scenario because I will end up with a lot of unchanged files over time.
  • crono
    crono almost 15 years
    OK - this made the difference - hashing the 1.6 GB file with MD5 took 5.2 seconds on my box (quad core @ 2.6 GHz, 8 GB RAM) - even faster than the native implementation...
  • Christian Casutt
    Christian Casutt over 14 years
    I don't get it. I just tried this suggestion but the difference is minimal to nothing: a 1024 MB file takes 12-14 secs without buffering, and also 12-14 secs with buffering. I understand that reading hundreds of 4k blocks will produce more IO, but I ask myself whether the framework or the native APIs below the framework don't handle this already...
  • buddybubble
    buddybubble almost 10 years
    I don't get it. How can this test contradict the accepted answer from Anton Gogolev?
  • Jmoney38
    Jmoney38 almost 10 years
    A very pessimistic view... ;-)
  • videoguy
    videoguy over 8 years
    Can you add a description of each field in your data?
  • Smith
    Smith almost 8 years
    How do you checksum every 100 MB of a file?
  • Reyhn
    Reyhn over 7 years
    A little late to the party, but for FileStreams there is no longer any need to wrap the stream in a BufferedStream as it is nowadays already done in the FileStream itself. Source
  • Taegost
    Taegost almost 7 years
    I was just going through this issue with smaller files (<10 MB, but taking forever to get an MD5). Even though I use .NET 4.5, switching to this method with the BufferedStream cut the hash time down from about 8.6 seconds to <300 ms for an 8.6 MB file.
  • Tomer W
    Tomer W almost 7 years
    Very smart, I like your way of thought... will use that in my duplicate killer app :)
  • Hugo Woesthuis
    Hugo Woesthuis over 6 years
    I used a BufferedStream w/ 512 kB instead of 1024 kB. The 1.8 GB file was hashed in 30 seconds.
  • Sarthak Mittal
    Sarthak Mittal over 6 years
    How about CRC32? I know the chances of collision will probably increase, but if we are willing to overlook that aspect, is it faster than MD5/SHA?
  • stricq
    stricq about 6 years
    I set the buffer size to the same size as the file being hashed. Hashing almost 33,000 files went from about 30 minutes to under 1 minute. The hash loop is run in parallel and the max file size is 128 Megabytes but the smallest is just a few Kilobytes. RAM usage dropped considerably compared to using a fixed buffer size.
  • b.kiener
    b.kiener almost 6 years
    Not a good idea when using a checksum for security reasons, because an attacker can just change the bytes you have excluded.
  • Nathan Goings
    Nathan Goings almost 6 years
    +1 This is an excellent idea when you are performing a one-to-one comparison. Unfortunately, I'm using the MD5 hash as an index to look for unique files among many duplicates (many-to-many checks).
  • Soroush Falahati
    Soroush Falahati over 5 years
    @b.kiener No byte is excluded. You misunderstood him.