Possible to calculate MD5 (or other) hash with buffered reads?

Solution 1

You use the TransformBlock and TransformFinalBlock methods to process the data in chunks.

// Init
MD5 md5 = MD5.Create();

// For each block (the return value is just the number of bytes copied and can be ignored):
md5.TransformBlock(block, 0, block.Length, block, 0);

// For the last block:
md5.TransformFinalBlock(block, 0, block.Length);

// Get the hash code
byte[] hash = md5.Hash;

Note: It works (at least with the MD5 provider) to send all blocks to TransformBlock and then send an empty block to TransformFinalBlock to finalise the process.
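
As a rough end-to-end sketch of the steps above, assuming the data is read from one Stream and written to another as it goes: the class name ChunkedHash, the method CopyAndHash and the bufferSize parameter are made up for illustration and are not part of the framework. It passes null as the output buffer and finishes with a zero-length final block, as the note describes.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class ChunkedHash
{
    // Hypothetical helper: copies source to destination in chunks and
    // returns the MD5 hash of everything that was copied.
    public static byte[] CopyAndHash(Stream source, Stream destination, int bufferSize)
    {
        byte[] buffer = new byte[bufferSize];
        using (MD5 md5 = MD5.Create())
        {
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                destination.Write(buffer, 0, read);
                // Feed only the bytes actually read; a null output buffer skips the copy.
                md5.TransformBlock(buffer, 0, read, null, 0);
            }
            // A zero-length final block finalises the hash.
            md5.TransformFinalBlock(buffer, 0, 0);
            return md5.Hash;
        }
    }
}
```

The result should match what ComputeHash returns for the same bytes, so comparing the two on a small file is an easy sanity check.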

Solution 2

I like the answer above, but for the sake of completeness, and as a more general solution, consider the CryptoStream class. If you are already handling streams, it is easy to wrap your stream in a CryptoStream, passing a HashAlgorithm as the ICryptoTransform parameter.

using (var file = new FileStream("foo.txt", FileMode.Create, FileAccess.Write))
using (var md5 = MD5.Create())
using (var cs = new CryptoStream(file, md5, CryptoStreamMode.Write))
{
    while (notDoneYet)
    {
        byte[] buffer = Get32MB();
        cs.Write(buffer, 0, buffer.Length);
    }
    // Finalise the hash before reading it.
    cs.FlushFinalBlock();
    System.Console.WriteLine(BitConverter.ToString(md5.Hash));
}

The hash has to be finalised before you read it: call FlushFinalBlock on the CryptoStream (or close the stream) so the HashAlgorithm knows it's done.
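
For a version that can be run as-is, here is a minimal sketch of the same idea, assuming the input comes from an arbitrary source Stream. The names CryptoStreamHash and WriteAndHash and the bufferSize parameter are hypothetical. Because a HashAlgorithm passes its input through unchanged, the destination receives exactly the bytes that were hashed.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class CryptoStreamHash
{
    // Hypothetical helper: writes everything from source to destination through
    // a CryptoStream and returns the MD5 hash of the bytes written.
    // Note: disposing the CryptoStream also closes destination.
    public static byte[] WriteAndHash(Stream source, Stream destination, int bufferSize)
    {
        byte[] buffer = new byte[bufferSize];
        using (MD5 md5 = MD5.Create())
        using (var cs = new CryptoStream(destination, md5, CryptoStreamMode.Write))
        {
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                cs.Write(buffer, 0, read);
            }
            // Tell the hash algorithm that the input is complete.
            cs.FlushFinalBlock();
            return md5.Hash;
        }
    }
}
```

The trade-off compared with calling TransformBlock directly is that the CryptoStream owns the destination stream, so this fits best when the write loop already works against a Stream.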

Solution 3

It seems you can use TransformBlock / TransformFinalBlock, as shown in this sample: Displaying progress updates when hashing large files

Solution 4

Hash algorithms are expected to handle this situation and are typically implemented with 3 functions:

hash_init() - Called to allocate resources and begin the hash.
hash_update() - Called with new data as it arrives.
hash_final() - Complete the calculation and free resources.

Look at http://www.openssl.org/docs/crypto/md5.html or http://www.openssl.org/docs/crypto/sha.html for good, standard examples in C; I'm sure there are similar libraries for your platform.
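
For comparison, newer .NET versions expose the same init / update / final pattern through the IncrementalHash class. It is not available in .NET 3.5, so treat this as a sketch for later frameworks only. The class name IncrementalHashSketch and the method HashChunks are made up for illustration.

```csharp
using System;
using System.Security.Cryptography;

public static class IncrementalHashSketch
{
    // Hypothetical helper: hashes a sequence of chunks as if they were one buffer.
    public static byte[] HashChunks(byte[][] chunks)
    {
        // "init": create the incremental hasher.
        using (IncrementalHash hasher = IncrementalHash.CreateHash(HashAlgorithmName.MD5))
        {
            foreach (byte[] chunk in chunks)
            {
                // "update": append each chunk as it arrives.
                hasher.AppendData(chunk);
            }
            // "final": finish the calculation and return the digest.
            return hasher.GetHashAndReset();
        }
    }
}
```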

Solution 5

I've just had to do something similar, but wanted to read the file asynchronously. It uses TransformBlock and TransformFinalBlock and gives me answers consistent with Azure, so I think it is correct!

private static async Task<string> CalculateMD5Async(string fullFileName)
{
    var block = ArrayPool<byte>.Shared.Rent(8192);
    try
    {
        using (var md5 = MD5.Create())
        {
            using (var stream = new FileStream(fullFileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192, true))
            {
                int length;
                while ((length = await stream.ReadAsync(block, 0, block.Length).ConfigureAwait(false)) > 0)
                {
                    md5.TransformBlock(block, 0, length, null, 0);
                }
                md5.TransformFinalBlock(block, 0, 0);
            }
            var hash = md5.Hash;
            return Convert.ToBase64String(hash);
        }
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(block);
    }
}

Author: Harry

Updated on June 13, 2022

Comments

  • Harry
    Harry about 2 years

    I need to calculate checksums of quite large files (gigabytes). This can be accomplished using the following method:

        private byte[] calcHash(string file)
        {
            using (System.Security.Cryptography.HashAlgorithm ha = System.Security.Cryptography.MD5.Create())
            using (FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read))
            {
                return ha.ComputeHash(fs);
            }
        }
    

    However, the files are normally written just beforehand in a buffered manner (say, writing 32 MB at a time). I am convinced that I saw an overload of a hash function that allowed me to calculate an MD5 (or other) hash at the same time as writing, i.e. calculating the hash of one buffer, then feeding that resulting hash into the next iteration.

    Something like this: (pseudocode-ish)

    byte [] hash = new byte [] { 0,0,0,0,0,0,0,0 };
    while(!eof)
    {
       buffer = readFromSourceFile();
       writefile(buffer);
       hash = calchash(buffer, hash);
    }
    

    hash is now similar to what would be accomplished by running the calcHash function on the entire file.

    Now, I can't find any overloads like that in the .NET 3.5 Framework. Am I dreaming? Has it never existed, or am I just lousy at searching? The reason for doing both the writing and the checksum calculation at once is that the files are large, so reading them a second time is expensive.

  • Pascal Cuoq
    Pascal Cuoq over 14 years
    Good answer, but the "where is it in .net?" part of the question remains open.
  • Adam Liss
    Adam Liss over 14 years
    Ok, but +1 for also providing a reference!
  • Adam Liss
    Adam Liss over 14 years
    @Pascal: See the 2 good answers below, both of which had been posted before your comment.
  • Harry
    Harry over 14 years
    Ay caramba! There it is! That was the function I was searching for. Good to know I wasn't making it all up. Thanks to Guffa and Rubens for providing the correct answer so promptly. +1 to you both, I will accept this answer because of the included code example.
  • Eamon Nerbonne
    Eamon Nerbonne about 13 years
    Note that you can equivalently replace the second instance of block by null in the call to TransformBlock; you don't actually want any copying to occur; the output parameter isn't actually doing anything with respect to the hashing.
  • Cumbayah
    Cumbayah over 12 years
    That link is dead, try this instead: infinitec.de/post/2007/06/09/…
  • RandomInsano
    RandomInsano over 12 years
    Also, TransformFinalBlock can take zero for the length.
  • Poul K. Sørensen
    Poul K. Sørensen over 8 years
    Is it possible to transform the first X blocks of data, dump the state data and then continue with the next blocks after restoring the state in a new calculation? With a 100 GB file in a cloud solution, it would be nice not to have to go over the whole file in one go; machines could recycle, etc.
  • Guffa
    Guffa over 8 years
    @pksorensen: I don't think so, I don't see any methods or properties for getting or setting the computational state of the MD5 object. In theory it's of course possible, but you might need to use a separate implementation of the algorithm so that you can add methods for handling the state.
  • Shimmy Weitzhandler
    Shimmy Weitzhandler over 6 years
    What's ArrayPool?
  • Shimmy Weitzhandler
    Shimmy Weitzhandler over 6 years
    OK got it: ArrayPool, need to install package System.Buffers.
  • Khale_Kitha
    Khale_Kitha over 2 years
    This is useful, but not a .NET 3.5 solution