Using hashlib to compute the MD5 digest of a file in Python 3
Solution 1
I think you wanted the for-loop to make successive calls to f.read(128). That can be done using iter() and functools.partial():
import hashlib
from functools import partial

def md5sum(filename):
    with open(filename, mode='rb') as f:
        d = hashlib.md5()
        for buf in iter(partial(f.read, 128), b''):
            d.update(buf)
    return d.hexdigest()

print(md5sum('utils.py'))
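To see why this loop terminates, note that two-argument iter() calls partial(f.read, 128) repeatedly until it returns the sentinel b''. A minimal sketch of the idiom (io.BytesIO stands in for a real file; the payload and chunk size are arbitrary):

import io
from functools import partial

f = io.BytesIO(b'abcdefghij')                 # stands in for open(filename, 'rb')
chunks = list(iter(partial(f.read, 4), b''))
print(chunks)                                 # [b'abcd', b'efgh', b'ij']
# iteration stopped when f.read(4) returned the sentinel b''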
Solution 2
for buf in f.read(128):
    d.update(buf)
...updates the hash sequentially with each of the first 128 byte values of the file. Since iterating over a bytes object produces int objects, you get the following calls, which cause the error you encountered in Python 3.
d.update(97)
d.update(98)
d.update(99)
d.update(100)
which is not what you want.
Instead, you want:
def md5sum(filename):
    with open(filename, mode='rb') as f:
        d = hashlib.md5()
        while True:
            buf = f.read(4096)  # 128 is smaller than the typical filesystem block
            if not buf:
                break
            d.update(buf)
    return d.hexdigest()
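As a sanity check (not part of the original answer), the chunked digest should match hashing the same content in one shot. A small sketch, assuming the md5sum() above is in scope; 'testfile.bin' and the payload are made up for the test:

import hashlib

data = b'spam' * 4097                      # arbitrary payload, not a multiple of 4096
with open('testfile.bin', 'wb') as f:      # hypothetical scratch file
    f.write(data)

assert md5sum('testfile.bin') == hashlib.md5(data).hexdigest()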
Solution 3
I finally changed my code to the version below (which I find easy to understand) after asking the question. But I will probably change it to the version suggested by Raymond Hettinger using functools.partial.
import hashlib

def chunks(filename, chunksize):
    # the with statement (suggested in the comments) closes the file properly
    with open(filename, mode='rb') as f:
        buf = "Let's go"
        while len(buf):
            buf = f.read(chunksize)
            yield buf

def md5sum(filename):
    d = hashlib.md5()
    for buf in chunks(filename, 128):
        d.update(buf)
    return d.hexdigest()
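Note that this generator yields one final empty chunk: the read that returns b'' is yielded before the while condition stops the loop. That is harmless here, since updating a hash with an empty byte string is a no-op. A quick sketch of both points, using a hypothetical chunks_from() variant that takes an already-open file object:

import hashlib
import io

def chunks_from(f, chunksize):
    # same loop as chunks() above, minus the open()
    buf = "Let's go"
    while len(buf):
        buf = f.read(chunksize)
        yield buf

print(list(chunks_from(io.BytesIO(b'abcdef'), 4)))  # [b'abcd', b'ef', b'']

d = hashlib.md5(b'abc')
before = d.hexdigest()
d.update(b'')                                       # no-op: digest is unchanged
assert d.hexdigest() == before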
kriss
Updated on February 12, 2020

Comments
-
kriss over 4 years
With Python 2.7, the following code computes the MD5 hexdigest of the content of a file.
(EDIT: well, not really, as the answers have shown; I just thought so.)
import hashlib

def md5sum(filename):
    f = open(filename, mode='rb')
    d = hashlib.md5()
    for buf in f.read(128):
        d.update(buf)
    return d.hexdigest()
Now if I run that code using Python 3, it raises a TypeError exception:
d.update(buf)
TypeError: object supporting the buffer API required
I figured out that I could make that code run with both Python 2 and Python 3 by changing it to:
def md5sum(filename):
    f = open(filename, mode='r')
    d = hashlib.md5()
    for buf in f.read(128):
        d.update(buf.encode())
    return d.hexdigest()
Now I still wonder why the original code stopped working. It seems that when a file is opened in binary mode, iterating over the data it returns yields integers instead of strings encoded as bytes (I say that because type(buf) returns int). Is this behavior explained somewhere?
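For illustration, a minimal sketch of the behavior being asked about, under Python 3 (b'abc' stands in for what f.read() returns in binary mode):

buf = b'abc'
for item in buf:
    print(type(item), item)    # <class 'int'> 97, then 98, then 99
print(list(buf))               # [97, 98, 99]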
-
jfs over 12 years
related: stackoverflow.com/q/4949162
-
rakslice almost 11 years
Would it be faster if you did larger reads, closer to the filesystem's block size? (for instance 1024 bytes on Linux ext3, and 4096 bytes or more on Windows NTFS)
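One hedged way to test that claim would be to time the same chunked digest at several buffer sizes; a sketch, where 'bigfile.bin' is a placeholder for any large local file:

import hashlib
import timeit

def md5sum_chunked(filename, chunksize):
    d = hashlib.md5()
    with open(filename, mode='rb') as f:
        for buf in iter(lambda: f.read(chunksize), b''):
            d.update(buf)
    return d.hexdigest()

for size in (128, 1024, 4096, 65536):
    t = timeit.timeit(lambda: md5sum_chunked('bigfile.bin', size), number=3)
    print(size, round(t, 3))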
-
Umur Kontacı over 12 years
This will eat the whole RAM if you open a huge file. That's why we buffer.
-
phihag over 12 years
@fastreload Already added that ;). Since the original solution didn't even work for files with >128 bytes, I don't think memory is an issue, but I added a buffered read anyway.
-
Umur Kontacı over 12 years
Well done then, yet the OP claimed that he could use his code in Python 2.x and that it stopped working on 3.x. And I remember I made a 1-byte buffer for calculating the MD5 of a 3 GB ISO file for benchmarking, and it did not fail. My bet is, Python 2.7 has a failsafe mechanism so that whatever the user input is, the minimum buffer size does not go below a certain level. What do you say?
-
phihag over 12 years
@fastreload The code didn't crash in Python 2 since iterating over a str produced str objects. The result was still wrong for files larger than 128 bytes. Sure, you can adjust the buffer size as you want (unless you have a fast SSD, the CPU will get bored anyway, and good OSs preload the next bytes of the file). Python 2.7 definitely has no such failsafe mechanism; that would violate the contract of read. The OP simply did not compare the results of the script with the canonical md5sum's, or the results of the script on two files with identical first 128 bytes.
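For reference, one way to do that comparison, sketched under the assumption of a Unix-like system (Python 3.7+) with the md5sum utility installed and the md5sum() function from Solution 1 in scope ('utils.py' is a placeholder):

import subprocess

out = subprocess.run(['md5sum', 'utils.py'],
                     capture_output=True, text=True, check=True)
cli_digest = out.stdout.split()[0]   # md5sum prints "<digest>  <filename>"
assert cli_digest == md5sum('utils.py')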
-
kriss over 12 years
Yes, that's exactly what I was trying to do. I finally achieved that with a less elegant solution than yours, using a generator.
-
kriss over 12 years
Yes, my original code is indeed broken (but not yet in the wild). I just didn't test it on large files with the same beginning. I should have guessed there was a real problem, as it was running way too fast.
-
phihag over 12 years
This leaks the file handle on some Python implementations. You should at least call close.
-
jfs over 12 years
I've added a with statement to close the file properly.
-
Mapio over 12 years
This will not work if the file length is not a multiple of chunksize; read will in fact return a shorter buffer in the last read. The termination is given by an empty buffer; that's why the "not buf" condition is in the example code above (which works).
-
kriss over 12 years
@Mapio: there is indeed a kind of bug in my code, but not at all where you say. The file length is irrelevant. The code above works provided there is no partial read returning incomplete buffers. If a partial read occurs, it will stop too soon (but will take the partial buffer into account). A partial read may occur in some cases, say if the program receives a managed interrupt signal while reading and then continues reading after returning from the interruption.
-
kriss over 12 years
@phihag: is there really a Python implementation where the automatic close actually leaks file handles? I thought it merely delayed the releasing of those file handles until garbage collection?
-
kriss over 12 years
But the with statement is indeed nice anyway.
-
phihag over 12 years
@kriss Oops, you're right - close gets called eventually, even on Jython. However, that's only the case if you don't have an exception stacktrace lying around in sys.exc_info (for example if a read failed), so it's good form to call close or use the with statement.
-
kriss over 12 years
Well, in the above comment, when speaking of "code above" I'm referring to the old version. This current one is now working (even if it's not the best possible solution).
-
Raymond Hettinger over 12 years
This answer is incorrect when it says, "iterating over a bytes produces str objs". list(b'abc') --> [97, 98, 99]
-
phihag over 12 years
@RaymondHettinger Oops, stupid me. Tested it in 2.7 and was surprised to get strs - duh. Fixed.
-
Raymond Hettinger almost 12 years
@J.F.Sebastian Adding the with-statement "improved" the code at the expense of obfuscating the answer to the OP's question. A lot of people get confused or distracted by with-statement semantics, so it doesn't belong in an answer addressing iteration fundamentals. People who get hung up on "leaking file handles" are wasting their time on something that almost never matters in real code. The with-statement is nice, but automatic file closing is a separate topic that isn't worth the distraction from an otherwise clear answer to a basic question about reading files in chunks.
-
jfs almost 12 years
@RaymondHettinger: if you don't like it, just revert the change. I considered it too minor a change to discuss, though I strongly disagree with your reasoning. Public code should follow best practices, especially if it is aimed at beginners. If best practices are too hard to follow for such a common task (though I don't think that is the case), then the language should change.