Using hashlib to compute the MD5 digest of a file in Python 3
Solution 1
I think you wanted the for-loop to make successive calls to f.read(128). That can be done using iter() and functools.partial():
import hashlib
from functools import partial

def md5sum(filename):
    with open(filename, mode='rb') as f:
        d = hashlib.md5()
        for buf in iter(partial(f.read, 128), b''):
            d.update(buf)
    return d.hexdigest()

print(md5sum('utils.py'))
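To see why this loop terminates, note that two-argument iter() calls partial(f.read, 128) repeatedly until it returns the sentinel b''. A minimal sketch of the idiom (io.BytesIO stands in for a real file; the payload and chunk size are arbitrary):

import io
from functools import partial

f = io.BytesIO(b'abcdefghij')                 # stands in for open(filename, 'rb')
chunks = list(iter(partial(f.read, 4), b''))
print(chunks)                                 # [b'abcd', b'efgh', b'ij']
# iteration stopped when f.read(4) returned the sentinel b''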
Solution 2
for buf in f.read(128):
    d.update(buf)
...updates the hash sequentially with each of the first 128 byte values of the file. Since iterating over a bytes object produces int objects, you get the following calls, which cause the error you encountered in Python 3.
d.update(97)
d.update(98)
d.update(99)
d.update(100)
which is not what you want.
Instead, you want:
def md5sum(filename):
    with open(filename, mode='rb') as f:
        d = hashlib.md5()
        while True:
            buf = f.read(4096)  # 128 is smaller than the typical filesystem block
            if not buf:
                break
            d.update(buf)
    return d.hexdigest()
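As a sanity check (not part of the original answer), the chunked digest should match hashing the same content in one shot. A small sketch, assuming the md5sum() above is in scope; 'testfile.bin' and the payload are made up for the test:

import hashlib

data = b'spam' * 4097                      # arbitrary payload, not a multiple of 4096
with open('testfile.bin', 'wb') as f:      # hypothetical scratch file
    f.write(data)

assert md5sum('testfile.bin') == hashlib.md5(data).hexdigest()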
Solution 3
I finally changed my code to the version below (which I find easy to understand) after asking the question. But I will probably change it to the version suggested by Raymond Hettinger using functools.partial.
import hashlib

def chunks(filename, chunksize):
    # the with statement (suggested in the comments) closes the file properly
    with open(filename, mode='rb') as f:
        buf = "Let's go"
        while len(buf):
            buf = f.read(chunksize)
            yield buf

def md5sum(filename):
    d = hashlib.md5()
    for buf in chunks(filename, 128):
        d.update(buf)
    return d.hexdigest()
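Note that this generator yields one final empty chunk: the read that returns b'' is yielded before the while condition stops the loop. That is harmless here, since updating a hash with an empty byte string is a no-op. A quick sketch of both points, using a hypothetical chunks_from() variant that takes an already-open file object:

import hashlib
import io

def chunks_from(f, chunksize):
    # same loop as chunks() above, minus the open()
    buf = "Let's go"
    while len(buf):
        buf = f.read(chunksize)
        yield buf

print(list(chunks_from(io.BytesIO(b'abcdef'), 4)))  # [b'abcd', b'ef', b'']

d = hashlib.md5(b'abc')
before = d.hexdigest()
d.update(b'')                                       # no-op: digest is unchanged
assert d.hexdigest() == before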
kriss
Updated on February 12, 2020

Comments
-
kriss over 4 years
With Python 2.7, the following code computes the MD5 hexdigest of the content of a file.
(EDIT: well, not really, as the answers have shown; I just thought so.)
import hashlib

def md5sum(filename):
    f = open(filename, mode='rb')
    d = hashlib.md5()
    for buf in f.read(128):
        d.update(buf)
    return d.hexdigest()
Now if I run that code using Python 3, it raises a TypeError exception:
d.update(buf)
TypeError: object supporting the buffer API required
I figured out that I could make that code run with both Python 2 and Python 3 by changing it to:
def md5sum(filename):
    f = open(filename, mode='r')
    d = hashlib.md5()
    for buf in f.read(128):
        d.update(buf.encode())
    return d.hexdigest()
Now I still wonder why the original code stopped working. It seems that when a file is opened in binary mode, iterating over the data it returns yields integers instead of strings encoded as bytes (I say that because type(buf) returns int). Is this behavior explained somewhere?
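For illustration, a minimal sketch of the behavior being asked about, under Python 3 (b'abc' stands in for what f.read() returns in binary mode):

buf = b'abc'
for item in buf:
    print(type(item), item)    # <class 'int'> 97, then 98, then 99
print(list(buf))               # [97, 98, 99]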
-
jfs over 12 years
related: stackoverflow.com/q/4949162
-
rakslice almost 11 years
Would it be faster if you did larger reads, closer to the filesystem's block size? (for instance 1024 bytes on Linux ext3, and 4096 bytes or more on Windows NTFS)
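One hedged way to test that claim would be to time the same chunked digest at several buffer sizes; a sketch, where 'bigfile.bin' is a placeholder for any large local file:

import hashlib
import timeit

def md5sum_chunked(filename, chunksize):
    d = hashlib.md5()
    with open(filename, mode='rb') as f:
        for buf in iter(lambda: f.read(chunksize), b''):
            d.update(buf)
    return d.hexdigest()

for size in (128, 1024, 4096, 65536):
    t = timeit.timeit(lambda: md5sum_chunked('bigfile.bin', size), number=3)
    print(size, round(t, 3))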
-
Umur Kontacı over 12 years
This will eat the whole RAM if you open a huge file. That's why we buffer.
-
phihag over 12 years
@fastreload Already added that ;). Since the original solution didn't even work for files with >128 bytes, I don't think memory is an issue, but I added a buffered read anyway.
-
Umur Kontacı over 12 years
Well done then, yet the OP claimed that he could use his code in Python 2.x and that it stopped working on 3.x. And I remember I made a 1-byte buffer for calculating the MD5 of a 3 GB ISO file for benchmarking, and it did not fail. My bet is, Python 2.7 has a failsafe mechanism so that whatever the user input is, the minimum buffer size does not go below a certain level. What do you say?
-
phihag over 12 years
@fastreload The code didn't crash in Python 2 since iterating over a str produced str objects. The result was still wrong for files larger than 128 bytes. Sure, you can adjust the buffer size as you want (unless you have a fast SSD, the CPU will get bored anyway, and good OSs preload the next bytes of the file). Python 2.7 definitely has no such failsafe mechanism; that would violate the contract of read. The OP simply did not compare the results of the script with the canonical md5sum's, or the results of the script on two files with identical first 128 bytes.
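For reference, one way to do that comparison, sketched under the assumption of a Unix-like system (Python 3.7+) with the md5sum utility installed and the md5sum() function from Solution 1 in scope ('utils.py' is a placeholder):

import subprocess

out = subprocess.run(['md5sum', 'utils.py'],
                     capture_output=True, text=True, check=True)
cli_digest = out.stdout.split()[0]   # md5sum prints "<digest>  <filename>"
assert cli_digest == md5sum('utils.py')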
-
kriss over 12 years
Yes, that's exactly what I was trying to do. I finally achieved that with a less elegant solution than yours, using a generator.
-
kriss over 12 years
Yes, my original code is indeed broken (but not yet in the wild). I just didn't test it on large files with the same beginning. I should have guessed there was a real problem, as it was running way too fast.
-
phihag over 12 years
This leaks the file handle on some Python implementations. You should at least call close.
-
jfs over 12 years
I've added a with statement to close the file properly.
-
Mapio over 12 years
This will not work if the file length is not a multiple of chunksize; read will in fact return a shorter buffer in the last read. The termination is given by an empty buffer; that's why the "not buf" condition is in the example code above (which works).
-
kriss over 12 years
@Mapio: there is indeed a kind of bug in my code, but not at all where you say. The file length is irrelevant. The code above works provided there is no partial read returning incomplete buffers. If a partial read occurs, it will stop too soon (but will take the partial buffer into account). A partial read may occur in some cases, say if the program receives a managed interrupt signal while reading and then continues reading after returning from the interruption.
-
kriss over 12 years
@phihag: is there really a Python implementation where the automatic close actually leaks file handles? I thought it merely delayed the releasing of those file handles until garbage collection?
-
kriss over 12 years
But the with statement is indeed nice anyway.
-
phihag over 12 years
@kriss Oops, you're right - close gets called eventually, even on Jython. However, that's only the case if you don't have an exception stacktrace lying around in sys.exc_info (for example if a read failed), so it's good form to call close or use the with statement.
-
kriss over 12 years
Well, in the above comment, when speaking of "code above" I'm referring to the old version. This current one is now working (even if it's not the best possible solution).
-
Raymond Hettinger over 12 years
This answer is incorrect when it says, "iterating over a bytes produces str objs". list(b'abc') --> [97, 98, 99]
-
phihag over 12 years
@RaymondHettinger Oops, stupid me. Tested it in 2.7 and was surprised to get strs - duh. Fixed.
-
Raymond Hettinger almost 12 years
@J.F.Sebastian Adding the with-statement "improved" the code at the expense of obfuscating the answer to the OP's question. A lot of people get confused or distracted by with-statement semantics, so it doesn't belong in an answer addressing iteration fundamentals. People who get hung up on "leaking file handles" are wasting their time on something that almost never matters in real code. The with-statement is nice, but automatic file closing is a separate topic that isn't worth the distraction from an otherwise clear answer to a basic question about reading files in chunks.
-
jfs almost 12 years
@RaymondHettinger: if you don't like it, just revert the change. I considered it too minor a change to discuss, though I strongly disagree with your reasoning. Public code should follow best practices, especially if it is aimed at beginners. If best practices are too hard to follow for such a common task (though I don't think that is the case), then the language should change.