Generating an MD5 checksum of a file

Solution 1

You can use hashlib.md5().

Note that sometimes you won't be able to fit the whole file in memory. In that case, you'll have to read the file in chunks of 4096 bytes sequentially and feed each chunk to the hash object's update() method:

import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        # iter() with a b"" sentinel stops the loop at end of file
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

Note: hash_md5.hexdigest() will return the hex string representation of the digest. If you just need the packed bytes, use return hash_md5.digest() instead, so you don't have to convert back.
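
As a quick illustration of the relationship between the two, hexdigest() is simply the hex encoding of the raw digest() bytes:

import hashlib

h = hashlib.md5(b"hello")
# hexdigest() is just the hex encoding of digest()
assert h.hexdigest() == h.digest().hex()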

Solution 2

There is a way that's simple but pretty memory inefficient: it reads the entire file into memory at once.

single file:

import hashlib

def file_as_bytes(file):
    with file:
        return file.read()

print(hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest())

list of files:

[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

Recall, though, that MD5 is known to be broken and should not be used for any purpose, since vulnerability analysis can be really tricky and analyzing every possible future use your code might be put to for security issues is impossible. IMHO, it should be flat-out removed from the library so everybody who uses it is forced to update. So, here's what you should do instead:

[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

If you only want 128 bits' worth of digest, you can take .digest()[:16].

This will give you a list of tuples, each tuple containing the name of its file and its hash.
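
Since the question also asks about checking checksums, here is a minimal sketch of the verification side (verify_md5 and expected_hex are names of my own choosing, not from any library):

import hashlib

def verify_md5(fname, expected_hex):
    # Recompute the file's MD5 and compare it to a known hex digest.
    with open(fname, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest() == expected_hex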

Again, I strongly question your use of MD5. You should at least be using SHA-1, and given the recent flaws discovered in SHA-1, probably not even that. Some people think that as long as you're not using MD5 for 'cryptographic' purposes, you're fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It's best to just get in the habit of using the right algorithm out of the gate. It's just typing a different bunch of letters. It's not that hard.

Here is a way that is more complex, but memory efficient:

import hashlib

def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    # Feed every block from the iterator into the given hasher.
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    # The with statement closes the file once iteration finishes.
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)


[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
    for fname in fnamelst]

And, again, since MD5 is broken and should not really ever be used anymore:

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
    for fname in fnamelst]

Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits' worth of digest.
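
For instance, the SHA-256 variant truncated to 128 bits would be:

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256())[:16])
    for fname in fnamelst]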

Solution 3

I'm clearly not adding anything fundamentally new, but I added this answer before I was up to commenting status, and the code regions make things clearer. Anyway, this is specifically to answer @Nemo's question from Omnifarious's answer:

I happened to be thinking about checksums a bit (came here looking for suggestions on block sizes, specifically), and have found that this method may be faster than you'd expect. Taking the fastest (but pretty typical) timeit.timeit or /usr/bin/time result from each of several methods of checksumming a file of approx. 11MB:

$ ./sum_methods.py
crc32_mmap(filename) 0.0241742134094
crc32_read(filename) 0.0219960212708
subprocess.check_output(['cksum', filename]) 0.0553209781647
md5sum_mmap(filename) 0.0286180973053
md5sum_read(filename) 0.0311000347137
subprocess.check_output(['md5sum', filename]) 0.0332629680634
$ time md5sum /tmp/test.data.300k
d3fe3d5d4c2460b5daacc30c6efbc77f  /tmp/test.data.300k

real    0m0.043s
user    0m0.032s
sys     0m0.010s
$ stat -c '%s' /tmp/test.data.300k
11890400

So, looks like both Python and /usr/bin/md5sum take about 30ms for an 11MB file. The relevant md5sum function (md5sum_read in the above listing) is pretty similar to Omnifarious's:

import hashlib
def md5sum(filename, blocksize=65536):
    hash_md5 = hashlib.md5()  # avoid shadowing the built-in hash()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hash_md5.update(block)
    return hash_md5.hexdigest()

Granted, these are from single runs (the mmap ones are always a smidge faster when at least a few dozen runs are made), and mine usually has an extra f.read(blocksize) after the buffer is exhausted, but it's reasonably repeatable and shows that md5sum on the command line is not necessarily faster than a Python implementation...
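
Only md5sum_read is shown above; for reference, here is a sketch of what the md5sum_mmap variant might look like (my reconstruction, not the original benchmark code):

import hashlib
import mmap

def md5sum_mmap(filename):
    # Map the file into memory and hash the whole mapping in one call.
    # Note: mmap.mmap() raises ValueError for empty files.
    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return hashlib.md5(mm).hexdigest()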

EDIT: Sorry for the long delay; I haven't looked at this in some time, but to answer @EdRandall's question, I'll write down an Adler-32 implementation. However, I haven't run the benchmarks for it. It's basically the same as the CRC-32 would have been: instead of the init, update, and digest calls, everything is a zlib.adler32() call:

import zlib

def adler32sum(filename, blocksize=65536):
    checksum = zlib.adler32(b"")  # seed with the Adler-32 of empty input, which is 1
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff

Note that this must start off with the empty byte string, as Adler sums do indeed differ when starting from zero versus their sum for b"", which is 1; CRC can start with 0 instead. The AND-ing is needed to make it a 32-bit unsigned integer, which ensures it returns the same value across Python versions.
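
A quick check of the seed values in question:

import zlib

# Adler-32 of empty input is 1; CRC-32 of empty input is 0.
assert zlib.adler32(b"") == 1
assert zlib.crc32(b"") == 0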

Solution 4

In Python 3.8+, you can use the assignment expression (walrus) operator := (along with hashlib) like this:

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

Consider using hashlib.blake2b instead of md5 (just replace md5 with blake2b in the above snippet). It's cryptographically secure and faster than MD5.
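
With that substitution, the snippet becomes:

import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.blake2b()  # same streaming API as md5
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.hexdigest())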

Solution 5

import hashlib, pathlib

hashlib.md5(pathlib.Path('path/to/file').read_bytes()).hexdigest()

Comments

  • Alexander
    Alexander almost 2 years

    Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I'm working on, and I'd like to confirm the checksums of the files).

  • Alexander
    Alexander over 13 years
    I'm only using MD5 to confirm the file isn't corrupted. I'm not so concerned about it being broken.
  • President James K. Polk
    President James K. Polk over 13 years
@TheLifelessOne: And despite @Omnifarious's scary warnings, that is a perfectly good use of MD5.
  • Omnifarious
    Omnifarious over 13 years
    @GregS, @TheLifelessOne - Yeah, and next thing you know someone finds a way to use this fact about your application to cause a file to be accepted as uncorrupted when it isn't the file you're expecting at all. No, I stand by my scary warnings. I think MD5 should be removed or come with deprecation warnings.
  • Austin S.
    Austin S. over 11 years
While @quantumSoup has a viable answer, I believe this one should be selected as the proper method for retrieving a file's md5 checksum. However, it could be simplified to "hashlib.md5(open(fname, 'r').read()).digest()". You should note that the "file" function was changed to "open" for use with Python 2.7+
  • Omnifarious
    Omnifarious over 11 years
@AustinS.: nod Yeah. I fixed it to say open. I believe that's worked ever since hashlib was introduced, and possibly has always worked. Old habits die hard.
  • zbstof
    zbstof over 11 years
    I'd probably use .hexdigest() instead of .digest() - it's easier for humans to read - which is the purpose of OP.
  • Omnifarious
    Omnifarious over 11 years
    @Zotov: I would remove hexdigest from the standard hashlib hash function interface. I feel that it's an unnecessary wart. And I like making even small functions widely applicable. There are many cases in which the hex of the hash is quite unnecessarily verbose and making that the easiest to use version is encouraging people to be verbose when they don't have to be. But yes, in this case, for this specific purpose it is likely the better choice. I would still just use binascii.hexlify instead. :-)
  • BlueCoder
    BlueCoder about 11 years
I used this solution but it incorrectly gave the same hash for two different pdf files. The solution was to open the files by specifying binary mode, that is: [(fname, hashlib.md5(open(fname, 'rb').read()).hexdigest()) for fname in fnamelst] This is more related to the open function than md5 but I thought it might be useful to report it given the requirement for cross-platform compatibility stated above (see also: docs.python.org/2/tutorial/…).
  • Omnifarious
    Omnifarious about 11 years
    @BlueCoder: Oh, you're right. I should've done that. I'm so used to Unix where the two are synonymous. I'll fix it now.
  • hyde
    hyde about 11 years
    @Omnifarious Saying "remove MD5 from the Python library" or even just saying "add deprecation warning to Python library" is like saying "Python should not be used, if existing stuff requires MD5, please use something else". Explain security implications in docs, sure, but removal or even just deprecation is insane suggestion.
  • Omnifarious
    Omnifarious about 11 years
    @hyde: Something has to be done to get people to stop using that stupid algorithm. I've had jobs where they persisted in using it even after I demonstrated that it created security holes (admittedly rather obscure ones) and that SHA had a faster implementation in OpenSSL, which was the library we were using. It's insane.
  • Nemo
    Nemo over 10 years
    Any way for this to be at most one order of magnitude slower than md5sum on the command line?
  • Larpon
    Larpon over 9 years
For people using the def hashfile function above multiple times on the same file handle, remember to reset the afile pointer when done reading each file, e.g. afile.seek(0)
  • Ed Randall
    Ed Randall about 9 years
    Could you possibly add a couple of lines comparing SHA1, and also zlib.adler32 maybe?
  • Jason S
    Jason S about 9 years
    Reminder: the known weaknesses for MD5 are collision attacks, and not preimage attacks, so it is suitable for some cryptographic applications but not others. If you don't know the difference you shouldn't be using it, but don't discard it altogether. See vpnc.org/hash.html.
  • koddo
    koddo over 8 years
Is it ok to not close the opened files in those list comprehensions?
  • hyperknot
    hyperknot over 8 years
    Yes, I wanted to ask the same thing. Isn't a close() missing here?
  • MikeW
    MikeW over 8 years
    @EdRandall: adler32 is really not worth bothering with, eg. leviathansecurity.com/blog/analysis-of-adler32
  • Roman Shapovalov
    Roman Shapovalov over 7 years
    No, it is not okay. The files will be closed on garbage collection, likely in the end of the enclosing function. If, for example, the number of elements in fnamelist is greater than the limit set by your OS, it will fail. But that is irrelevant to the question asked. We should use SO to get the gist, not copy the snippets blindly. :)
  • max
    max over 7 years
    @BlueCoder How did it happen that two different pdf files had the same hash, even if opened without mode=rb? Shouldn't rt simply convert newlines and otherwise be identical to rb? (I assume this is python 2, since in python 3 hashlib.md5 requires bytes, and will simply refuse to accept a string,)
  • Omnifarious
    Omnifarious over 7 years
    @RomanShapovalov - I was relying on the reference counted nature of Python objects. After each element of the list comprehension is evaluated, there are no more references to it. I do agree that's rather tenuous and relying overly much on implementation. :-/ I like the interface for hashfile though, it's more flexible because it handles anything that has read.
  • Omnifarious
    Omnifarious over 7 years
    @RomanShapovalov - I fixed it so that it no longer has a potential resource leak, even though the current CPython implementation doesn't. I agree that it should avoid leaking even on Jython or future possible implementations of CPython.
  • Omnifarious
    Omnifarious over 6 years
    @JasonS - I can stick my hand in liquid nitrogen briefly and it won't be harmed. That doesn't mean I should do it. There are lots of alternatives to MD5 that are widely available. There is no more reason for anybody to use it than there is for me to stick my hand in liquid nitrogen.
  • Jason S
    Jason S over 6 years
    Nope. Sorry. Bad analogy.
  • Omnifarious
    Omnifarious over 6 years
    @JasonS - Can you give a rational reason anybody should use MD5 that's not one of these two: "Well, I think I can get away with it in this circumstance." or "I have to interoperate with something else that uses MD5."?
  • Jason S
    Jason S over 6 years
    The entirety of life is about "I think I can get away with it in this circumstance" --- or more objectively stated, risk management, which applies to all cryptographic systems, MD5 and SHA1 included. Read up on the state-of-the-art on MD5 preimage attacks. I don't put bars on all my windows at home, and I use MD5 when I am doing garden-variety integrity checks where a malicious adversary is not present (e.g. copying files from one PC to another)
  • Jason S
    Jason S over 6 years
web.archive.org/web/20150901084550/http://www.vpnc.org/… -- "The difference between a collision attack and either of the two preimage attacks is crucial. At the time of this writing, there are no practical preimage attacks, meaning that if your use of hashes is only susceptible to preimage attacks, even MD5 is just fine because an attacker would have to make 2^128 guesses, which will be infeasible for many decades (if ever)."
  • Omnifarious
    Omnifarious over 6 years
    @JasonS - And in so doing, you are perpetuating the use and very existence of an algorithm that is broken for a wide variety of other uses. Using a proper algorithm isn't like putting bars on your windows. Using the right algorithm is a matter of typing a few letters differently. There is no good reason to use MD5 at all for anything. It has no quality that recommends it over SHA256 in any reasonable situation.
  • Jason S
    Jason S over 6 years
    I'm not continuing this discussion, you're just being ideological about your rejection of MD5.
  • Omnifarious
    Omnifarious over 6 years
    @JasonS - I would argue that you are being ideological in your refusal to reject an algorithm that has perfectly viable replacements that there is no good reason whatsoever to not use. "I learned to type MD5 darn it, and nobody is going to tell me I can't. Those other letters, they're weird and my fingers can't type them!"
  • Khanh Le
    Khanh Le over 6 years
I just need to check the same image, thus, using hashlib.md5(open(full_path, 'rb').read()).hexdigest() is good enough. Thanks!
  • Omnifarious
    Omnifarious over 6 years
    @LittleZero - Is md5 that much easier to type than sha256? I'm just poking at this, because it's better to just forget the broken algorithm ever existed, no matter how safe it is to use in certain contexts. Retrain yourself to never even think of using the broken algorithm, and then you won't end up using it when it matters.
  • Rohit Taneja
    Rohit Taneja about 6 years
    We should release resources. Open file with with statement or write code to close file.
  • Omnifarious
    Omnifarious about 6 years
    @RohitTaneja - Resources are being released. The file object is immediately associated with a with statement inside file_as_blockiter.
  • Rohit Taneja
    Rohit Taneja about 6 years
@Omnifarious I am talking about the first 3 code snippets, e.g. import hashlib; [(fname, hashlib.md5(open(fname, 'rb').read()).digest()) for fname in fnamelst]
  • Omnifarious
    Omnifarious about 6 years
    @RohitTaneja - Ahh, the ones I mean as bad examples. :-) Yes, I suppose I should fix that. They aren't supposed to be that kind of bad example.
  • d_kennetz
    d_kennetz about 5 years
    Hi! Please add some explanation to your code as to why this is a solution to the problem. Furthermore, this post is pretty old, so you should also add some information as to why your solution adds something that the others have not already addressed.
  • Erik Aronesty
    Erik Aronesty over 4 years
    It's another memory inefficient way
  • Omnifarious
    Omnifarious over 4 years
    @ChadLowe - That makes no sense. I just tested it, and it works fine on a zero length file. What problem did you have? Or did it just look wrong, and so you had to fix it? There is no reason the iterator has to yield at least once. It will just never call update, and that's the exact same result as if you feed update a single empty string.
  • Chad Lowe
    Chad Lowe over 4 years
    You are correct. I'm not sure what I was doing before, but your code works as expected now. Just goes to show, I should always look at my own code for the problem first ;)
  • breakthewall
    breakthewall almost 4 years
    One-line solution. Perfect for a couple of tests!
  • alper
    alper over 2 years
How could I decode the hex string? It differs from the output of what md5sum returns