Hashing a file in Python


Solution 1

TL;DR: use buffers so you don't use tons of memory.

We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don't want this bad boy to churn through 2 GB of RAM for a 2 gigabyte file, so, as pasztorpisti points out, we've got to deal with those bigger files in chunks!

import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # let's read stuff in 64 KiB chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

What we've done is update our hashes of this bad boy in 64 KiB chunks as we go along, using hashlib's handy dandy update method. This way we use a lot less memory than the 2 GB it would take to hash the guy all at once!

You can test this with:

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile
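
(mkfile is macOS-specific. If it isn't available, a zero-filled test file of the same size can be created with a short Python sketch like the one below — the file name matches the example above, and on filesystems without sparse-file support this will actually allocate the full 2 GB.)

# rough stand-in for `mkfile 2g bigfile` on systems without mkfile
with open("bigfile", "wb") as f:
    f.truncate(2 * 1024 ** 3)  # extend to 2 GiB; unwritten bytes read back as zeros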

Hope that helps!

Also, all of this is outlined in the linked question on the right-hand side: Get MD5 hash of big files in Python


Addendum!

In general, when writing Python it helps to get into the habit of following PEP 8. For example, in Python variables are typically underscore_separated, not camelCased. But that's just style, and no one really cares about those things except people who have to read bad style... which might be you reading this code years from now.
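
A tiny illustration of that naming convention (the variable is just an example):

# PEP 8 style: lowercase_with_underscores for variable names
buf_size = 65536   # preferred
bufSize = 65536    # legal, but not the usual Python convention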

Solution 2

For the correct and efficient computation of the hash value of a file (in Python 3):

  • Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
  • Don't read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
  • Eliminate double buffering, i.e. don't use buffered IO, because we already use an optimal block size.
  • Use readinto() to avoid buffer churning.

Example:

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        while n := f.readinto(mv):
            h.update(mv[:n])
    return h.hexdigest()

Note that the while loop uses an assignment expression which isn't available in Python versions older than 3.8.

With older Python 3 versions you can use an equivalent variation:

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda: f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()
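
Either version is used the same way; for example (the file name is just a placeholder):

# prints the hex digest, which should match e.g. `sha256sum bigfile` on the command line
print(sha256sum("bigfile"))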

Solution 3

I would propose simply:

import hashlib

def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

All other answers here seem to complicate things too much. Python already buffers when reading (in an ideal manner, or you can configure that buffering if you have more information about the underlying storage), so it is better to read in chunks the hash function finds ideal, which makes it faster, or at least less CPU-intensive, to compute the hash. So instead of disabling buffering and trying to emulate it yourself, use Python's buffering and control what you should be controlling: what the consumer of your data finds ideal, the hash block size.

Solution 4

Here is a Python 3, POSIX solution (not Windows!) that uses mmap to map the file into memory.

import hashlib
import mmap

def sha256sum(filename):
    h  = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()
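
As noted in one of the comments below, passing access=mmap.ACCESS_READ instead of prot=mmap.PROT_READ makes the same approach work on Windows too (reportedly slightly slower than plain chunked reads). A sketch of that variant:

import hashlib
import mmap

def sha256sum(filename):
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        # access=mmap.ACCESS_READ is accepted on both POSIX and Windows
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            h.update(mm)
    return h.hexdigest()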

Solution 5

I have written a module which is able to hash big files with different algorithms.

pip3 install py_essentials

Use the module like this:

from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")
Comments

  • user3358300
    user3358300 almost 2 years

    I want python to read to the EOF so I can get an appropriate hash, whether it is sha1 or md5. Please help. Here is what I have so far:

    import hashlib
    
    inputFile = raw_input("Enter the name of the file:")
    openedFile = open(inputFile)
    readFile = openedFile.read()
    
    md5Hash = hashlib.md5(readFile)
    md5Hashed = md5Hash.hexdigest()
    
    sha1Hash = hashlib.sha1(readFile)
    sha1Hashed = sha1Hash.hexdigest()
    
    print "File Name: %s" % inputFile
    print "MD5: %r" % md5Hashed
    print "SHA1: %r" % sha1Hashed
    
    • isedev
      isedev about 10 years
      and what is the problem?
    • user3358300
      user3358300 about 10 years
      I want it to be able to hash a file. I need it to read until the EOF, whatever the file size may be.
    • isedev
      isedev about 10 years
      that is exactly what file.read() does - read the entire file.
    • Ignacio Vazquez-Abrams
      Ignacio Vazquez-Abrams about 10 years
      The documentation for the read() method says?
    • Sharif Mamun
      Sharif Mamun about 10 years
      You should go through "what is hashing?".
    • user3358300
      user3358300 about 10 years
      With the code I have it reads and hashes the file but I verified it and the hash given by my program is wrong. I have read on here in similar cases that it must go through a loop in order to read the whole file but I can't figure out how to make it work for my code. Take this one for example: stackoverflow.com/questions/1131220/…
    • Randall Hunt
      Randall Hunt about 10 years
      @user3358300 you may want to take a look at the code I've shown in my answer below. I think it may help.
    • user324747
      user324747 about 4 years
      How can I get the SHA256 hash of a large file in Python2 that will match the ones provided in ASC files?
    • user9811991
      user9811991 over 3 years
      SHA1 should not be used anymore because it has been proven to be possible to generate multiple files with the same SHA1 hash. SHA256 and SHA3 are considered far more secure.
    • Martin Thoma
      Martin Thoma over 2 years
      By the way: There is a command line tool called sha256sum. Just in case somebody just wants to apply it to a single file
  • Belial
    Belial almost 9 years
    @ranman Hello, I couldn't get the {0}".format(sha1.hexdigest()) part. Why do we use it instead of just using sha1.hexdigest() ?
  • Randall Hunt
    Randall Hunt over 8 years
    @Belial What wasn't working? I was mainly just using that to differentiate between the two hashes...
  • Belial
    Belial over 8 years
    @ranman Everything is working, I just never used this and haven't seen it in the literature. "{0}".format() ... unknown to me. :)
  • Martin Thoma
    Martin Thoma almost 7 years
    How should I choose BUF_SIZE?
  • TheRealFakeNews
    TheRealFakeNews over 6 years
    @ranman If you had n files, what would be the run time? I'm curious how the buffer size affects it.
  • Randall Hunt
    Randall Hunt over 6 years
    AFAIK the asymptotic (big-O style) runtime is not different for N files when using buffers vs when not using buffers. The real runtime may indeed be different though. It can take longer to allocate larger buffers, but allocating a buffer also has a fixed constant cost of asking the operating system to do something for you. You'd have to experiment to find something optimal. It might be worth it to have one thread going through and getting the file sizes and setting up an optimal buffer size map as you're iterating through your files. Beware premature optimizations though!
  • Mitar
    Mitar about 6 years
    How do you know what is an optimal block size?
  • maxschlepzig
    maxschlepzig about 6 years
    @Mitar, a lower bound is the maximum of the physical block size (traditionally 512 bytes, or 4 KiB with newer disks) and the system's page size (4 KiB on many systems; other common choices: 8 KiB and 64 KiB). Then you basically do some benchmarking and/or look at published benchmark results and related work (e.g. check what current rsync/GNU cp/... use).
  • jpmc26
    jpmc26 about 6 years
    Would resource.getpagesize be of any use here, if we wanted to try to optimize it somewhat dynamically? And what about mmap?
  • maxschlepzig
    maxschlepzig about 6 years
    @jpmc26, getpagesize() isn't that useful here - common values are 4 KiB or 8 KiB, something in that range, i.e. something much smaller than 128 KiB - 128 KiB is generally a good choice. mmap doesn't help much in our use case as we sequentially read the complete file from front to back. mmap has advantages when the access pattern is more random-access like, if pages are accessed more than once and/or if the mmap simplifies read buffer management.
  • Robert Hafner
    Robert Hafner over 5 years
    Unlike the "top voted" answer this answer actually provides the same results as the shasum function.
  • Robert Hafner
    Robert Hafner over 5 years
    This doesn't generate the same results as the shasum binaries. The other answer listed below (the one using memoryview) is compatible with other hashing tools.
  • Kyle A
    Kyle A about 5 years
    @tedivm, that's probably because this answer is using sha256 while Randall's answer uses sha1 and md5, which are the hashing algorithms specified by the OP. Try comparing their results to sha1sum and md5sum.
  • bioinfornatics
    bioinfornatics almost 5 years
    I benchmarked both the solution of (1) @Randall Hunt and (2) yours (in this order, is important due to file cache) with a file of around 116GB and sha1sum algorithm. Solution 1 was modified in order to use a buffer of 20 * 4096 (PAGE_SIZE) and set buffering parameter to 0. Solution 2 only algorithm was modified (sha256 -> sha1). Result: (1) 3m37.137s (2) 3m30.003s . The native sha1sum in binary mode: 3m31.395s
  • Murmel
    Murmel over 4 years
    @tedivm Sure? Tested it with Python2/3 and got the same results compared to sha1sum and md5sum
  • Murmel
    Murmel over 4 years
    Perfect answer, but it would be nice if you would back your statements with the related docs: Python3 - open() and Python2 - open(). Mind the difference between the two; Python3's approach is more sophisticated. Nevertheless, I really appreciated the consumer-centric perspective!
  • Murmel
    Murmel over 4 years
    You are basically doing echo $USER_INPUT | md5sum > encrypted.txt && cat encrypted.txt which does not deal with hashing of files, especially not with big ones.
  • Murmel
    Murmel over 4 years
    This might be a good solution for specific use cases (equally sized files, time to do benchmarking), but I miss a note about open() already using buffering on its own which might be the best option for a general purpose implementation. See Mitar's answer for more
  • Murmel
    Murmel over 4 years
    @RandallHunt What about using the hash's block size as buffer size, like Mitar does?
  • maxschlepzig
    maxschlepzig over 4 years
    @Murmel what do you mean with 'equally sized files'? This answer is a general purpose solution. If you call open() with buffering=0 it doesn't do any buffering. Mitar's answer implements buffer churning.
  • maxschlepzig
    maxschlepzig over 4 years
    hash.block_size is documented just as the 'internal block size of the hash algorithm'. Hashlib doesn't find it ideal. Nothing in the package documentation suggests that update() prefers hash.block_size sized input. It doesn't use less CPU if you call it like that. Your file.read() call leads to many unnecessary object creations and superfluous copies from the file buffer to your new chunk bytes object.
  • Mitar
    Mitar over 4 years
    Hashes update their state in block_size chunks. If you are not providing them in those chunks, they have to buffer and wait for enough data to appear, or split given data into chunks internally. So, you can just handle that on the outside and then you simplify what happens internally. I find this ideal. See for example: stackoverflow.com/a/51335622/252025
  • Randall Hunt
    Randall Hunt over 4 years
    The original version of this answer was written in 2014 so it's very possible there's a better way of doing things now. I'd just add that benchmarking is probably the most effective method - the open buffer size, filesystem buffer size, and algorithm buffer size are likely all different and simply reading the block size of the hashing algo may not be the most efficient method. If someone tries it all out I'm happy to update the answer.
  • maxschlepzig
    maxschlepzig over 4 years
    The block_size is much smaller than any useful read size. Also, any useful block and read sizes are powers of two. Thus, the read size is divisible by the block size for all reads except possibly the last one. For example, the sha256 block size is 64 bytes. That means that update() is able to directly process the input without any buffering up to any multiple of block_size. Thus, only if the last read isn't divisible by the block size it has to buffer up to 63 bytes, once. Hence, your last comment is incorrect and doesn't support the claims you are making in your answer.
  • Mitar
    Mitar over 4 years
    The point is that one does not have to optimize buffering because it is already done by Python when reading. So you just have to decide on the amount of looping you want to do when hashing over that existing buffer.
  • bugmenot123
    bugmenot123 over 4 years
    hashing != encrypting
  • Jonathan B.
    Jonathan B. over 3 years
    Naive question ... what is the advantage of using mmap in this scenario?
  • Antti Haapala -- Слава Україні
    Antti Haapala -- Слава Україні over 3 years
    @JonathanB. most methods needlessly create bytes objects in memory, and call read too many or too few times. This will map the file directly into the virtual memory, and hash it from there - the operating system can map the file contents directly from the buffer cache into the reading process. This means this could be faster by a significant factor over this one.
  • Antti Haapala -- Слава Україні
    Antti Haapala -- Слава Україні over 3 years
    @JonathanB. I did the test and the difference is not that significant in this case, we're talking about ~15 % over the naive method.
  • Basj
    Basj over 3 years
    Is it cross-platform (Linux + Win)? Is it working with Python3? Also is it still maintained?
  • 1cedsoda
    1cedsoda over 3 years
    Yes it is cross platform and will still work. Also the other stuff in the package works fine. But I will no longer maintain this package of personal experiments, because it was just a learning for me as a developer.
  • Seperman
    Seperman about 3 years
    I benchmarked this vs the read chunk by chunk method. This method took 3GB memory for hashing a 3GB file while maxschlepzig's answer took 12MB. They both roughly took the same amount of time on my Ubuntu box.
  • Antti Haapala -- Слава Україні
    Antti Haapala -- Слава Україні about 3 years
    @Seperman you're measuring the RAM usage incorrectly. The memory is still available, the pages are mapped from the buffer cache.
  • Seperman
    Seperman about 3 years
    @AnttiHaapala That makes sense. How do you recommend I measure the RAM usage of the process on Linux to see the mmap usage vs physical memory usage?
  • Seperman
    Seperman about 3 years
    For example when I look at Htop, these are some numbers I see: VIRT: 2884M, RES: 2122M. From my understanding RES is the physical RAM that is used.
  • Antti Haapala -- Слава Україні
    Antti Haapala -- Слава Україні about 3 years
    @Seperman well yes, that is more appropriate number than VIRT - but I added os.system("free") in between several points there and the "available" memory doesn't decrease.
  • maxschlepzig
    maxschlepzig almost 3 years
    FWIW, with Python >= 3.8 one can add mm.madvise(mmap.MADV_SEQUENTIAL) in order to reduce buffer cache pressure somewhat.
  • Gaëtan de Menten
    Gaëtan de Menten over 2 years
    This solution does not live up to a simple benchmark! On my 1Gb file, it is more than twice as slow (5.38s) as Randall Hunt's answer (2.18s), which is itself very slightly slower than maxschlepzig's answer (2.13s).
  • Gaëtan de Menten
    Gaëtan de Menten over 2 years
    FWIW, using "access=mmap.ACCESS_READ" instead of "prot=mmap.PROT_READ" makes this work on Windows (but it is slightly slower than simply reading in chunks)
  • Kelly Bundy
    Kelly Bundy about 2 years
    To clarify: The only reason you're using memoryview is the [:n], right? Btw since Python 3.8, maybe while n := f.readinto(mv): would be clearer.
  • maxschlepzig
    maxschlepzig about 2 years
    @KellyBundy - yes, to make creating an object that gives access to the first n bytes cheap. Creating a slice from bytearray b via b[:n] would yield a memory copy of n bytes. Good point regarding assignment expressions. I created the answer before they were proposed/available, but I agree that they simplify the code and thus updated my answer.
  • maxschlepzig
    maxschlepzig about 2 years
    This reads the complete file into memory for computing the hash - which is ok for very small files but quite wasteful for others. If you want to compute the hash of a 1 GB file then you need > 1 GB of RAM just for computing the hash. Of course this doesn't scale. Also, you present writing a 5-20 line helper function as a disadvantage but then post an example function that consists of 7 lines of code and occupies 24 lines in total ...
  • maxschlepzig
    maxschlepzig about 2 years
    Also, a more idiomatic way to deal with different hash types is to just call hashlib.new(hashtype) instead of getattr(hashlib, hashtype). That package function already does proper value checking (e.g. ValueError: unsupported hash type xyz) such that you don't have to re-implement it.
  • Kelly Bundy
    Kelly Bundy about 2 years
    Alternatively we could read into b directly and then h.update(b[:n] if n < N else b), so only the last chunk should get sliced. I tried that (with a 128 MB file), thinking the extra memoryview "layer" might slow it down slightly, but couldn't reliably measure any difference.
  • maxschlepzig
    maxschlepzig about 2 years
    @KellyBundy memoryview is basically just a pointer and a size integer, so it's a pretty low overhead proxy object. Depending on the source, readinto() might return fewer bytes than requested even in non-EOF situations.
  • Kelly Bundy
    Kelly Bundy about 2 years
    Yes, that's what I meant with "should". Ideally only the last, and also the few times I checked, it always was only the last.
  • Sheikh Artin
    Sheikh Artin about 2 years
    @maxschlepzig You mentioned two good things about both the performance and the hashlib.new, thanks! But do you suggest a better way to handle this situation? Any tool or function?!
  • maxschlepzig
    maxschlepzig about 2 years
    Well, I posted an answer that demonstrates how to hash large (or small) files while only using constant memory.
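
Following up on the hashlib.new() suggestion in the comments above, here is a minimal sketch of a chunked file hash that takes the algorithm name as a parameter (the function name and chunk size are just illustrative; the assignment expression requires Python 3.8+):

import hashlib

def hash_file(filename, hashtype="sha256", chunk_size=128 * 1024):
    # hashlib.new() validates the algorithm name (raises ValueError for unsupported types)
    h = hashlib.new(hashtype)
    with open(filename, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()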