Hashing a file in Python
Solution 1
TL;DR: use buffers so you don't use tons of memory.
We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don't want this bad boy to churn through 2 GB of RAM for a 2-gigabyte file so, as pasztorpisti points out, we've got to deal with those bigger files in chunks!
import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # let's read stuff in 64 KB chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))
What we've done is update our hashes of this bad boy in 64 KB chunks as we go along with hashlib's handy dandy update method. This way we use a lot less memory than the 2 GB it would take to hash the guy all at once!
You can test this with:
$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d bigfile
Hope that helps!
Also, all of this is outlined in the related question: Get MD5 hash of big files in Python
Addendum!
In general when writing Python it helps to get into the habit of following PEP 8. For example, in Python variables are typically underscore_separated, not camelCased. But that's just style and no one really cares about those things except people who have to read bad style... which might be you reading this code years from now.
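Applying that advice, the script above can be folded into a reusable snake_case function (the names `hash_file` and `buf_size` are mine, just for illustration):

```python
import hashlib


def hash_file(file_path, buf_size=65536):
    """Hash a file in buf_size chunks; returns (md5_hex, sha1_hex)."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            md5.update(data)
            sha1.update(data)
    return md5.hexdigest(), sha1.hexdigest()
```

Calling `hash_file(some_path)` then returns both hex digests as a tuple.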
Solution 2
For the correct and efficient computation of the hash value of a file (in Python 3):
- Open the file in binary mode (i.e. add 'b' to the file mode) to avoid character encoding and line-ending conversion issues.
- Don't read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
- Eliminate double buffering, i.e. don't use buffered IO, because we already use an optimal block size.
- Use readinto() to avoid buffer churning.
Example:
import hashlib

def sha256sum(filename):
    h = hashlib.sha256()
    b = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        while n := f.readinto(mv):
            h.update(mv[:n])
    return h.hexdigest()
Note that the while loop uses an assignment expression which isn't available in Python versions older than 3.8.
With older Python 3 versions you can use an equivalent variation:
import hashlib

def sha256sum(filename):
    h = hashlib.sha256()
    b = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda: f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()
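If the readinto()/memoryview combination looks opaque, here is a tiny sketch of the mechanics (my own illustration, using an in-memory io.BytesIO rather than a real file): readinto() fills an existing buffer in place and returns the byte count, so no fresh bytes object is allocated per read.

```python
import io

# Toy demonstration of readinto(): the buffer is reused across reads.
stream = io.BytesIO(b"abcdef")
buf = bytearray(4)
mv = memoryview(buf)

n = stream.readinto(mv)   # fills buf with the first 4 bytes, returns 4
first = bytes(mv[:n])     # b"abcd"
n = stream.readinto(mv)   # only 2 bytes remain, returns 2
second = bytes(mv[:n])    # b"ef"
```

Slicing the memoryview (mv[:n]) hands the hash only the freshly read bytes without copying the buffer.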
Solution 3
I would propose simply:
import hashlib

def get_digest(file_path):
    h = hashlib.sha256()
    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
All the other answers here seem to complicate things too much. Python already buffers when reading (in an ideal manner, or you can configure that buffering if you have more information about the underlying storage), so it is better to read in chunks the hash function finds ideal, which makes it faster or at least less CPU-intensive to compute the hash. So instead of disabling buffering and trying to emulate it yourself, you use Python's buffering and control what you should be controlling: what the consumer of your data finds ideal, the hash block size.
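For context on the chunk size used above: hash.block_size is the internal block size of the algorithm in bytes, and it is quite small. A quick check (my own, not part of the answer):

```python
import hashlib

# hash.block_size is the algorithm's internal block size, in bytes.
print(hashlib.md5().block_size)     # 64
print(hashlib.sha256().block_size)  # 64
print(hashlib.sha512().block_size)  # 128
```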
Solution 4
Here is a Python 3, POSIX solution (not Windows!) that uses mmap to map the file into memory.
import hashlib
import mmap

def sha256sum(filename):
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()
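A comment below notes that passing access=mmap.ACCESS_READ instead of prot=mmap.PROT_READ makes this approach work on Windows as well; a sketch of that portable variant (the function name is mine):

```python
import hashlib
import mmap

def sha256sum_portable(filename):
    # ACCESS_READ works on both POSIX and Windows, unlike prot=PROT_READ.
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            h.update(mm)
    return h.hexdigest()
```

Note that mmap.mmap raises ValueError for an empty file, so this variant (like the original) assumes a non-empty input.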
Solution 5
I have written a module which is able to hash big files with different algorithms.
pip3 install py_essentials
Use the module like this:
from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")
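If you'd rather not pull in a third-party package, the same idea fits in a few lines of standard-library code using hashlib.new, which accepts the algorithm name as a string (this sketch and the name file_checksum are mine, not part of py_essentials):

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    # hashlib.new validates the algorithm name for us
    # (raising ValueError for unsupported hash types).
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

The walrus operator in the loop requires Python 3.8+.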
user3358300
Updated on July 08, 2022

Comments
-
user3358300 almost 2 years
I want python to read to the EOF so I can get an appropriate hash, whether it is sha1 or md5. Please help. Here is what I have so far:
import hashlib

inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()

md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()

sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()

print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed
-
isedev about 10 years: and what is the problem?
-
user3358300 about 10 years: I want it to be able to hash a file. I need it to read until the EOF, whatever the file size may be.
-
isedev about 10 years: that is exactly what file.read() does - read the entire file.
-
Ignacio Vazquez-Abrams about 10 years: The documentation for the read() method says?
-
Sharif Mamun about 10 years: You should go through "what is hashing?".
-
user3358300 about 10 years: With the code I have it reads and hashes the file but I verified it and the hash given by my program is wrong. I have read on here in similar cases that it must go through a loop in order to read the whole file but I can't figure out how to make it work for my code. Take this one for example: stackoverflow.com/questions/1131220/…
-
Randall Hunt about 10 years: @user3358300 you may want to take a look at the code I've shown in my answer below. I think it may help.
-
user324747 about 4 years: How can I get the SHA256 hash of a large file in Python2 that will match the ones provided in ASC files?
-
user324747 about 4 years
-
user9811991 over 3 years: SHA1 should not be used anymore because it has been proven to be possible to generate multiple files with the same SHA1 hash. SHA256 and SHA3 are considered far more secure.
-
Martin Thoma over 2 years: By the way: there is a command line tool called sha256sum, just in case somebody just wants to apply it to a single file
-
-
Belial almost 9 years: @ranman Hello, I couldn't get the "{0}".format(sha1.hexdigest()) part. Why do we use it instead of just using sha1.hexdigest()?
-
Randall Hunt over 8 years: @Belial What wasn't working? I was mainly just using that to differentiate between the two hashes...
-
Belial over 8 years: @ranman Everything is working, I just never used this and haven't seen it in the literature. "{0}".format() ... unknown to me. :)
-
Martin Thoma almost 7 years: How should I choose BUF_SIZE?
-
TheRealFakeNews over 6 years: @ranman If you had n files, what would be the run time? I'm curious how the buffer size affects it.
-
Randall Hunt over 6 years: AFAIK the asymptotic (like BigO style) runtime is not different for N files when using buffers vs when not using buffers. The real runtime may indeed be different though. It can take longer to allocate larger buffers but allocating a buffer also has a fixed constant cost of asking the operating system to do something for you. You'd have to experiment to find something optimal. It might be worth it to have one thread going through and getting the file sizes and setting up an optimal buffer size map as you're iterating through your files. Beware premature optimizations though!
-
Mitar about 6 years: How do you know what is an optimal block size?
-
maxschlepzig about 6 years: @Mitar, a lower bound is the maximum of the physical block size (traditionally 512 bytes, or 4 KiB with newer disks) and the system's page size (4 KiB on many systems; other common choices: 8 KiB and 64 KiB). Then you basically do some benchmarking and/or look at published benchmark results and related work (e.g. check what current rsync/GNU cp/... use).
-
jpmc26 about 6 years: Would resource.getpagesize be of any use here, if we wanted to try to optimize it somewhat dynamically? And what about mmap?
-
maxschlepzig about 6 years: @jpmc26, getpagesize() isn't that useful here - common values are 4 KiB or 8 KiB, something in that range, i.e. something much smaller than 128 KiB - 128 KiB is generally a good choice. mmap doesn't help much in our use case as we sequentially read the complete file from front to back. mmap has advantages when the access pattern is more random-access like, if pages are accessed more than once and/or if the mmap simplifies read buffer management.
-
Robert Hafner over 5 years: Unlike the "top voted" answer this answer actually provides the same results as the shasum function.
-
Robert Hafner over 5 years: This doesn't generate the same results as the shasum binaries. The other answer listed below (the one using memoryview) is compatible with other hashing tools.
-
Kyle A about 5 years: @tedivm, that's probably because this answer is using sha256 while Randall's answer uses sha1 and md5, which are the hashing algorithms specified by the OP. Try comparing their results to sha1sum and md5sum.
-
bioinfornatics almost 5 years: I benchmarked both the solution of (1) @Randall Hunt and (2) yours (in this order, which is important due to the file cache) with a file of around 116 GB and the sha1sum algorithm. Solution 1 was modified to use a buffer of 20 * 4096 (PAGE_SIZE) and set the buffering parameter to 0. In solution 2 only the algorithm was modified (sha256 -> sha1). Result: (1) 3m37.137s (2) 3m30.003s. The native sha1sum in binary mode: 3m31.395s
-
Murmel over 4 years: @tedivm Sure? Tested it with Python2/3 and got the same results compared to sha1sum and md5sum
-
Murmel over 4 years: Perfect answer, but it would be nice if you would back your statements with the related docs: Python3 - open() and Python2 - open(). Even mind the diff between both; Python3's approach is more sophisticated. Nevertheless, I really appreciated the consumer-centric perspective!
-
Murmel over 4 years: You are basically doing echo $USER_INPUT | md5sum > encrypted.txt && cat encrypted.txt which does not deal with hashing of files, especially not with big ones.
-
Murmel over 4 years: This might be a good solution for specific use cases (equally sized files, time to do benchmarking), but I miss a note about open() already using buffering on its own, which might be the best option for a general purpose implementation. See Mitar's answer for more
-
Murmel over 4 years: @RandallHunt What about using the hash's block size as buffer size, like Mitar does?
-
maxschlepzig over 4 years: @Murmel what do you mean by 'equally sized files'? This answer is a general purpose solution. If you call open() with buffering=0 it doesn't do any buffering. Mitar's answer implements buffer churning.
-
maxschlepzig over 4 years: hash.block_size is documented just as the 'internal block size of the hash algorithm'. Hashlib doesn't find it ideal. Nothing in the package documentation suggests that update() prefers hash.block_size sized input. It doesn't use less CPU if you call it like that. Your file.read() call leads to many unnecessary object creations and superfluous copies from the file buffer to your new chunk bytes object.
-
Mitar over 4 years: Hashes update their state in block_size chunks. If you are not providing them in those chunks, they have to buffer and wait for enough data to appear, or split given data into chunks internally. So, you can just handle that on the outside and then you simplify what happens internally. I find this ideal. See for example: stackoverflow.com/a/51335622/252025
-
Randall Hunt over 4 years: The original version of this answer was written in 2014 so it's very possible there's a better way of doing things now. I'd just add that benchmarking is probably the most effective method - the open buffer size, filesystem buffer size, and algorithm buffer size are likely all different and simply reading the block size of the hashing algo may not be the most efficient method. If someone tries it all out I'm happy to update the answer.
-
maxschlepzig over 4 years: The block_size is much smaller than any useful read size. Also, any useful block and read sizes are powers of two. Thus, the read size is divisible by the block size for all reads except possibly the last one. For example, the sha256 block size is 64 bytes. That means that update() is able to directly process the input without any buffering up to any multiple of block_size. Thus, only if the last read isn't divisible by the block size does it have to buffer up to 63 bytes, once. Hence, your last comment is incorrect and doesn't support the claims you are making in your answer.
-
Mitar over 4 years: The point is that one does not have to optimize buffering because it is already done by Python when reading. So you just have to decide on the amount of looping you want to do when hashing over that existing buffer.
-
bugmenot123 over 4 years: hashing != encrypting
-
Jonathan B. over 3 years: Naive question ... what is the advantage of using mmap in this scenario?
-
Antti Haapala -- Слава Україні over 3 years: @JonathanB. most methods needlessly create bytes objects in memory, and call read too many or too few times. This will map the file directly into the virtual memory, and hash it from there - the operating system can map the file contents directly from the buffer cache into the reading process. This means this could be faster by a significant factor over this one
-
Antti Haapala -- Слава Україні over 3 years: @JonathanB. I did the test and the difference is not that significant in this case, we're talking about ~15 % over the naive method.
-
Basj over 3 years: Is it cross-platform (Linux + Win)? Is it working with Python3? Also, is it still maintained?
-
1cedsoda over 3 years: Yes it is cross platform and will still work. Also the other stuff in the package works fine. But I will no longer maintain this package of personal experiments, because it was just a learning exercise for me as a developer.
-
user9811991 over 3 years: SHA1 should not be used anymore because it has been proven to be possible to generate multiple files with the same SHA1 hash. SHA256 and SHA3 are considered far more secure.
-
Seperman about 3 years: I benchmarked this vs the read-chunk-by-chunk method. This method took 3 GB of memory for hashing a 3 GB file while maxschlepzig's answer took 12 MB. They both roughly took the same amount of time on my Ubuntu box.
-
Antti Haapala -- Слава Україні about 3 years: @Seperman you're measuring the RAM usage incorrectly. The memory is still available, the pages are mapped from the buffer cache.
-
Seperman about 3 years: @AnttiHaapala That makes sense. How do you recommend I measure the RAM usage of the process on Linux to see the mmap usage vs physical memory usage?
-
Seperman about 3 years: For example, when I look at htop, these are some numbers I see: VIRT: 2884M, RES: 2122M. From my understanding RES is the physical RAM that is used.
-
Antti Haapala -- Слава Україні about 3 years: @Seperman well yes, that is a more appropriate number than VIRT - but I added os.system("free") in between several points there and the "available" memory doesn't decrease.
-
maxschlepzig almost 3 years: FWIW, with Python >= 3.8 one can add mm.madvise(mmap.MADV_SEQUENTIAL) in order to reduce buffer cache pressure somewhat.
-
Gaëtan de Menten over 2 years: This solution does not live up to a simple benchmark! On my 1 GB file, it is more than twice as slow (5.38s) as Randall Hunt's answer (2.18s), which is itself very slightly slower than maxschlepzig's answer (2.13s).
-
Gaëtan de Menten over 2 years: FWIW, using "access=mmap.ACCESS_READ" instead of "prot=mmap.PROT_READ" makes this work on Windows (but it is slightly slower than simply reading in chunks)
-
Kelly Bundy about 2 years: To clarify: The only reason you're using memoryview is the [:n], right? Btw since Python 3.8, maybe while n := f.readinto(mv): would be clearer.
-
maxschlepzig about 2 years: @KellyBundy - yes, to make creating an object that gives access to the first n bytes cheap. Creating a slice from bytearray b via b[:n] would yield a memory copy of n bytes. Good point regarding assignment expressions. I created the answer before they were proposed/available, but I agree that they simplify the code and thus updated my answer.
-
maxschlepzig about 2 years: This reads the complete file into memory for computing the hash - which is ok for very small files but quite wasteful for others. If you want to compute the hash of a 1 GB file then you need > 1 GB of RAM just for computing the hash. Of course this doesn't scale. Also, you present writing a 5-20 line helper function as a disadvantage but then post an example function that consists of 7 lines of code and occupies 24 lines in total ...
-
maxschlepzig about 2 years: Also, a more idiomatic way to deal with different hash types is to just call hashlib.new(hashtype) instead of getattr(hashlib, hashtype). That package function already does proper value checking (e.g. ValueError: unsupported hash type xyz) such that you don't have to re-implement it.
-
Kelly Bundy about 2 years: Alternatively we could read into b directly and then h.update(b[:n] if n < N else b), so only the last chunk should get sliced. I tried that (with a 128 MB file), thinking the extra memoryview "layer" might slow it down slightly, but couldn't reliably measure any difference.
-
maxschlepzig about 2 years: @KellyBundy memoryview is basically just a pointer and a size integer, so it's a pretty low overhead proxy object. Depending on the source, readinto() might return fewer bytes than requested even in non-EOF situations.
-
Kelly Bundy about 2 years: Yes, that's what I meant with "should". Ideally only the last, and also the few times I checked, it always was only the last.
-
Sheikh Artin about 2 years: @maxschlepzig You mentioned two good things about both the performance and hashlib.new, thanks! But do you suggest a better way to handle this situation? Any tool or function?!
-
maxschlepzig about 2 years: Well, I posted an answer that demonstrates how to hash large (or small) files while only using constant memory.