How does one check huge files' identity if hashing is CPU bound?


Solution 1

My own best solution at the moment is:

parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend '' \
  -k -j …NUMofProcessesSay4… md5sum | md5sum

It should be noted that:

  1. The resulting MD5 hash isn't of the file itself, but of the MD5s of its parts; still, it lets you compare whether a replica is identical to the original.
  2. It also doesn't perform very well, especially when the input is a pipe rather than a file.
  3. As I found out, parallel's --pipepart doesn't support disk partitions.

So I'd love to hear about other workarounds as well.
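For illustration, here is a minimal single-threaded Python sketch of the same hash-of-hashes idea (block size and path are placeholders; the digest won't match the pipeline's output byte for byte, but it has the same property: identical files yield identical results):

    import hashlib
    import sys

    BLOCK_SIZE = 512 * 1024 * 1024  # 512M, mirroring --block=512M above

    def hash_of_hashes(path, block_size=BLOCK_SIZE):
        """MD5 over the hex digests of each fixed-size block."""
        outer = hashlib.md5()
        with open(path, 'rb') as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                outer.update(hashlib.md5(block).hexdigest().encode())
        return outer.hexdigest()

    if __name__ == '__main__':
        # The same digest on origin and replica means the files are identical.
        print(hash_of_hashes(sys.argv[1]))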

Solution 2

Unfortunately, MD5 is a linear process whose state depends on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hash algorithm that does not operate in this manner.

What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.

If you can't/won't do that, you'll have to use a faster hash function such as xxHash, CityHash or SpookyHash.
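As a rough sketch, assuming the third-party xxhash package is installed (pip install xxhash), such a hash can be used as a streaming replacement wherever a cryptographic digest isn't required:

    import sys
    import xxhash  # third-party package, not part of the standard library

    def xxh64_file(path, chunk_size=1 << 20):
        """Stream the file through xxHash64 in 1 MiB chunks."""
        h = xxhash.xxh64()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()

    if __name__ == '__main__':
        print(xxh64_file(sys.argv[1]))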

Another idea (maybe it is applicable to your intended usage): if you need something faster than MD5 (albeit still single-threaded), you can use CRC32 (which is hardware-accelerated by recent CPUs) for a fast first pass, resorting to MD5/SHA1 for a second pass on seemingly identical files.
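A rough sketch of that two-pass flow in Python (paths are placeholders; zlib.crc32 is a software CRC32, so this only illustrates the flow, not the hardware speed-up):

    import hashlib
    import zlib

    def crc32_file(path, chunk_size=1 << 20):
        """Cheap first-pass checksum."""
        crc = 0
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                crc = zlib.crc32(chunk, crc)
        return crc & 0xFFFFFFFF

    def md5_file(path, chunk_size=1 << 20):
        """Slower second-pass digest, only run on CRC32 matches."""
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()

    def probably_identical(a, b):
        if crc32_file(a) != crc32_file(b):
            return False  # definitely different, MD5 not needed
        return md5_file(a) == md5_file(b)  # confirm the seemingly identical pair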

Solution 3

There's pretty much no getting around processing the entire file. MD4 or CRC32 are probably your best bets for a widely deployed and fast algorithm (though CRC32 is going to be far less effective than MD4).

Testing various implementations of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely improve upon the performance of its C/C++ cousins.

If you don't really care about interoperability, hashing across multiple cores is easily doable by splitting the file into chunks (doesn't need to be done on-disk, you'd just start reading from specific offsets) and processing each chunk separately (this will result in serious disk thrashing though, degrading the performance, especially for mechanical disks). You'll end up with separate hashes for each chunk (though this has other advantages, like pointing you toward the broken chunk) but you could always hash them together for one final value.

This Gist could be a good start for something in Python.
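Not the linked Gist itself, but a rough sketch of the offset-based approach it points at (chunk size and worker count are arbitrary; each worker opens the file, seeks to its offset and hashes its chunk, and the per-chunk digests are hashed once more for a single final value):

    import hashlib
    import os
    import sys
    from multiprocessing import Pool

    CHUNK_SIZE = 512 * 1024 * 1024  # arbitrary chunk size

    def hash_chunk(args):
        """Hash one (offset, length) slice of the file."""
        path, offset, length = args
        h = hashlib.md5()
        with open(path, 'rb') as f:
            f.seek(offset)
            remaining = length
            while remaining > 0:
                block = f.read(min(1 << 20, remaining))
                if not block:
                    break
                h.update(block)
                remaining -= len(block)
        return offset, h.hexdigest()

    def parallel_hash(path, chunk_size=CHUNK_SIZE, workers=4):
        size = os.path.getsize(path)
        chunks = [(path, off, min(chunk_size, size - off))
                  for off in range(0, size, chunk_size)]
        with Pool(workers) as pool:
            per_chunk = pool.map(hash_chunk, chunks)  # per-chunk digests, in order
        outer = hashlib.md5()
        for _, digest in per_chunk:  # combine into one final value
            outer.update(digest.encode())
        return outer.hexdigest()

    if __name__ == '__main__':
        print(parallel_hash(sys.argv[1]))

As noted above, the concurrent readers will thrash a mechanical disk, so this pays off mainly on SSDs or when the file is already in the page cache.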


Comments

  • poige
    poige almost 2 years

    For small files, hashing is just fine, but with huge ones you can easily find that md5sum is CPU bound. Is there any hashing algorithm able to scale out across multiple cores? Any workarounds? Ideas? Anything? :)

    • Brian
      Brian about 8 years
      "Huge ones" is plural, so the work can be scaled out to multiple cores by hashing more than one file at a time. One way to do that in the shell is with GNU Parallel.
    • Mekong
      Mekong about 8 years
      In my experience, hashing is bound by disk I/O, at least on a desktop. Besides, a big task usually involves many files, so some of them can be hashed in parallel.
    • yagmoth555
      yagmoth555 about 8 years
      Did you try fciv.exe on Windows to see if it's better suited to a multicore CPU? (support.microsoft.com/en-ca/kb/841290)
    • poige
      poige about 8 years
      @yagmoth555, nope. I rarely use Windows, mostly I don't use it, I'd say. From the description it looks unlikely it scales on SMP.
  • poige
    poige about 8 years
    split does indeed come to one's mind, but, alas, when we talk about huge files (as we do) it's not an option.
  • Gary
    Gary about 8 years
    @poige As I said, you wouldn't do it on disk, just start hashing the file from specific offsets and stopping at the start of the next chunk.
  • poige
    poige about 8 years
    Yeah, but that's the theory which is quite obvious; anything practical?
  • Gary
    Gary about 8 years
    @poige From your question I can't guess at why that approach would be impractical. Perhaps there's a constraint you forgot to include?
  • poige
    poige about 8 years
    I didn't say it's impractical; but it's not practical because you don't have anything in your answer that can be used right away. E.g., this is a practical answer: serverfault.com/questions/488486/…
  • poige
    poige about 8 years
    The most valuable part of your answer (the rest is just boring repetition) is the list of possibly faster hashes. Thanks anyway.
  • poige
    poige about 8 years
  • shodanshok
    shodanshok about 8 years
    @poige Did you read the other comments in the thread you linked? In the context of single-input hashing, MD5 cannot be parallelized, simply because it is a linear process (i.e., the current state depends on previous inputs).
  • poige
    poige almost 8 years
    I won't consider this as correct answer. ;-P
  • Jedi
    Jedi almost 8 years
    @shodanshok the issue with the second approach is that, for a valid matching of a large directory with many large files, if all the files are identical you are adding a lot of overhead and still running md5sum on every file.
  • shodanshok
    shodanshok almost 8 years
    @Jedi sure, it is only a quick workaround for the non-parallelizable nature of MD5. If many files are identical, you may end up with a slowdown.
  • poige
    poige almost 8 years
    From the man page I'd expect -j to spawn one job per file given, not per part of a file.
  • Ole Tange
    Ole Tange almost 8 years
    Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.
  • mc0e
    mc0e almost 8 years
    I could throw in some stuff about the controversy around SHA-3 (i.e. Keccak) if you like, but perhaps you should just read about it on Wikipedia.
  • mc0e
    mc0e almost 8 years
    This should have been put as part of the question rather than as an answer.