Verifying that two files are identical using pure PHP?
Solution 1
If you already have one SHA-1 sum, you can simply do:
if ($known_sha1 == sha1_file($new_file))
Otherwise:
if (filesize($file_a) == filesize($file_b)
&& md5_file($file_a) == md5_file($file_b)
)
Checking the file size too, to further guard against a hash collision (which is already very unlikely). MD5 is used because it's significantly faster than the SHA algorithms (but a little less collision-resistant).
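A fuller sketch of the known-hash case might look like this. This is only an illustration: the payload and file path stand in for a real upload, and hash_equals() (PHP >= 5.6) is my suggestion for a constant-time string comparison rather than part of the original answer.

```php
<?php
// Hedged sketch of the known-hash case. The temp file stands in for an
// uploaded file; hash_equals() avoids timing side channels when comparing.
$new_file = tempnam(sys_get_temp_dir(), 'up');   // stands in for the upload
file_put_contents($new_file, 'example payload');
$known_sha1 = sha1('example payload');           // stands in for the stored hash

$actual = sha1_file($new_file);                  // string on success, false on failure
if ($actual !== false && hash_equals($known_sha1, $actual)) {
    echo "hash matches\n";
}
unlink($new_file);
```

Note the explicit `!== false` check: sha1_file() returns false on I/O failure, and comparing that loosely against another failed call would wrongly report a match.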
Update:
This is how to compare two files exactly, byte by byte.
function compareFiles($file_a, $file_b)
{
    if (filesize($file_a) != filesize($file_b))
        return false;

    $chunksize = 4096;
    $fp_a = fopen($file_a, 'rb');
    $fp_b = fopen($file_b, 'rb');

    while (!feof($fp_a) && !feof($fp_b))
    {
        $d_a = fread($fp_a, $chunksize);
        $d_b = fread($fp_b, $chunksize);
        if ($d_a === false || $d_b === false || $d_a !== $d_b)
        {
            fclose($fp_a);
            fclose($fp_b);
            return false;
        }
    }

    fclose($fp_a);
    fclose($fp_b);
    return true;
}
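A quick way to exercise the function is shown below. The function is repeated so the snippet runs standalone, and the temp files are placeholders constructed so that the sizes match but the last byte differs.

```php
<?php
// compareFiles() as above, repeated here so the snippet runs standalone.
function compareFiles($file_a, $file_b)
{
    if (filesize($file_a) != filesize($file_b))
        return false;
    $chunksize = 4096;
    $fp_a = fopen($file_a, 'rb');
    $fp_b = fopen($file_b, 'rb');
    while (!feof($fp_a) && !feof($fp_b)) {
        $d_a = fread($fp_a, $chunksize);
        $d_b = fread($fp_b, $chunksize);
        if ($d_a === false || $d_b === false || $d_a !== $d_b) {
            fclose($fp_a);
            fclose($fp_b);
            return false;
        }
    }
    fclose($fp_a);
    fclose($fp_b);
    return true;
}

// Demo: two files of equal size, differing only in the final byte.
$a = tempnam(sys_get_temp_dir(), 'cmp');
$b = tempnam(sys_get_temp_dir(), 'cmp');
file_put_contents($a, str_repeat('x', 10000));
file_put_contents($b, str_repeat('x', 9999) . 'y');
clearstatcache(); // make sure filesize() sees the fresh sizes

var_dump(compareFiles($a, $a)); // bool(true)
var_dump(compareFiles($a, $b)); // bool(false)

unlink($a);
unlink($b);
```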
Solution 2
Update
If you want to make sure that the files are equal, you should first check the file sizes; if they match, compare the file contents directly. This is much faster than using a hash function and will definitely give the correct result.
It is not required to load the whole file content into memory if you hash the contents using md5_file() or sha1_file() or another hash function. Here is an example using md5_file():
$hash = md5_file('big.file'); // big.file is 1GB in my test
var_dump(memory_get_peak_usage());
Output:
int(330540)
In your example it would be:
if(md5_file('FILEA') === md5_file('FILEB')) {
echo 'files are equal';
}
Note also that with any hash function you always have to trade off computational cost on the one hand against the probability of collisions (two different messages producing the same hash) on the other.
Solution 3
When your files are big and binary, you can test just a few bytes of them at a few offsets. This should be much faster than any hashing function, especially since the function returns as soon as it finds a differing byte. However, this method won't work for files that differ in only a few characters. It's best suited for big archives, videos and so on.
function areFilesEqual($filename1, $filename2, $accuracy)
{
    $filesize1 = filesize($filename1);
    $filesize2 = filesize($filename2);
    if ($filesize1 === $filesize2) {
        $file1 = fopen($filename1, 'rb');
        $file2 = fopen($filename2, 'rb');
        for ($i = 0; $i < $filesize1; $i += $accuracy) {
            fseek($file1, $i);
            fseek($file2, $i);
            if (fgetc($file1) !== fgetc($file2)) {
                fclose($file1);
                fclose($file2);
                return false;
            }
        }
        fclose($file1);
        fclose($file2);
        return true;
    }
    return false;
}
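A standalone sketch of how this sampling behaves (the function is repeated here, with fclose() added on the early-return path, so the snippet runs on its own; the 4096-byte step is an arbitrary choice). It also demonstrates the trade-off described above: a single-byte difference at an unsampled offset goes undetected.

```php
<?php
// Sampling comparison as above, repeated so the snippet runs standalone.
function areFilesEqual($filename1, $filename2, $accuracy)
{
    $filesize1 = filesize($filename1);
    $filesize2 = filesize($filename2);
    if ($filesize1 === $filesize2) {
        $file1 = fopen($filename1, 'rb');
        $file2 = fopen($filename2, 'rb');
        for ($i = 0; $i < $filesize1; $i += $accuracy) {
            fseek($file1, $i);
            fseek($file2, $i);
            if (fgetc($file1) !== fgetc($file2)) {
                fclose($file1);
                fclose($file2);
                return false;
            }
        }
        fclose($file1);
        fclose($file2);
        return true;
    }
    return false;
}

// Demo: flip one byte at offset 1, which a 4096-byte step never samples.
$a = tempnam(sys_get_temp_dir(), 's');
$b = tempnam(sys_get_temp_dir(), 's');
$data = random_bytes(65536);
file_put_contents($a, $data);
$data[1] = ($data[1] === 'x') ? 'y' : 'x'; // guaranteed single-byte change
file_put_contents($b, $data);
clearstatcache();

var_dump(areFilesEqual($a, $b, 4096)); // bool(true) -- the difference is missed

unlink($a);
unlink($b);
```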
Solution 4
Use the SHA-1 hash, just like you do. If the hashes are equal, also compare the files' MD5 hashes and sizes. If you then encounter a file that matches all three checks but is NOT equal, you just found the holy grail :D
Solution 5
So I came across this, then found an answer that really works.

2021... Things change, so I figure I will post a link to that answer here.

A) Basically it uses fopen and fread as shown above, but it works. The accepted answer was always returning "different" for me, even on the same file.

B) The fopen and fread method will be faster than the sha1 or md5 methods if you can use it, and I don't see why you couldn't.

Svish's version from the link above:
function files_are_equal($a, $b)
{
// Check if filesize is different
if(filesize($a) !== filesize($b))
return false;
// Check if content is different
$ah = fopen($a, 'rb');
$bh = fopen($b, 'rb');
$result = true;
while(!feof($ah))
{
if(fread($ah, 8192) !== fread($bh, 8192))
{
$result = false;
break;
}
}
fclose($ah);
fclose($bh);
return $result;
}
Updated on December 25, 2021

Comments
-
Mikko Rantalainen over 2 years
TL;DR: I have a CMS that stores attachments (opaque files) using the SHA-1 of the file contents as the filename. How do I verify that an uploaded file really matches the one in storage, given that I already know the SHA-1 hashes of both files match? I'd like to have high performance.

Long version:

When a user uploads a new file to the system, I compute the SHA-1 hash of the uploaded file contents and then check whether a file with an identical hash already exists in the storage backend. PHP puts the uploaded file in /tmp before my code gets to run, and I then run sha1sum against the uploaded file to get the SHA-1 hash of its contents. I then compute a fanout from the computed SHA-1 hash and decide on a storage directory under an NFS-mounted directory hierarchy. (For example, if the SHA-1 hash of a file's contents is 37aefc1e145992f2cc16fabadcfe23eede5fb094, the permanent file name is /nfs/data/files/37/ae/fc1e145992f2cc16fabadcfe23eede5fb094.) In addition to saving the actual file contents, I INSERT a new row into an SQL database for the user-submitted metadata (e.g. Content-Type, original filename, datestamp, etc.).

The corner case I'm currently figuring out is the case where a newly uploaded file has a SHA-1 hash that matches an existing hash in the storage backend. I know that the chances of this happening by accident are astronomically low, but I'd like to be sure. (For the on-purpose case, see https://shattered.io/)
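The fanout described above can be sketched like this. This is a hedged illustration: the function name is made up, and the base directory plus the 2+2 hex split simply follow the example path given.

```php
<?php
// Sketch of the fanout described above: the first two hex pairs of the
// SHA-1 digest become directory levels, the rest is the file name.
function storagePathForHash(string $sha1, string $base = '/nfs/data/files'): string
{
    return $base . '/'
        . substr($sha1, 0, 2) . '/'
        . substr($sha1, 2, 2) . '/'
        . substr($sha1, 4);
}

echo storagePathForHash('37aefc1e145992f2cc16fabadcfe23eede5fb094'), "\n";
// /nfs/data/files/37/ae/fc1e145992f2cc16fabadcfe23eede5fb094
```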
Given two filenames $file_a and $file_b, how do I quickly check whether both files have identical contents? Assume that the files are too big to be loaded into memory. With Python, I'd use filecmp.cmp(), but PHP does not seem to have anything similar. I know that this can be done with fread() and aborting when a non-matching byte is found, but I'd rather not write that code.
-
jlew over 10 years: Are you trying to hedge against hash collisions?
-
Alma Do over 10 years: Using a hash is a good idea. As you've mentioned, the probability of collision is astronomically low, so you can be sure in the common case that it will be OK. If not, let us know your case with the content of those files :p
-
Kakawait over 10 years: git is using sha1, so I think you're safe enough to use sha1 :)
-
Mikko Rantalainen over 10 years: I'm trying to avoid possibly losing the file contents because of a hash collision. And yes, if I ever see a collision, I'll keep both files. I would bet that in that case I will find that my permanent storage has bitrotted. (The chances of getting a random bit error on any storage device seem much higher than of finding a SHA-1 collision; I'd still like to have a new copy of the corrupted file in this case.)
-
Mikko Rantalainen over 10 years: @Kakawait: git also does a compare-by-bytes test before trusting that the file is identical just because the SHA-1 hash happens to match, as far as I know.
Kakawait over 10 years: Thanks @Mikko Rantalainen for this information. I didn't know.
-
invisal over 10 years: Maybe you can use another hashing function to check whether it produces the same result.
-
Mikko Rantalainen over 10 years: @hek2mgl: thanks, I didn't know that the PHP implementation was sane enough not to read the whole file into memory. I don't need to use shell_exec() and sha1sum anymore to handle big files.
hek2mgl over 10 years: Yeah, they are often forgotten :) ... Also have a look at other, possibly faster hash functions. But those would have to be called using shell_exec() again.
Mikko Rantalainen over 10 years: The difference between MD5 and SHA-1 is easily dwarfed by the IO required to actually get the bits from the storage. The permanent file storage is mounted with NFS over a 1 Gbps connection, which is obviously the bottleneck for hashing the whole file.
-
Mikko Rantalainen over 10 years: I'm already checking the file hashes (SHA-1). The corner case I'm trying to figure out is verifying that all the bytes match if the SHA-1 hashes match and the file size is identical. I know that the chances of this happening are really low, but the code required to avoid even that low chance is not that hard to write.
-
Mikko Rantalainen over 10 years: I do one SHA-1 pass already to avoid comparing against all the files in the permanent storage. Doing another hash would get me nowhere, because SHA-1 is already a pretty good hash and the only way to get obviously better results is to compare the actual bytes. Any other hash requires re-reading the whole file from storage, and at that point it makes more sense to compare bytes, because if I find a difference I can stop in the middle of the file, unlike with another hash function.
-
Cobra_Fast over 10 years: @MikkoRantalainen I've added code to my answer that compares the two files exactly.
-
Mikko Rantalainen over 10 years: You're missing two fclose() calls, and the code would look better if you returned immediately after a failed filesize() test. It's a shame that PHP does not provide such functionality by default.
Mikko Rantalainen over 10 years: I wouldn't claim that files are equal when the MD5 hashes match. I would claim that files are probably equal, which is what I can already claim when the SHA-1 hashes match.
hek2mgl over 10 years: @MikkoRantalainen If you want to make sure that they are equal, hash functions don't suit at all. Use diff... It is faster and can answer the question.
Mikko Rantalainen over 10 years: @hek2mgl hashing is very smart as a first step, because the situation is that I have 2e6 files in permanent storage and I receive a new one. I have a list of existing SHA-1 hashes, one for each stored file, so I first compute the SHA-1 of the new file. Any match with a stored SHA-1 should be considered a candidate match, not a real match.
-
Mikko Rantalainen over 10 years: Checking only a few random bytes does not give much better results than just trusting the SHA-1 sum. Otherwise, the code looks good if you want just a casual check over file contents.
-
SwR over 10 years: @Spooky: OK. The code I posted is suitable for files with few bytes.
-
Mikko Rantalainen over 10 years: The question already said "given that I already know that SHA-1 hash matches for both files", so it's a pretty safe assumption that I know how to compute the SHA-1 hash (or "checksum"). I also know that files may not be identical despite the SHA-1 hashes matching (see stackoverflow.com/questions/2479348/…).
-
hswner almost 10 years: What about memory and CPU issues? Think about running this in a loop over several thousands of files. Do you think there will be a memory overhead? We know that only two files are being processed in each iteration, and 4096 * 2 bytes will be consumed for one comparison. But what about CPU time? I tested this function in a loop for 6000 comparisons. Eight minutes after invoking the script I killed the process, because I didn't even know how much longer it would run. On the other hand, the simpler expression sha1_file($file_a) == sha1_file($file_b) performed much better.
Cobra_Fast almost 10 years: @hswner If you want to run my code for several thousand files, then PHP is probably already the wrong choice. You'd be much better off implementing it in C or C++, which will run about 40 times more CPU-efficiently (at least in my own experience).
-
hswner almost 10 years: @Cobra_Fast There's no problem with your code. In fact it's how it must be. But hey, why do you take it personally? We're discussing PHP and considering a usual case, where one might be working on a shared host with no chance to hack up some C/C++.
-
Karl Adler over 8 years: What would best practice be for, say, 500 image files with a size of 1 MB - 10 MB? SHA-1, MD5 or the direct comparison? Which performs best?
-
Collector almost 7 years: fread($fp_a, 4096) returns an empty string "" at EOF, so this loop is infinite. You should add while (!feof($fp_a) && ($b = fread($fp_a, 4096)) !== false)
-
Mikko Rantalainen about 5 years: The early-return code path is missing calls to fclose() in the above code.
Mikko Rantalainen over 3 years: @KarlAdler: if you have e.g. 500 files whose hashes or contents you don't know and you want to find duplicates, first compare the stat() results of those files. If the file sizes differ, you don't need to compare contents. If you have only 1-2 possible candidates for duplicates (that is, with identical file size), a direct file comparison using the above code is the best option. If you have more possible matches (at least 3 files with identical size), hashing first to rule out obvious non-duplicates should reduce the total I/O required. If you know that the headers of different files will differ, use the above code in all cases.
-
JSG almost 3 years: This never worked for me in 2021, but I knew the idea was correct. The answer here is a working version: stackoverflow.com/a/3060247/1642731
-
Mikko Rantalainen over 2 years: -1, this is an incorrect answer, and the accepted answer already mentions this alternative (MD5 checksum) and provides the correct answer. The library you linked to literally doesn't compare file contents but filesize + MD5 sum, yet still requires the full IO load for the file, so you get all the performance hit without a correct result. See here for an example of how easy it is to create collisions with MD5 and you'll understand why this is a bad idea: stackoverflow.com/q/933497/334451
-
Jaume Mussons Abad over 2 years: Totally right! The library code has been updated to use the correct method; all unit tests pass exactly the same as with the previous hash method, so this is better of course. A new release of the library (7.0.2) has been generated and this answer updated to point to the new code. Could you please reconsider your comment now?
-
Mikko Rantalainen over 2 years: It's nice to see that the library has been fixed! I removed the negative vote, but the comment cannot be modified anymore (comments on SO can be modified only for 5 minutes).
-
Jaume Mussons Abad over 2 years: Great, much appreciated!