Verifying that two files are identical using pure PHP?

Solution 1

If you already have one SHA1 sum, you can simply do:

if ($known_sha1 === sha1_file($new_file))

otherwise

if (filesize($file_a) === filesize($file_b)
    && md5_file($file_a) === md5_file($file_b)
)

The file size is checked too, to further guard against a hash collision (which is already very unlikely). MD5 is used because it's significantly faster than the SHA algorithms (at the cost of slightly weaker collision resistance).


Update:

This is how to compare two files exactly, byte by byte.

function compareFiles($file_a, $file_b)
{
    // Different sizes mean the contents cannot match.
    if (filesize($file_a) != filesize($file_b))
        return false;

    $chunksize = 4096;
    $fp_a = fopen($file_a, 'rb');
    $fp_b = fopen($file_b, 'rb');

    while (!feof($fp_a) && !feof($fp_b))
    {
        $d_a = fread($fp_a, $chunksize);
        $d_b = fread($fp_b, $chunksize);
        if ($d_a === false || $d_b === false || $d_a !== $d_b)
        {
            fclose($fp_a);
            fclose($fp_b);
            return false;
        }
    }

    fclose($fp_a);
    fclose($fp_b);

    return true;
}
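
Hypothetical usage (the file paths below are placeholders, not part of the original answer):

// true only when both files have identical size and content
var_dump(compareFiles('/tmp/php_upload_tmp', '/nfs/data/files/37/ae/fc1e145992f2cc16fabadcfe23eede5fb094'));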

Solution 2

Update

If you want to make sure that files are equal, you should first check the file sizes; if they match, just diff the file content. This is much faster than using a hash function and will definitely give the correct result.
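
A minimal sketch of that approach, delegating the byte comparison to cmp(1); this assumes a Unix-like host where calling external programs is allowed, and the function name is made up for illustration:

// Sketch only: requires cmp(1) on the host system.
function filesIdentical($file_a, $file_b)
{
    if (filesize($file_a) !== filesize($file_b))
        return false;

    $cmd = sprintf('cmp --silent %s %s',
        escapeshellarg($file_a), escapeshellarg($file_b));
    exec($cmd, $output, $status);

    return $status === 0; // cmp exits with 0 when the files are identical
}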


It is not required to load the whole file content into memory if you hash the contents using md5_file(), sha1_file() or another hash function. Here is an example using MD5:

$hash = md5_file('big.file'); // big.file is 1 GB in my test
var_dump(memory_get_peak_usage());

Output:

int(330540)

In your example it would be:

if(md5_file('FILEA') === md5_file('FILEB')) {
    echo 'files are equal';
}

Further note: when you use a hash function, you'll always have to decide between complexity on the one hand and the probability of collisions (two different messages producing the same hash) on the other.
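
For illustration, PHP's hash_file() makes that trade-off explicit by letting you pick the algorithm; 'md5' and 'sha256' below are standard PHP hash identifiers:

// Faster, but weaker collision resistance:
$fast = hash_file('md5', 'FILEA');
// Slower, but collisions are practically impossible:
$safe = hash_file('sha256', 'FILEA');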

Solution 3

When your files are big and binary, you can just test a few bytes at a few offsets. This should be much faster than any hashing function, especially since the function returns as soon as it finds the first differing byte.

However, this method won't work for files that differ in only a few characters. It is best suited for big archives, videos and so on.

function areFilesEqual($filename1, $filename2, $accuracy)
{

    $filesize1 = filesize($filename1);
    $filesize2 = filesize($filename2);

    if ($filesize1 === $filesize2) {

        $file1 = fopen($filename1, 'rb');
        $file2 = fopen($filename2, 'rb');

        // Sample one byte out of every $accuracy bytes.
        for ($i = 0; $i < $filesize1; $i += $accuracy) {
            fseek($file1, $i);
            fseek($file2, $i);
            if (fgetc($file1) !== fgetc($file2)) {
                fclose($file1);
                fclose($file2);
                return false;
            }
        }

        fclose($file1);
        fclose($file2);

        return true;
    }

    return false;
}
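
Hypothetical usage (file names are placeholders); the third argument is the sampling step in bytes, so smaller values inspect more of the file but run slower:

// Compare one byte out of every 1024 bytes of two large archives.
if (areFilesEqual('backup_a.tar', 'backup_b.tar', 1024)) {
    echo 'files look equal at this sampling accuracy';
}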

Solution 4

Use the SHA-1 hash, just like you do. If they are equal, compare their MD5 hashes and file sizes as well. If you THEN encounter a file that matches in all 3 checks but is NOT equal - you just found the holy grail :D
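
A minimal sketch of that triple check (the function name is made up for illustration):

function probablyIdentical($file_a, $file_b)
{
    // Cheapest test first, then two independent hashes.
    return filesize($file_a) === filesize($file_b)
        && sha1_file($file_a) === sha1_file($file_b)
        && md5_file($file_a) === md5_file($file_b);
}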

Solution 5

So I came across this, then found an answer that actually works.

2021... Things change, so I figure I will post a link to that answer here.

A) Basically it uses fopen and fread as shown above, but it works. The accepted answer was always returning different results for me, even on the same file.

B) The fopen and fread method will be faster than the sha1 or md5 methods if you can use it, and I don't see why you couldn't.

Svish's version from the link above:

function files_are_equal($a, $b)
{
  // Check if filesize is different
  if(filesize($a) !== filesize($b))
      return false;

  // Check if content is different
  $ah = fopen($a, 'rb');
  $bh = fopen($b, 'rb');

  $result = true;
  while(!feof($ah))
  {
    if(fread($ah, 8192) !== fread($bh, 8192))
    {
      $result = false;
      break;
    }
  }

  fclose($ah);
  fclose($bh);

  return $result;
}
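
Hypothetical usage (paths are placeholders):

var_dump(files_are_equal('a.bin', 'b.bin')); // bool(true) when the contents match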

Comments

  • Mikko Rantalainen
    Mikko Rantalainen over 2 years

    TL;DR: I have a CMS that stores attachments (opaque files) using the SHA-1 of the file contents as the filename. How do I verify that an uploaded file really matches one in the storage, given that I already know that the SHA-1 hash matches for both files? I'd like to have high performance.

    Long version:

    When a user uploads a new file to the system, I compute the SHA-1 hash of the uploaded file contents and then check if a file with an identical hash already exists in the storage backend. PHP puts the uploaded file in /tmp before my code gets to run, and then I run sha1sum against the uploaded file to get the SHA-1 hash of the file contents. I then compute a fanout from the computed SHA-1 hash and decide the storage directory under an NFS-mounted directory hierarchy (sketched in code below). (For example, if the SHA-1 hash of a file's contents is 37aefc1e145992f2cc16fabadcfe23eede5fb094, the permanent file name is /nfs/data/files/37/ae/fc1e145992f2cc16fabadcfe23eede5fb094.) In addition to saving the actual file contents, I INSERT a new row into a SQL database for the user-submitted metadata (e.g. Content-Type, original filename, datestamp, etc.).

    The corner case I'm currently figuring out is the case where a newly uploaded file has a SHA-1 hash that matches an existing hash in the storage backend. I know that the chances of this happening by accident are astronomically low, but I'd like to be sure. (For the on-purpose case, see https://shattered.io/.)

    Given two filenames $file_a and $file_b, how do I quickly check if both files have identical contents? Assume that the files are too big to be loaded into memory. With Python, I'd use filecmp.cmp(), but PHP does not seem to have anything similar. I know that this can be done with fread() and aborting if a non-matching byte is found, but I'd rather not write that code.
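
    A sketch of the fanout scheme described above, reproducing the example path from this comment (the function name is made up for illustration):

    // Map a SHA-1 hex digest to its permanent storage path.
    function storagePathForHash($sha1)
    {
        return sprintf('/nfs/data/files/%s/%s/%s',
            substr($sha1, 0, 2), substr($sha1, 2, 2), substr($sha1, 4));
    }

    // storagePathForHash('37aefc1e145992f2cc16fabadcfe23eede5fb094')
    // => '/nfs/data/files/37/ae/fc1e145992f2cc16fabadcfe23eede5fb094'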

    • jlew
      jlew over 10 years
      Are you trying to hedge against hash collisions?
    • Alma Do
      Alma Do over 10 years
      Using hash is a good idea. As you've mentioned, probability of collision is astronomically low - so you can be sure in common case, that it will be ok. If not - let us know your case with content of those files :p
    • Kakawait
      Kakawait over 10 years
      git is using sha1 so I think you're safe enough to use sha1 :)
    • Mikko Rantalainen
      Mikko Rantalainen over 10 years
      I'm trying to avoid possibly losing the file contents because of a hash collision. And yes, if I ever see a collision, I'll keep both files. I would bet that in that case I will find that my permanent storage has bitrotted. (The chances of getting a random bit error on any storage device seem much higher than finding a SHA-1 collision; I'd still like to have a new copy of the corrupted file in this case.)
    • Mikko Rantalainen
      Mikko Rantalainen over 10 years
      @Kakawait: git also does compare-by-bytes test before trusting that the file is identical just because SHA-1 hash happens to match, as far as I know.
    • Kakawait
      Kakawait over 10 years
      Thanks @Mikko Rantalainen for this information. I didn't know
    • invisal
      invisal over 10 years
      Maybe you can use another hashing function to compare if the other one produce the same result.
  • Mikko Rantalainen
    Mikko Rantalainen over 10 years
    @hek2mgl: thanks, I didn't know that the PHP implementation was sane enough to not read the whole file into memory. I don't need to use shell_exec() and sha1sum anymore to handle big files.
  • hek2mgl
    hek2mgl over 10 years
    Yeah, they are often forgotten :) .. Also have a look at other, maybe faster, hash functions. But these have to be called using shell_exec() again
  • Mikko Rantalainen
    Mikko Rantalainen over 10 years
    The difference between MD5 and SHA-1 is easily dwarfed by the IO required to actually get the bits from the storage. The permanent file storage is mounted with NFS using 1Gbps connection, which is obviously the bottleneck for hashing the whole file.
  • Mikko Rantalainen
    Mikko Rantalainen over 10 years
    I'm already checking the file hashes (SHA-1). The corner case I'm trying to figure out is verifying that all the bytes match if the SHA-1 hashes match and the file size is identical. I know that the chances of this happening are really low, but the code required to avoid even that low chance is not that hard to write.
  • Mikko Rantalainen
    Mikko Rantalainen over 10 years
    I do one SHA-1 already to avoid comparing against all the files in the permanent storage. Doing another hash would get me nowhere because SHA-1 is already a pretty good hash, and the only way to get obviously better results is to compare the actual bytes. Any other hash requires re-reading the whole file from the storage, and at that point it makes more sense to compare bytes, because if I find a difference I can stop in the middle of the file, unlike with another hash function.
  • Cobra_Fast
    Cobra_Fast over 10 years
    @MikkoRantalainen I've added code to my answer that exactly compares the two files.
  • Mikko Rantalainen
    Mikko Rantalainen over 10 years
    You're missing two fclose() calls, and the code would look better if you returned immediately after a failed filesize() test. It's a shame that PHP does not provide such functionality by default.
  • Mikko Rantalainen
    Mikko Rantalainen over 10 years
    I wouldn't claim that files are equal in case md5 hash matches. I would claim that files are probably equal which is the case I already can claim when SHA-1 hashes match.
  • hek2mgl
    hek2mgl over 10 years
    @MikkoRantalainen If you want to make sure that they are equal, hash functions don't suit the task at all. Use diff; it is faster and can answer the question.
  • Mikko Rantalainen
    Mikko Rantalainen over 10 years
    @hek2mgl hashing is very smart as a first step because the situation is that I have 2e6 files in permanent storage and I receive a new one. I have a list of existing SHA-1 hashes for each stored file, so I first compute the SHA-1 for the new file. Any match with a stored SHA-1 should be considered a candidate match, not a real match.
  • Mikko Rantalainen
    Mikko Rantalainen over 10 years
    Checking only a few random bytes does not give much better results than just trusting the SHA-1 sum. Otherwise, the code looks good if you just want a casual check of the file contents.
  • SwR
    SwR over 10 years
    @Spooky: OK. The code I posted is suitable for files with only a few bytes.
  • Mikko Rantalainen
    Mikko Rantalainen over 10 years
    The question already said "given that I already know that SHA-1 hash matches for both files" so it's pretty much safe assumption that I know how to compute the SHA-1 hash (or "checksum"). I also know that files may not be identical despite the fact that SHA-1 hash matches (see stackoverflow.com/questions/2479348/…).
  • hswner
    hswner almost 10 years
    what about memory and CPU issues? Think about running this in a loop over several thousand files. Do you think there will be a memory overhead? We know that only two files are being processed on each iteration, and 4096 * 2 bytes will be consumed per comparison. But what about CPU time? I tested this function in a loop of 6000 comparisons. Eight minutes after invoking the script I killed the process, because I didn't even know how much longer it would run. On the other hand, the simpler expression sha1_file($file_a) == sha1_file($file_b) performed much better.
  • Cobra_Fast
    Cobra_Fast almost 10 years
    @hswner If you want to run my code on several thousand files, then PHP probably is already the wrong choice. You'd be much better off implementing it in C or C++, which will be about 40 times more CPU-efficient (at least in my experience).
  • hswner
    hswner almost 10 years
    @Cobra_Fast There's no problem with your code. In fact, it's how it must be. But hey, why do you take it personally? We're discussing PHP and considering the usual case, where one might be working on a shared host with no chance to hack up some C/C++.
  • Karl Adler
    Karl Adler over 8 years
    What would be best practice for 500 image files with sizes of 1 MB - 10 MB? SHA1, MD5 or the direct compare? Which performs best?
  • Collector
    Collector almost 7 years
    fread($fp_a, 4096) returns an empty string "" at EOF, so this loop is infinite. You should use while (!feof($fp_a) && ($b = fread($fp_a, 4096)) !== false)
  • Mikko Rantalainen
    Mikko Rantalainen about 5 years
    The early return code path is missing calls to fclose() in the above code.
  • Mikko Rantalainen
    Mikko Rantalainen over 3 years
    @KarlAdler: if you have e.g. 500 files whose hashes or contents you don't know and you want to find duplicates, first compare the stat()s of those files. If the file sizes differ, you don't need to compare contents. If you have only 1-2 possible candidates for duplicates (that is, files with identical size), doing a direct file compare using the above code is the best option. If you have more possible matches (at least 3 files with identical size), hashing first to rule out obvious non-duplicates should reduce the total I/O required. If you know that the headers of different files will differ, use the above code in all cases.
  • JSG
    JSG almost 3 years
    This never worked for me in 2021 but I knew the idea was correct. The answer here is a working version.... stackoverflow.com/a/3060247/1642731
  • Mikko Rantalainen
    Mikko Rantalainen over 2 years
    -1 this is an incorrect answer, and the accepted answer already mentions this alternative (MD5 checksum) and provides the correct answer. The library you linked to literally doesn't compare file contents, only filesize + MD5 sum, yet still requires full I/O for the file, so you get all the performance hit without a correct result. See here for an example of how easy it is to create collisions with MD5 and you'll understand why this is a bad idea: stackoverflow.com/q/933497/334451
  • Jaume Mussons Abad
    Jaume Mussons Abad over 2 years
    Totally right! The library code has been updated to use the correct method; all unit tests pass exactly the same as with the previous hash method, so this is better of course. A new release of the library (7.0.2) has been generated and this answer updated to point to the new code. Could you please reconsider your comment now?
  • Mikko Rantalainen
    Mikko Rantalainen over 2 years
    It's nice to see that the library has been fixed! I removed the negative vote but the comment cannot be modified anymore (comments in SO can be modified only for 5 minutes).
  • Jaume Mussons Abad
    Jaume Mussons Abad over 2 years
    Great, much appreciated!