Comparison of two pdf files

10,496

Solution 1

If you prefer a tool with a GUI, you could try this one: diffpdf. It's by Mark Summerfield, and since it's written with Qt, it should be available (or at least buildable) on every platform that Qt runs on.


Solution 2

You can do the same thing with a shell script on Linux. The script wraps 3 components:

  1. ImageMagick's compare command
  2. the pdftk utility
  3. Ghostscript

It's fairly easy to translate this into a .bat batch file for DOS/Windows.

Here are the building blocks:

pdftk

Use this command to split each multi-page PDF file into single-page PDFs:

pdftk  first.pdf  burst  output  somewhere/firstpdf_page_%03d.pdf
pdftk  2nd.pdf    burst  output  somewhere/2ndpdf_page_%03d.pdf

compare

Use this command to create a "diff" PDF page for each of the pages:

compare \
       -verbose \
       -debug coder -log "%u %m:%l %e" \
        somewhere/firstpdf_page_001.pdf \
        somewhere/2ndpdf_page_001.pdf \
       -compose src \
        somewhereelse/diff_page_001.pdf

Note that compare is part of ImageMagick. ImageMagick cannot process PDF natively, though; it needs Ghostscript as a 'delegate' to render the pages.

Once more, pdftk

Now you can concatenate your "diff" PDF pages into a single file, again with pdftk:

pdftk \
      somewhereelse/diff_page_*.pdf \
      cat \
      output somewhereelse/diff_allpages.pdf
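
For reference, the building blocks above could be wired together roughly like this. It is only a minimal sketch, not a tested tool: the function name pdf_visual_diff and the a/, b/ and diff/ directory layout are made up for illustration, and it assumes pdftk and ImageMagick's compare (with Ghostscript as its delegate) are on the PATH. It treats a compare exit status of 0 or 1 as "comparison finished" and anything higher as an error, which is how ImageMagick documents compare, but check this against your version.

```shell
# Sketch: page-by-page visual diff of two PDFs.
# pdf_visual_diff is a hypothetical helper name, not part of any tool.
# Arguments: first PDF, second PDF, working directory.
pdf_visual_diff() {
    first=$1
    second=$2
    work=$3
    mkdir -p "$work/a" "$work/b" "$work/diff" || return 2

    # 1. Split both inputs into single-page PDFs.
    pdftk "$first"  burst output "$work/a/page_%03d.pdf" || return 2
    pdftk "$second" burst output "$work/b/page_%03d.pdf" || return 2

    # 2. Create one "diff" page for every page pair with the same number.
    for a in "$work"/a/page_*.pdf; do
        [ -e "$a" ] || continue
        name=${a##*/}
        b=$work/b/$name
        if [ ! -e "$b" ]; then
            echo "no counterpart for $name in $second" >&2
            continue
        fi
        status=0
        compare "$a" "$b" -compose src "$work/diff/$name" || status=$?
        # Assumption: status 1 only means "pages differ"; above 1 is an error.
        [ "$status" -le 1 ] || return 2
    done

    # 3. Join all diff pages; the zero-padded names keep them in page order.
    pdftk "$work"/diff/page_*.pdf cat output "$work/diff_allpages.pdf"
}
```

Called as pdf_visual_diff first.pdf 2nd.pdf somewhere, this would leave the combined result in somewhere/diff_allpages.pdf. Pages that exist only in the second file are not reported by this sketch.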

Ghostscript

Ghostscript automatically inserts metadata (such as the current date and time) into its PDF output. Its PDF output is therefore not well suited to MD5-hash-based file comparisons, because the hash changes from run to run even when the page content is identical.

If you want to automatically discover all pages that are purely white (meaning there are no visible differences between your input pages), you can instead convert to a metadata-free bitmap format using the bmp256 output device. You can do that for the original PDFs (first.pdf and 2nd.pdf), or for the diff PDF pages:

 gs \
   -o diff_page_001.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
    diff_page_001.pdf

 md5sum diff_page_001.bmp

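A diff page whose MD5 sum matches that of an all-white page contains no visible differences. Create the all-white reference BMP page (with the same resolution and page size) and compute its MD5 sum like this: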

 gs \
   -o reference-white-page.bmp \
   -r72 \
   -g595x842 \
   -sDEVICE=bmp256 \
   -c "showpage quit"

 md5sum reference-white-page.bmp
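
To check many pages at once, the hash comparison could be looped over a directory of rendered pages. This is a rough sketch under a few assumptions: the helper names render_bmp, md5_of and list_changed_pages are invented here, md5sum prints the hash as the first space-separated field, and every BMP was rendered with the same -r and -g settings as the reference page.

```shell
# Render one PDF page to an 8-bit BMP, using the same settings as above.
# Arguments: input PDF, output BMP. (Hypothetical helper name.)
render_bmp() {
    gs -o "$2" -r72 -g595x842 -sDEVICE=bmp256 "$1" >/dev/null
}

# Print only the MD5 hash of a file (the first field of md5sum's output).
md5_of() {
    md5sum "$1" | cut -d' ' -f1
}

# Print the name of every BMP in a directory whose hash differs from the
# reference white page, i.e. every page that has visible differences.
# Arguments: reference BMP, directory containing the rendered BMPs.
list_changed_pages() {
    ref_hash=$(md5_of "$1")
    for bmp in "$2"/*.bmp; do
        [ -e "$bmp" ] || continue
        [ "$(md5_of "$bmp")" = "$ref_hash" ] || echo "${bmp##*/}"
    done
}
```

For example, after rendering each diff page with render_bmp, running list_changed_pages reference-white-page.bmp somewhereelse would print only the pages that are not entirely white.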

Solution 3

I had this very problem myself and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).

<?php
$im1 = new \Imagick("file1.pdf");
$im2 = new \Imagick("file2.pdf");

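// compareImages() returns an array: [0] is a difference image,
// [1] is the value of the chosen metric (0.0 means no difference).
$result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);

if ($result[1] > 0.0) {
    // Files are DIFFERENT
} else {
    // Files are IDENTICAL
}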

$im1->destroy();
$im2->destroy();

Of course, you need to install the ImageMagick bindings first:

sudo apt-get install php5-imagick # Ubuntu/Debian (the package is named php-imagick on newer releases)
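
If PHP is not an option, roughly the same yes/no check as in the PHP snippet above can be sketched with ImageMagick's compare on the command line. Treat this as an illustration only: first_pages_identical is a made-up name, it looks at the first page only, and it relies on compare exiting with 0 when the images are the same, a nonzero status when they differ, and 2 on error, which is how ImageMagick documents it.

```shell
# Sketch: succeed (exit status 0) only if the first pages of two PDFs
# render identically. first_pages_identical is a hypothetical name.
first_pages_identical() {
    # "[0]" selects the first page of each PDF; "null:" discards the
    # difference image. compare prints the metric value, silenced here.
    compare -metric MSE "$1[0]" "$2[0]" null: >/dev/null 2>&1
}
```

It could be used as: if first_pages_identical file1.pdf file2.pdf; then echo IDENTICAL; else echo DIFFERENT; fi. Note that an error (for example a missing file) also ends up in the else branch here.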
Author by

Admin

Updated on July 18, 2022

Comments

  • Admin
    Admin almost 2 years

    I need to compare the contents of two nearly identical files and highlight the differing portions in the corresponding PDF file. I am using PDFBox. Please help me at least with the logic.

  • scc
    scc over 8 years
    I got an error when I tried to download that file: "The transferred file contained a virus and was therefore blocked. URL: testautomationguru.com/download/304 Media Type: application/java-vm Virus Name: McAfeeGW: BehavesLike.Java.Suspicious.xm"
  • Brecht Machiels
    Brecht Machiels about 8 years
    Here's a script to visually diff two PDFs page-by-page using ImageMagick and Poppler tools (for speed): gist.github.com/brechtm/891de9f72516c1b2cbc1. It outputs one JPG for each page of the PDFs in a pdfdiff directory and additionally prints the numbers of the pages which differ between the two PDFs.
  • Nachiappan R
    Nachiappan R over 7 years
    I tried to run the jar file from the above-mentioned site, but I am getting an error like "no main manifest attribute, in taguru-pdf-util.jar". Could you please help me with this?
  • caw
    caw over 7 years
    Is there any chance you could use this on the CLI, skip the GUI and redirect the output directly to a file?
  • Kurt Pfeifle
    Kurt Pfeifle over 7 years
    @caw: (1) Did you see my other answer? -- (2) AFAIK, newer versions of DiffPDF can redirect output to a CSV file. I don't know if this completely skips the GUI, though. -- (3) There is a "purely-CLI" version of DiffPDF available, called DiffPDFc, to be found at www.qtrac.eu -- however, it is for Windows only.
  • caw
    caw over 7 years
    I haven't, but tried ImageMagick, pdftk and Ghostscript before. Not in that combination, but separately. Since the results of diffpdf are so good, in fact excellent, I had hoped that all this functionality which is already there could just be used to redirect into a PDF on the CLI. What a pity! Thanks for the information on the other versions of that tool as well. Unfortunately, newer versions are not open-source anymore and Windows-only is not perfect, either.
  • snapshot
    snapshot over 5 years
    I also needed to install Ghostscript.
  • Luciano Fantuzzi
    Luciano Fantuzzi about 2 years
    This is the right solution.