Comparison of two PDF files
Solution 1
If you prefer a tool with a GUI, you could try diffpdf. It's by Mark Summerfield, and since it's written with Qt, it should be available (or at least buildable) on every platform Qt runs on.
Solution 2
You can do the same thing with a shell script on Linux. The script wraps 3 components:
- ImageMagick's compare command
- the pdftk utility
- Ghostscript
It's rather easy to translate this into a .bat batch file for DOS/Windows...
Here are the building blocks:
pdftk
Use these commands to split multi-page PDF files into multiple single-page PDFs:
pdftk first.pdf burst output somewhere/firstpdf_page_%03d.pdf
pdftk 2nd.pdf burst output somewhere/2ndpdf_page_%03d.pdf
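The two burst calls can be wrapped in a small helper that also creates the output directory first. This is only a sketch: the function name burst_pdf and its prefix argument are made up for illustration; the pdftk invocation is the one shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original answer): split one PDF into
# single-page PDFs named <outdir>/<prefix>_page_001.pdf, _002.pdf, ...
burst_pdf() {
    local input="$1" outdir="$2" prefix="$3"
    mkdir -p "$outdir" || return 1
    # pdftk expands %03d to the zero-padded page number.
    pdftk "$input" burst output "$outdir/${prefix}_page_%03d.pdf"
}

# Usage:
#   burst_pdf first.pdf somewhere firstpdf
#   burst_pdf 2nd.pdf   somewhere 2ndpdf
```

Keeping the prefix as a parameter lets both input files land in the same directory without their page files overwriting each other.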
compare
Use this command to create a "diff" PDF page for each of the pages:
compare \
-verbose \
-debug coder -log "%u %m:%l %e" \
somewhere/firstpdf_page_001.pdf \
somewhere/2ndpdf_page_001.pdf \
-compose src \
somewhereelse/diff_page_001.pdf
Note that compare is part of ImageMagick. For PDF processing, however, it needs Ghostscript as a 'delegate', because it cannot read PDF natively.
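To cover every page rather than just page 001, the compare call can be put in a loop. The following is a rough sketch, assuming the page files were named firstpdf_page_NNN.pdf and 2ndpdf_page_NNN.pdf as in the burst step; the function name diff_pages is hypothetical, and the debug and logging options are left out for brevity.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run compare for every burst page of the first PDF
# against the same-numbered page of the second PDF.
diff_pages() {
    local indir="$1" outdir="$2" first page second
    mkdir -p "$outdir" || return 1
    for first in "$indir"/firstpdf_page_*.pdf; do
        [ -e "$first" ] || continue          # the glob matched nothing
        page="${first##*_page_}"             # e.g. "001.pdf"
        second="$indir/2ndpdf_page_$page"
        if [ ! -e "$second" ]; then
            echo "no matching page for $first" >&2
            continue
        fi
        # compare may exit non-zero when the pages differ, so its status
        # is deliberately ignored here.
        compare "$first" "$second" -compose src "$outdir/diff_page_$page" || true
    done
}

# Usage:
#   diff_pages somewhere somewhereelse
```

Pages that exist in only one of the two PDFs are reported on stderr and skipped, since there is nothing to compare them against.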
Once more, pdftk
Now you can concatenate your "diff" PDF pages again with pdftk:
pdftk \
somewhereelse/diff_page_*.pdf \
cat \
output somewhereelse/diff_allpages.pdf
Ghostscript
Ghostscript automatically inserts metadata (such as the current date and time) into its PDF output, so MD5-hash-based comparisons of PDF files do not work well.
If you want to automatically discover all diff pages that are purely white (that is, pages where the inputs show no visible differences), you can convert to a metadata-free bitmap format using the bmp256 output device. You can do that for the original PDFs (first.pdf and 2nd.pdf), or for the diff PDF pages:
gs \
-o diff_page_001.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
diff_page_001.pdf
md5sum diff_page_001.bmp
For reference, create an all-white BMP page and compute its MD5 sum like this:
gs \
-o reference-white-page.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
-c "showpage quit"
md5sum reference-white-page.bmp
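With the reference sum at hand, the check can be automated. This is a minimal sketch, assuming all bitmaps were rendered with the same -r and -g settings as the reference page; the function name list_changed_pages is made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every diff bitmap whose MD5 sum differs from
# that of the all-white reference page, i.e. pages with visible differences.
list_changed_pages() {
    local reference="$1" ref_sum sum bmp
    shift
    [ -r "$reference" ] || return 1
    # Reading from stdin keeps the file name out of md5sum's output.
    ref_sum=$(md5sum < "$reference" | cut -d' ' -f1)
    for bmp in "$@"; do
        sum=$(md5sum < "$bmp" | cut -d' ' -f1)
        [ "$sum" = "$ref_sum" ] || printf '%s\n' "$bmp"
    done
}

# Usage:
#   list_changed_pages reference-white-page.bmp diff_page_*.bmp
```

An empty result then means that every diff page matched the white reference.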
Solution 3
I had this very problem myself and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).
<?php
$im1 = new \Imagick("file1.pdf");
$im2 = new \Imagick("file2.pdf");
// compareImages() returns an array: [0] is the difference image, [1] the metric value.
$result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);
if ($result[1] > 0.0) {
    // Files are DIFFERENT
} else {
    // Files are IDENTICAL
}
$im1->destroy();
$im2->destroy();
Of course, you need to install the ImageMagick bindings first:
sudo apt-get install php5-imagick # Ubuntu/Debian (the package is named php-imagick on newer releases)
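A similar yes/no check can be sketched on the command line with ImageMagick's compare and the same metric. Several things here are assumptions rather than part of the answer above: the function name first_pages_identical is hypothetical, only the first page of each file is compared (the [0] suffix), and the result is taken from compare's exit status, which ImageMagick's documentation describes as 0 for similar images and 2 for errors. Any other status is treated as "different".

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare the first page of two PDFs via compare's
# exit status. Needs ImageMagick plus Ghostscript for PDF input.
first_pages_identical() {
    # "[0]" selects the first page; "null:" discards the difference image.
    compare -metric MSE "${1}[0]" "${2}[0]" null: >/dev/null 2>&1
    case $? in
        0) echo "identical" ;;
        2) echo "error"; return 2 ;;
        *) echo "different"; return 1 ;;
    esac
}

# Usage:
#   first_pages_identical file1.pdf file2.pdf
```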
Admin
Updated on July 18, 2022
Comments
-
Admin almost 2 years
I need to compare the contents of two nearly identical files and highlight the differing portions in the corresponding PDF file. I am using PDFBox. Please help me at least with the logic.
-
scc over 8 years
I got an error when I tried to download that file: "The transferred file contained a virus and was therefore blocked. URL: testautomationguru.com/download/304 Media Type: application/java-vm Virus Name: McAfeeGW: BehavesLike.Java.Suspicious.xm"
-
Brecht Machiels about 8 years
Here's a script to visually diff two PDFs page-by-page using ImageMagick and Poppler tools (for speed): gist.github.com/brechtm/891de9f72516c1b2cbc1. It outputs one JPG for each page of the PDFs in a pdfdiff directory and additionally prints the numbers of the pages which differ between the two PDFs.
-
Nachiappan R over 7 years
I tried to run the JAR file from the above-mentioned site, but I am getting an error like "no main manifest attribute, in taguru-pdf-util.jar". Could you please help me with this?
-
caw over 7 years
Is there any chance you could use this on the CLI, skip the GUI and redirect the output directly to a file?
-
Kurt Pfeifle over 7 years
@caw: (1) Did you see my other answer? -- (2) AFAIK, newer versions of DiffPDF can redirect output to a CSV file. I don't know if this completely skips the GUI, though. -- (3) There is a "purely-CLI" version of DiffPDF available, called DiffPDFc, to be found at www.qtrac.eu -- however, it is for Windows only.
-
caw over 7 years
I haven't, but I tried ImageMagick, pdftk and Ghostscript before. Not in that combination, but separately. Since the results of diffpdf are so good, in fact excellent, I had hoped that all this functionality which is already there could just be used to redirect into a PDF on the CLI. What a pity! Thanks for the information on the other versions of that tool as well. Unfortunately, newer versions are not open-source anymore and Windows-only is not perfect, either.
-
snapshot over 5 years
I also need to install Ghostscript.
-
Luciano Fantuzzi about 2 years
This is the right solution.