Reliable way to (programmatically) compare PDFs?


Solution 1

There are quite a few software products that claim to diff PDFs. I've never needed to use one, but if this is going to be a recurring process, I think it would be wise for your company to invest in one of them. Just Google "pdf diff" for a list of potential applications.

Additionally, your situation is very similar to this question: Tool to compare large numbers of PDF files? I think its discussion may help.

Solution 2

I am a developer of the Docotic.Pdf library. We use PDF comparison in unit tests to check that a test produces the expected PDF. A PDF is a collection of special objects, and we compare all PDF objects while ignoring some properties such as trailer IDs and creator info. This implementation works fine.

You can try the PdfDocument.DocumentsAreEqual method. It only tells you whether the documents are equal; it does not report specific differences. You may contact us if you need more functionality.
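
To illustrate the idea of ignoring volatile properties, here is a minimal Python sketch. It is not how Docotic.Pdf works internally: it is a byte-level approximation that blanks out the trailer `/ID` and a few document info entries before hashing. The pattern list and the function names (`normalized_digest`, `documents_look_equal`) are assumptions made for this example. It only helps when those fields are stored uncompressed and the rest of the file is byte-identical; a real comparison would parse and compare the PDF objects.

```python
import hashlib
import re

# Fields that commonly differ between otherwise identical PDFs.
# This list is an assumption for illustration, not an official rule set.
VOLATILE_PATTERNS = [
    # Trailer file identifier: /ID [<hex> <hex>]
    re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]"),
    # Info dictionary entries stored as literal strings: /Creator (...)
    re.compile(rb"/(CreationDate|ModDate|Producer|Creator)\s*\((?:\\.|[^\\)])*\)"),
]


def normalized_digest(pdf_bytes: bytes) -> str:
    """Hash the PDF bytes after removing the volatile fields."""
    for pattern in VOLATILE_PATTERNS:
        pdf_bytes = pattern.sub(b"", pdf_bytes)
    return hashlib.sha256(pdf_bytes).hexdigest()


def documents_look_equal(a: bytes, b: bytes) -> bool:
    """Return True if the two PDFs match once volatile fields are ignored."""
    return normalized_digest(a) == normalized_digest(b)
```

Like DocumentsAreEqual, this gives only a yes/no answer. Note that if a volatile value changes length, the byte offsets in the cross-reference table shift too, and this sketch will report the files as different.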

Solution 3

I took the approach of getting the raw data out of the PDF, then using Word, TortoiseSVN, WinMerge, etc. to take care of the comparison. In my case I did the comparison in a RichTextBox in C#, coloring the differences, since we wanted it all within our app.

Here is what I did: PDF comparison. I was trying to compare mixed documents, both Word and PDF.

However, I would recommend PDFBox for the parsing; it is a bit more elegant, although iTextSharp worked out OK.
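
Once the text is out of the PDFs, the comparison step can be sketched in a few lines. The Python example below assumes the text of each document has already been extracted into a string (by PDFBox, iTextSharp, or any other tool); the function name `diff_extracted_text` and the whitespace normalization are choices made for this example, not part of the approach described above.

```python
import difflib


def diff_extracted_text(old_text: str, new_text: str,
                        old_label: str = "old", new_label: str = "new") -> list[str]:
    """Return a unified diff of two extracted-text strings.

    Lines are stripped and blank lines dropped, because text extraction
    often produces whitespace noise that is not a real change.
    """
    old_lines = [line.strip() for line in old_text.splitlines() if line.strip()]
    new_lines = [line.strip() for line in new_text.splitlines() if line.strip()]
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_label, tofile=new_label,
                                     lineterm=""))


if __name__ == "__main__":
    last_year = "Form A\nTax year 2021\nName:\n"
    this_year = "Form A\nTax year 2022\nName:\n"
    for line in diff_extracted_text(last_year, this_year):
        print(line)
```

An empty result means the extracted text is identical. Layout, fonts, and form-field changes are invisible to this kind of comparison.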

Solution 4

I wrote a blog post suggesting some approaches to comparing PDF files: https://blog.idrsolutions.com/2010/09/comparing-2-pdf-files/

Author: JohnIdol


Updated on June 29, 2022

Comments

  • JohnIdol
    JohnIdol almost 2 years

    Possible Duplicate:
    Tool to compare large numbers of PDF files?

    I am in the classic scenario where the business gives you a bunch of new PDF forms for the new year, with no revision notes whatsoever, and you are supposed to figure out what is different from the previous year's forms.

    I am talking about loads of forms here, so I am trying to find a way to compare PDFs and outline the differences without having people manually go through each and every one of them.

    My idea was to extract all the text from the PDFs, dump it into .txt files, and then diff the text files, but that sounds horrible.

    My question says programmatically, but I'd be happy with any reliable tool for comparing PDFs, and I'm mainly looking to get an idea from people's experiences. I'm also willing to entertain any programmatic solutions (preferably in C#, but please throw out any ideas).

  • JohnIdol
    JohnIdol over 13 years
    Thanks for that. That question is indeed very similar (for some reason it didn't pop up when I composed mine).
  • vsingh
    vsingh over 13 years
    Convert the PDF to an image, then compare, and it still needs human intervention? How is this useful then?
  • mark stephens
    mark stephens over 13 years
    The software can tell you if they have not changed so you know you have not broken anything. Only a human can evaluate any changes.
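
The triage idea in the last comment (let software rule out the unchanged files, so people only review the rest) can be sketched for a whole batch of forms. This Python example assumes last year's and this year's PDFs sit in two folders and share file names; the folder layout, the function names, and the use of a plain SHA-256 over the file bytes are all assumptions made for illustration. A byte-level hash flags a regenerated PDF as changed even if only its metadata differs, so "changed" here means "needs a closer look", not "content differs".

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the raw file bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def triage(old_dir: Path, new_dir: Path) -> dict[str, list[str]]:
    """Sort PDFs, matched by file name, into unchanged/changed/added/removed."""
    old = {p.name: p for p in old_dir.glob("*.pdf")}
    new = {p.name: p for p in new_dir.glob("*.pdf")}
    report: dict[str, list[str]] = {
        "unchanged": [], "changed": [], "added": [], "removed": [],
    }
    for name in sorted(old.keys() | new.keys()):
        if name not in old:
            report["added"].append(name)
        elif name not in new:
            report["removed"].append(name)
        elif file_digest(old[name]) == file_digest(new[name]):
            report["unchanged"].append(name)
        else:
            report["changed"].append(name)
    return report
```

Only the files listed under "changed" and "added" then need a text diff, a visual diff, or a human reviewer.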