Performance iText vs. PdfBox

My question is what the performance depends on; is there a way to make PdfBox faster?

One major difference is that PDFBox always processes text glyph by glyph, while iText normally processes it chunk by chunk (a chunk being the single string parameter of a text drawing operation); that reduces the resources iText requires quite a lot. Furthermore, the event-oriented architecture of iText text parsing places a lower burden on resources than that of PDFBox. PDFBox also keeps information that is not strictly required for plain text extraction available for a longer time, which costs more resources.
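To make the granularity argument concrete, here is a toy model in plain Java. It is not code from either library: the class and method names are made up for illustration, and one character stands in for one glyph (an assumption that does not hold for every PDF font). It only counts how many text events a parser would have to create and pass around for the same drawn strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (hypothetical, not library code): text events produced for the same input. */
final class TextEventGranularity {

    /** Chunk-wise: one event per string operand of a text drawing operation. */
    static List<String> chunkEvents(List<String> drawnStrings) {
        return new ArrayList<>(drawnStrings);
    }

    /** Glyph-wise: one event per character of every drawn string. */
    static List<String> glyphEvents(List<String> drawnStrings) {
        List<String> events = new ArrayList<>();
        for (String s : drawnStrings) {
            for (int i = 0; i < s.length(); i++) {
                events.add(String.valueOf(s.charAt(i)));
            }
        }
        return events;
    }

    public static void main(String[] args) {
        List<String> drawn = Arrays.asList("Effective", " ", "Java");
        System.out.println("chunk events: " + chunkEvents(drawn).size()); // 3
        System.out.println("glyph events: " + glyphEvents(drawn).size()); // 14
    }
}
```

Every event in a real parser also carries position and font data, so the gap in object count translates into more allocation and more work per page.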

But the way the libraries initially load the document may also make a difference. Here you can experiment a bit: PDFBox offers not only multiple PDDocument.load overloads but also some PDDocument.loadNonSeq overloads (actually, PDDocument.loadNonSeq reads documents correctly, while PDDocument.load can be tricked into misinterpreting PDFs). These variants may differ in runtime behavior.
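For experimenting with the loading variants, a small timing helper may be enough. The following is a minimal sketch using only the JDK; the class name LoadTiming and its methods are hypothetical, and the workload in main is a placeholder that you would replace with the load-and-extract call you want to measure.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: measures the wall-clock time of one load-and-extract variant. */
final class LoadTiming {

    /** Runs the task once and returns the elapsed time in milliseconds. */
    static long timeMillis(Callable<?> task) {
        long start = System.nanoTime();
        try {
            task.call();
        } catch (Exception e) {
            throw new IllegalStateException("measured task failed", e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Repeats the task and keeps the fastest run, which dampens JIT warm-up and I/O noise. */
    static long bestOf(int runs, Callable<?> task) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            best = Math.min(best, timeMillis(task));
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder workload; swap in the variant under test.
        Callable<String> variant = () -> {
            Thread.sleep(20);
            return "extracted text";
        };
        System.out.println("best of 3: " + bestOf(3, variant) + " ms");
    }
}
```

With PDFBox 1.8.x on the classpath, one task could wrap PDDocument.load(pdf) and another PDDocument.loadNonSeq(pdf, null), each followed by the same PDFTextStripper call, so that only the loading step differs between measurements.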

Can you explain more about how strategies affect performance?

iText brings along a simple and a more advanced text extraction strategy. The simple one assumes that text in the page content stream appears in reading order, while the more advanced one sorts. By default the more advanced one is used. Thus, you can probably speed up iText even more by using the simple strategy. PDFBox always sorts.
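The difference between the two kinds of strategy can be sketched in plain Java. This is a simplified model, not iText's implementation: the Chunk class and both methods are hypothetical, and real strategies also handle line breaks and spacing. The simple variant concatenates chunks in content stream order; the sorting variant has to keep all chunks of a page and order them by position first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Simplified model (hypothetical, not iText code) of the two extraction orders. */
final class ExtractionOrderSketch {

    /** A text chunk with the start position of its baseline in PDF coordinates. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Simple strategy: trust the content stream order and concatenate. */
    static String simple(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Sorting strategy: top to bottom (larger y first), then left to right. */
    static String sorted(List<Chunk> chunks) {
        List<Chunk> copy = new ArrayList<>(chunks);
        copy.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed()
                .thenComparingDouble(c -> c.x));
        return simple(copy);
    }

    public static void main(String[] args) {
        List<Chunk> page = Arrays.asList(
                new Chunk("World", 60f, 700f),   // drawn first, but to the right
                new Chunk("Hello ", 10f, 700f),
                new Chunk("Footer", 10f, 50f));
        System.out.println(simple(page)); // WorldHello Footer
        System.out.println(sorted(page)); // Hello WorldFooter
    }
}
```

In iText 5 itself, the strategy is chosen through the three-argument overload, for example PdfTextExtractor.getTextFromPage(reader, i, new SimpleTextExtractionStrategy()). The output is only correct if the PDF really draws its text in reading order.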

Author: meilechh

Updated on July 09, 2022

Comments

  • meilechh
    meilechh almost 2 years

    I'm trying to convert a PDF (my favorite book, Effective Java, if it matters) to text. I checked both iText and Apache PdfBox, and I see a really big difference in performance: with iText it took 2:521, and with PdfBox 6:117. This is my code for PdfBox:

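    PDFTextStripper stripper = new PDFTextStripper();
    PDDocument document = PDDocument.load(pdf);
    BUFFER.append(stripper.getText(document));
    document.close(); // release the resources held by the loaded document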
    

    And this is my code for iText:

    PdfReader reader = new PdfReader(pdf);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
      BUFFER.append(PdfTextExtractor.getTextFromPage(reader, i));
    }
    reader.close(); // release the underlying file
    

    My question is what the performance depends on. Is there a way to make PdfBox faster, or should I just use iText? And can you explain more about how strategies affect performance?

  • mkl
    mkl almost 6 years
    The remarks on the PDDocument.load method to use refer to the PDFBox 1.8.x architecture. Since 2.0.0 the former correct PDDocument.loadNonSeq has become PDDocument.load and the former incorrect PDDocument.load has been dropped.
  • Tilman Hausherr
    Tilman Hausherr over 5 years
    While I am happy that these changes are noticed, none of them plays any role in text extraction. It is possible that other changes have improved text extraction performance since 2014, but I doubt that we're now faster than iText, assuming that the values mentioned are correct (I have not tried this myself with my copy of "Effective Java").
  • Aakash Patel
    Aakash Patel over 4 years
    We are required to move from iText to some other PDF API due to license constraints. What do you suggest as the best replacement for iText? Either open source or a one-time payment for a license is fine.
  • mkl
    mkl over 4 years
    That would be a question for the software recommendation stack exchange site. When asking there, don't forget to explain your use cases. E.g. there may be different recommendations if your use case is text extraction or if it is pdf creation.
  • mkl
    mkl almost 4 years
    "PDFBox always sorts." - actually this sorting can be disabled by an appropriately named setter. The default is on, though.