Difference between iTextSharp 4.1.6 and 5.x versions


I'm the CTO of iText Software, so just like Michaël, who already answered in the comment section, I'm both the most authoritative source and a biased one.

There's a very simple comparison chart on the iText web site.

This chart doesn't cover text extraction, so allow me to list the relevant improvements since iText 5.

You've probably also found this page.

In case you wonder about the bug fixes and the performance improvements regarding text parsing, this is a more exhaustive list:

  • 5.0.0: Text extraction: major overhaul to perform calculations in user space. This allows the parser to correctly determine line breaks, even if the text or page is rotated.
  • 5.0.1: Refactored callback so method signature won't need to change as render callback API evolves.
  • 5.0.1: Refactoring to make it easier for outside users to interact with the content stream processor. Also refactored render listener so text and image event listening occurs in the same interface (reduces a lot of non-value-add complexity)
  • 5.0.1: New filtering functionality for text renderers.
  • 5.0.1: Additional utility method for previewing PDF content.
  • 5.0.1: Added a much more advanced text renderer listener that can reconstruct page content based on physical location of text on the page
  • 5.0.1: Added support for XObject Form processing (text added via PdfTemplate can now be parsed)
  • 5.0.1: Added rudimentary support for XObject Image callbacks
  • 5.0.1: Bug fix - text extraction wasn't correct for certain page orientations
  • 5.0.1: Bug fix - matrices were being concatenated in the wrong order.
  • 5.0.1: PdfTextExtractor: changed the default render listener (new location aware strategy)
  • 5.0.1: Getters for GraphicsState
  • 5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
  • 5.0.2: CMapAwareDocumentFont: Tweaks to make processing quasi-invalid PDF files more robust
  • 5.0.2: PdfContentReaderTool: null pointer handling, plus a few well placed flush calls
  • 5.0.2: PdfContentReaderTool: Show details on resource entries
  • 5.0.2: PdfContentStreamProcessor: Adjustment so embedded images don't cause parsing problems and improvements to EI detection
  • 5.0.2: LocationTextExtractionStrategy: Fixed anti-parallel algorithm, plus accounting for negative inter-character offsets. Change to text extraction strategy that builds out the text model first, then computes concatenation requirements.
  • 5.0.2: Adjustments to linesegment implementation; optimization of changes made by Bruno to text extraction; for example: introduction of the class MarkedContentInfo.
  • 5.0.3: added method to get area of image in user units
  • 5.0.3: better parsing of inline images
  • 5.0.3: Adding an extra check for begin/end sequences when parsing a ToUnicode stream.
  • 5.0.4: Content streams in arrays should be parsed as if they were separated by whitespace
  • 5.0.4: Expose CTM
  • 5.0.4: Refactor to pull inline image processing into its own class. Added parsing of image data if there is no filter applied (there are some PDFs where there is no white space between the end of the image data and the EI operator). Ultimately, it will be best to actually parse the image data, but this will require a pretty big refactoring of the iText decoders (to work from streams instead of byte[] of known lengths).
  • 5.0.4: Handle multi-stage filters; Correct bug that pulled whitespace as first byte of inline image stream.
  • 5.0.4: Applying stream filters to inline images.
  • 5.0.4: PdfReader: Expose filter decoder for arbitrary byte arrays (instead of only streams)
  • 5.0.6: CMapParser: Fix to read broken ToUnicode cmaps.
  • 5.0.6: handle slightly malformed embedded images
  • 5.0.6: CMapAwareDocumentFont: Some PDFs have a diff map bigger than 256 characters.
  • 5.0.6: performance: Cache the fonts used in text extraction
  • 5.1.2: PRTokeniser: Made the algorithm to find startxref more memory efficient.
  • 5.1.2: RandomAccessFileOrArray: Improved handling for huge files that can't be mapped
  • 5.1.2: CMapAwareDocumentFont: fix NPE if mapping doesn't get initialized (I'd rather wind up with junk characters than throw an unexpected exception down the road)
  • 5.1.3: refactoring of how filters are applied to streams, adjust parser so it can handle multi-stage filters
  • 5.1.3: images: allow correct decoding of 1bpc bitmask images
  • 5.1.3: images: add jbig2 streams to pass through
  • 5.1.3: images: handle null and indirect references in decode parameters, throw exception if unable to decode an image
  • 5.2.0: Better error messages and better handling of zero-sized files and attempts to read past the end of the file.
  • 5.2.0: Removed restriction that using memory mapping requires the file be smaller than ~2GB.
  • 5.2.0: Avoid NullPointerException in RandomAccessFileOrArray
  • 5.2.0: Made a utility method in PdfContentStreamProcessor private and clarified the stateful nature of the class
  • 5.2.0: LocationTextExtractionStrategy: bounds checking on string lengths and refactoring to make code easier to read.
  • 5.2.0: Better handling of color space dictionaries in images.
  • 5.2.0: improve handling of quasi improper inline image content.
  • 5.2.0: don't decode inline image streams until we absolutely need them.
  • 5.2.0: avoid NullPointerException if resource dictionary isn't provided.
  • 5.3.0: LocationTextExtractionStrategy: old comparison approach caused runtime exceptions in Java 7
  • 5.3.3: incorporate the text-rise parameter
  • 5.3.3: expose glyph-by-glyph information
  • 5.3.3: Bugfix: text to user space transformation was being applied multiple times for sub-textrenderinfo objects
  • 5.3.3: Bugfix: Correct baseline calculation so it doesn't include final character spacing
  • 5.3.4: Added low-level filtering hook to LocationTextExtractionStrategy.
  • 5.3.5: Fixed bug in PRTokeniser: handle case where number is at end of stream.
  • 5.3.5: Replaced StringBuffer with StringBuilder in PRTokeniser for performance reasons.
  • 5.4.2: Added an isChunkAtWordBoundary() method to LocationTextExtractionStrategy to check if a space character should be inserted between a previous chunk and the current one.
  • 5.4.2: Added a getCharSpaceWidth() method to LocationTextExtractionStrategy to get the width of a space character.
  • 5.4.2: Added a getText() method to LocationTextExtractionStrategy to get the text of the current Chunk.
  • 5.4.2: Added an appendTextChunk() method to SimpleTextExtractionStrategy to expose the append process so that subclasses can add text from outside the text parse operation.
  • 5.4.5: Added MultiFilteredRenderListener class for PDF parser.
  • 5.4.5: Added GlyphRenderListener and GlyphTextRenderListener classes for processing each glyph rather than processing chunks of text.
  • 5.4.5: Added method getMcid() in TextRenderInfo.
  • 5.4.5: fixed resource leak when many inline images were in content stream
  • 5.5.0: CMapAwareDocumentFont: if font space width isn't defined, use the default width for the font.
  • 5.5.0: PdfContentReader: avoid exception when displaying an empty dictionary.
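
To make the idea behind a location-aware strategy more concrete, here is a deliberately simplified sketch in plain Java. It is not the iText implementation: the `Chunk` class, the `LINE_TOLERANCE` and `SPACE_GAP` constants, and the bucketing of baselines into lines are all assumptions made for illustration. It only shows the general principle of ordering text by its physical position on the page instead of by its order in the content stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Simplified illustration of location-based text ordering; NOT iText's actual code. */
class LocationOrderSketch {

    /** Hypothetical stand-in for a piece of text reported while parsing a page. */
    static final class Chunk {
        final String text;
        final float x;      // left edge in user space
        final float y;      // baseline in user space (larger means higher on the page)
        final float width;  // rendered width of the chunk

        Chunk(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** Assumed bucket height: baselines that round to the same bucket count as one line. */
    static final float LINE_TOLERANCE = 2f;

    /** Assumed horizontal gap above which a space is inserted between chunks. */
    static final float SPACE_GAP = 1.5f;

    /** Maps a chunk to a line bucket so that sorting uses a consistent total order. */
    static int lineOf(Chunk c) {
        return Math.round(c.y / LINE_TOLERANCE);
    }

    /** Rebuilds the text from chunk positions, ignoring the order in which chunks arrive. */
    static String extract(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Top of the page first (y grows upward in PDF user space), then left to right.
        sorted.sort(Comparator.comparingInt((Chunk c) -> -lineOf(c))
                .thenComparingDouble(c -> c.x));

        StringBuilder out = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : sorted) {
            if (prev != null) {
                if (lineOf(prev) != lineOf(c)) {
                    out.append('\n');                       // new line on the page
                } else if (c.x - (prev.x + prev.width) > SPACE_GAP) {
                    out.append(' ');                        // visible gap: word boundary
                }
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Chunks are listed in a scrambled "content stream" order on purpose.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("World", 135f, 700f, 30f),
                new Chunk("ond", 118f, 680f, 18f),
                new Chunk("Hello", 100f, 700.4f, 30f),
                new Chunk("Sec", 100f, 680f, 18f));
        System.out.println(extract(chunks));
    }
}
```

With the sample chunks in `main`, this prints "Hello World" on the first line and "Second" on the next, even though the chunks arrive out of order. A real parser also has to deal with rotated text, fonts without a defined space width, and overlapping chunks, which is where many of the fixes listed above come from.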

There are some things that you won't be able to do if you don't upgrade. For instance, you won't be able to do the things described in these slides.

If you look at the roadmap for iText, you'll see that we'll invest even more time on text extraction in the future.

In all honesty: using the 5-year-old version wouldn't only be like reinventing the wheel, it would also be like falling into every pitfall we've fallen into over the last 5 years. I can assure you that buying a license will be less expensive.

Author: Shanky

Updated on June 04, 2022

Comments

  • Shanky
    Shanky almost 2 years

    We are developing a PDF parser to be used along with our system. The requirement is that we store all the information from any PDF document and are able to reproduce the document as such (with minimal changes from the original document).

    We did some googling and found iTextSharp to be the best fit for our purpose. We are developing our project using .NET.

    As you might have guessed from the title, I am looking for a comparison between specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp with the LGPL/MPL license. The 5.x versions are AGPL.

    We would like to have a good comparison between the versions before choosing between the LGPL version and buying a commercial license for the AGPL version (we don't want to publish our code).

    I did some browsing through the revision changes in iTextSharp, but I would like to know if any content exists that makes a good comparison between the versions.

    Thanks in advance!

  • Shanky
    Shanky almost 10 years
    @Lowagie Thanks a lot for turning up! I would like to hear about the legal infringement that may occur if I use v4.1.6. The iText page on Wikipedia says: "However, developer Bruno Lowagie has warned that versions prior to 5 may have included code that was not legally licensed under the LGPL, so that closed-source users of previous versions may be liable for copyright infringement."
  • Bruno Lowagie
    Bruno Lowagie almost 10 years
    As you are not a customer, there is no reason why we should disclose this info. Moreover: we've agreed with the contributor who introduced the code that we would not disclose this information. In any case: you're not doing anyone (not yourself, not your customers) a favor if you decide to have them invest in an old version of iText.
  • Shanky
    Shanky almost 10 years
    Thanks. Just confirming: is it worthwhile to buy a license for 5.x? I have one more query, which is outside the scope of this question. Do you have any tutorial or ebook that explains in more detail how to parse a PDF (extracting text, images, and other content) using iTextSharp? I do have your iText in Action editions 1 and 2, but even those books concentrate more on creating a PDF than on extraction. Please help me with some links.
  • Bruno Lowagie
    Bruno Lowagie almost 10 years
    I'm currently writing "The ABC of PDF". Only when that is finished, I'll start writing other books (depending on what customers need). We do have experience and material on this subject (as shown on these slides: slideshare.net/iTextPDF/itext-summit-2014-talk-unstructured-pdf ), but for now, these materials are only provided to our customer GlobalSubmit (the company that is mentioned in the slides). Wouldn't it be silly if we gave everything away for free? We'd be very bad engineers, wouldn't we?
  • Bruno Lowagie
    Bruno Lowagie over 7 years
    @richard There's no need to insult the developers who wrote the great software you love and use. As long as you distribute your software for free, you don't need to buy a commercial license. As you probably know, all code you find on Stack Overflow also comes with a license. If you copy a snippet from Stack Overflow, you are using it under the "Creative Commons By Attribution Share Alike" license. That means that you should share your code under the same CC-BY-SA license. That's the honest thing to do.
  • richard
    richard over 7 years
    I said it works great. That's not an insult to the developers.
  • Bruno Lowagie
    Bruno Lowagie over 7 years
    OK @richard I thought you meant to say that developers don't deserve to be paid for their work maintaining the software, keeping it up-to-date with the latest PDF standards. You sounded as if you expect developers to work for free.
  • Bruno Lowagie
    Bruno Lowagie over 7 years
    @amedee Er... This isn't what I heard: meta.stackexchange.com/questions/285711/…