PDFBox: working with very large PDFs.

In the 2.0.* versions, open the PDF like this:

PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());

This sets up PDFBox's buffering to use only temporary files (no main memory), with no size restriction.

Update 17.4.2018: The FAQ describes more ways to save memory. One that is not yet described there, but has been active since 2.0.9, is subsampling (skipping pixel lines/rows) during rendering, enabled with PDFRenderer.setSubsamplingAllowed(true). This saves memory for PDF files that contain huge images.
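A minimal sketch that combines the two settings above, assuming PDFBox 2.0.9 or later on the classpath. The class name LargePdfRender and the helper renderPage are made up for illustration and are not part of PDFBox; the 72 DPI value is an arbitrary low resolution.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

// Hypothetical helper class, not part of PDFBox.
public class LargePdfRender {

    // Renders a single page while PDFBox buffers in temporary files only.
    static BufferedImage renderPage(File pdf, int pageIndex, float dpi) throws IOException {
        // try-with-resources closes the document and releases its temp files
        try (PDDocument doc = PDDocument.load(pdf, MemoryUsageSetting.setupTempFileOnly())) {
            PDFRenderer renderer = new PDFRenderer(doc);
            // Allow the renderer to subsample large images (available since 2.0.9)
            renderer.setSubsamplingAllowed(true);
            return renderer.renderImageWithDPI(pageIndex, dpi);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF; page index 0 is the first page
        BufferedImage image = renderPage(new File(args[0]), 0, 72f);
        System.out.println(image.getWidth() + " x " + image.getHeight());
    }
}
```

Rendering one page per call and discarding the image before rendering the next keeps only one page image on the heap at a time.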

Author: Pengo

Updated on July 19, 2022

Comments

  • Pengo
    Pengo almost 2 years

    I am working with some very large PDFs, some over 7 GB in size. The PDFs have up to 20,000 pages and many full-page color images. I'd like to use PDFBox to work with the PDFs, but because of their size I get OutOfMemoryErrors when I attempt to open them.

    I'm working with version pdfbox-app-1.6.0, on Windows 7, using IntelliJ and Java 6.

    First I tried writing a simple program that just opened the PDF in a PDDocument and copied each page over to another PDDocument: http://ideone.com/arKhB

    Next I tried using the PDFBox CopyDoc example.

    Both examples run out of memory.

    I'm assuming this is because PDFBox is trying to read the whole document into memory. Is there a way to have it only open 1 page at a time? I know it would be slower processing, but at the moment I can't process anything.

  • Daredevil
    Daredevil over 5 years
    Hi, I had the same issue, but instead I am dealing with words (text) in PDF files. I tried to index about 10 million words in a single PDF file and it gives me an out-of-memory error: Java heap space. I tried your suggestion above but it still doesn't fix it. Any other ideas to try?
  • Tilman Hausherr
    Tilman Hausherr over 5 years
    Use more memory with -Xmx. And always make sure you're using the latest PDFBox version.
  • Daredevil
    Daredevil over 5 years
    I'm using PDFBox 2.0. I also tried altering -Xmx. It still throws an out-of-memory error.
  • Tilman Hausherr
    Tilman Hausherr over 5 years
    Which 2.0? And what -Xmx value? I use -Xmx1g and sometimes -Xmx4g for my work. If it still doesn't work, the best thing would be to create a new question and share the PDF along with the smallest possible code that reproduces the error. But I wonder what kind of PDF would have more than a million words?
  • Daredevil
    Daredevil over 5 years
    Well, I'm just stress testing to see how well PDFBox copes with such cases.
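
When changing -Xmx seems to have no effect, as in the exchange above, it can help to confirm that the JVM actually picked up the setting (an IDE run configuration may not pass it on). A minimal sketch using only the standard library; the class name HeapCheck and the method maxHeapMiB are made up for illustration.

```java
// Hypothetical helper for checking the effective heap limit.
public class HeapCheck {

    // Maximum heap the JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Run it with the same options as the failing program, for example java -Xmx4g HeapCheck. The reported value is usually somewhat below the -Xmx figure, depending on the JVM and garbage collector.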