Reading a table or cell value in a pdf file using java?

10,634

Solution 1

In the comments, the OP clarified how he locates the text value he wants to extract from the table in the PDF file:

By providing X and Y co-ordinates

Thus, while the question initially sounded like a request for generic extraction of tabular data from PDFs (which can be difficult, to say the least), it is essentially about extracting the text from a rectangular region on a page given by coordinates.

This is possible using either of the libraries you mentioned (and surely others, too).

iText

To restrict the region from which you want to extract text, you can use the RegionTextRenderFilter in a FilteredTextRenderListener, e.g.:

/**
 * Parses a specific area of a PDF to a plain text file.
 * @param pdf the original PDF
 * @param txt the resulting text
 * @throws IOException
 */
public void parsePdf(String pdf, String txt) throws IOException {
    PdfReader reader = new PdfReader(pdf);
    PrintWriter out = new PrintWriter(new FileOutputStream(txt));
    Rectangle rect = new Rectangle(70, 80, 490, 580);
    RenderFilter filter = new RegionTextRenderFilter(rect);
    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
    }
    out.flush();
    out.close();
    reader.close();
}

(ExtractPageContentArea sample from iText in Action, 2nd edition)

Beware, though, iText extracts text based on the basic text chunks in the content stream, not based on each individual glyph in such a chunk. Thus, the whole chunk is processed if only the tiniest part of it is in the area.

This may or may not suit you.

If you run into the problem that more is extracted than you wanted, you should split the chunks into their constituent glyphs beforehand. This Stack Overflow answer explains how to do that.
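
To illustrate the difference between chunk-level and glyph-level filtering, here is a minimal, library-free sketch. It is not iText code: the `Chunk` class, the fixed glyph width and both method names are assumptions made up for this illustration, and the region test is a simple rectangle intersection.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of chunk-level vs. glyph-level region filtering (not iText code). */
class ChunkFilterSketch {

    /** A run of text starting at (x, y); every glyph is assumed to have the same width. */
    static final class Chunk {
        final String text;
        final double x, y, glyphWidth, height;

        Chunk(String text, double x, double y, double glyphWidth, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.glyphWidth = glyphWidth;
            this.height = height;
        }

        Rectangle2D bounds() {
            return new Rectangle2D.Double(x, y, glyphWidth * text.length(), height);
        }
    }

    /** Chunk-level filtering: a chunk is kept whole if any part of it overlaps the region. */
    static String extractByChunk(List<Chunk> chunks, Rectangle2D region) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (c.bounds().intersects(region)) {
                sb.append(c.text);
            }
        }
        return sb.toString();
    }

    /** Glyph-level filtering: split every chunk into one-glyph chunks, then filter those. */
    static String extractByGlyph(List<Chunk> chunks, Rectangle2D region) {
        List<Chunk> glyphs = new ArrayList<>();
        for (Chunk c : chunks) {
            for (int i = 0; i < c.text.length(); i++) {
                glyphs.add(new Chunk(String.valueOf(c.text.charAt(i)),
                        c.x + i * c.glyphWidth, c.y, c.glyphWidth, c.height));
            }
        }
        return extractByChunk(glyphs, region);
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(new Chunk("Hello World", 0, 0, 10, 12));
        // The region covers only the right-hand part of the chunk.
        Rectangle2D region = new Rectangle2D.Double(60, 0, 50, 12);
        System.out.println(extractByChunk(chunks, region)); // Hello World
        System.out.println(extractByGlyph(chunks, region)); // World
    }
}
```

With chunk-level filtering the region pulls in the whole chunk; after splitting into glyphs only the characters that actually overlap the region remain.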

PDFBox

To restrict the region from which you want to extract text, you can use the PDFTextStripperByArea, e.g.:

PDDocument document = PDDocument.load( args[0] );
try
{
    if( document.isEncrypted() )
    {
        document.decrypt( "" );
    }
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition( true );
    // java.awt.Rectangle: x, y, width, height
    Rectangle rect = new Rectangle( 10, 280, 275, 60 );
    stripper.addRegion( "class1", rect );
    List allPages = document.getDocumentCatalog().getAllPages();
    PDPage firstPage = (PDPage)allPages.get( 0 );
    stripper.extractRegions( firstPage );
    System.out.println( "Text in the area:" + rect );
    System.out.println( stripper.getTextForRegion( "class1" ) );
}
finally
{
    // release the file handle even if extraction fails
    document.close();
}

(ExtractTextByArea from the PDFBox 1.8.8 examples)
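
Note that the two samples describe their rectangles differently. The iText `Rectangle` takes lower-left and upper-right corners (llx, lly, urx, ury) in PDF coordinates with the origin at the bottom left, while the PDFBox sample uses a `java.awt.Rectangle` (x, y, width, height). As far as I can tell, PDFBox compares the region against text positions whose y value is measured from the top of the page. The following sketch converts one form into the other under these assumptions: an unrotated page, a page box starting at (0, 0), and a known page height. The class and method names are made up for this illustration.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: converts a bottom-left-origin corner rectangle to a top-left-origin one. */
class RegionConversionSketch {

    /**
     * @param llx        lower-left x in PDF coordinates
     * @param lly        lower-left y in PDF coordinates
     * @param urx        upper-right x in PDF coordinates
     * @param ury        upper-right y in PDF coordinates
     * @param pageHeight height of the (unrotated) page in the same units
     * @return rectangle as x, y, width, height with y measured downwards from the top edge
     */
    static Rectangle2D toTopLeftOrigin(double llx, double lly, double urx, double ury,
                                       double pageHeight) {
        double width = urx - llx;
        double height = ury - lly;
        // The top edge of the region sits (pageHeight - ury) below the top of the page.
        return new Rectangle2D.Double(llx, pageHeight - ury, width, height);
    }

    public static void main(String[] args) {
        // The iText sample region on an A4 page (842 points high).
        System.out.println(toTopLeftOrigin(70, 80, 490, 580, 842));
    }
}
```

If the same coordinates yield different text in the two libraries, a mismatch like this is worth ruling out first.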

Solution 2

Try PDFTextStream. With it I am at least able to identify the column values. Earlier, I was using iText and got stuck defining an extraction strategy. It's hard.

This API separates column cells by inserting extra spaces. The spacing is consistent, so you can build your own logic on top of it (this was missing in iText).

import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;

public class PDFText {
    public static void main(String[] args) throws java.io.IOException {
        String pdfFilePath = "xyz.pdf";

        Document pdf = PDF.open(pdfFilePath);
        StringBuilder text = new StringBuilder(1024);
        pdf.pipe(new OutputTarget(text));
        pdf.close();
        System.out.println(text);
    }
}
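
The "logic" mentioned above could be as simple as splitting each extracted line on runs of spaces. The sketch below is independent of PDFTextStream. It assumes that cells are separated by at least two spaces and that a single space only occurs inside a cell; the threshold and the names are assumptions and will need tuning for real documents.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: splits a space-padded text line into table cells. */
class ColumnSplitSketch {

    /** Splits on runs of two or more spaces; single spaces inside a cell are kept. */
    static List<String> splitCells(String line) {
        return Arrays.asList(line.trim().split(" {2,}"));
    }

    public static void main(String[] args) {
        System.out.println(splitCells("Item A     12     3.50")); // [Item A, 12, 3.50]
    }
}
```

An empty cell cannot be detected this way, because its padding merges with that of the neighbouring cells; handling that would require the column positions.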

A related question has been asked on Stack Overflow.

Author by

sgelle

Updated on June 29, 2022

Comments

  • sgelle
    sgelle almost 2 years

    I have gone through Java and PDF forums looking for a way to extract a text value from a table in a PDF file, but couldn't find any solution except JPedal (which is not open source and requires a license).

    So, I would like to know of any open-source APIs, like PDFBox or iText, that achieve the same result as JPedal.

    Ref. Example:

    Sample Table

    • BretC
      BretC over 9 years
      I remember using a free library called iText many moons ago... itextpdf.com
    • Bruno Lowagie
      Bruno Lowagie over 9 years
      iText is licensed as open source too. See Is iText Java library free of charge or have any fees to be paid? for more info. This being said, you need to answer this counter-question before anyone can help you: is the PDF a Tagged PDF or not? If not, there is no table inside the PDF. Watch this video to learn more about structure. Where your human eyes may see a table, a machine may only see lines and characters without any structure.
    • mkl
      mkl over 9 years
      How do you locate the text value from the table in a pdf file?
    • sgelle
      sgelle over 9 years
      @mkl - By providing X and Y co-ordinates, this way JPedal implemented the logic.
    • mkl
      mkl over 9 years
      That's possible for others, too.
  • sgelle
    sgelle over 9 years
    Hi mkl, with this solution white spaces are truncated, so I am unable to tell which data belongs to which column. Is there any way to retain white spaces?
  • mkl
    mkl over 9 years
    For iText look at this answer which explains how to create a text extraction strategy based on the LocationTextExtractionStrategy which attempts to reflect the horizontal layout of the PDF by inserting spaces where necessary. Equivalent techniques should be possible for PDFBox.
  • mkl
    mkl about 9 years
    @sgelle This answer explains how to use PDFBox text extraction in a manner that attempts to reflect the horizontal layout of the PDF by inserting spaces where necessary.
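
The idea from the last two comments, reflecting the horizontal layout by inserting spaces, can be sketched without either library. This is not the code from the linked answers: the `Chunk` class, the fixed character width and the rounding rule are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: renders one text line so that chunks keep their horizontal position. */
class LayoutLineSketch {

    /** A piece of text and the x coordinate where it starts on the page. */
    static final class Chunk {
        final double x;
        final String text;

        Chunk(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /** Pads with spaces so that each chunk starts at column round(x / charWidth). */
    static String renderLine(List<Chunk> chunks, double charWidth) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.x));
        StringBuilder line = new StringBuilder();
        for (Chunk c : sorted) {
            int column = (int) Math.round(c.x / charWidth);
            while (line.length() < column) {
                line.append(' ');
            }
            // If the previous text already reaches this column, the chunk is simply appended.
            line.append(c.text);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk(120, "Price"), new Chunk(0, "Name"), new Chunk(60, "Qty"));
        System.out.println(renderLine(chunks, 6));
    }
}
```

With output like this, values that belong to the same column start at the same character position in every line, which makes them easier to tell apart.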