Using LocationTextExtractionStrategy in iTextSharp to get text coordinates


It's very important to understand that PDFs have no support for tables. Anything that looks like a table is really just a bunch of text placed at specific locations over a background of lines. Keep this in mind as you work on this.

That said, you need to subclass a text extraction strategy (for example LocationTextExtractionStrategy) and pass an instance of it into PdfTextExtractor.GetTextFromPage(). See this post for a simple example of that. Then see this post for a more complex example of subclassing. The latter isn't completely relevant to your goal, but it does show some more complex things that you can do.
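As a rough illustration of the subclassing approach described above, here is a minimal sketch. It assumes iTextSharp 5.x, where LocationTextExtractionStrategy.RenderText can be overridden. The names PositionedText, PositionCapturingStrategy, and DumpPositions are hypothetical and are not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical record type (not part of iTextSharp): a run of text and its bounding box.
public class PositionedText
{
    public string Text;
    public float Left, Bottom, Right, Top;
}

// Assumes iTextSharp 5.x. Records the coordinates of every text run that the
// parser reports, in addition to the normal location-based text extraction.
public class PositionCapturingStrategy : LocationTextExtractionStrategy
{
    public readonly List<PositionedText> Chunks = new List<PositionedText>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        // Let the base class keep building its own text result.
        base.RenderText(renderInfo);

        // The descent line starts at the bottom-left of the run,
        // and the ascent line ends at its top-right.
        Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        Vector topRight = renderInfo.GetAscentLine().GetEndPoint();

        Chunks.Add(new PositionedText
        {
            Text = renderInfo.GetText(),
            Left = bottomLeft[Vector.I1],
            Bottom = bottomLeft[Vector.I2],
            Right = topRight[Vector.I1],
            Top = topRight[Vector.I2]
        });
    }
}

public static class PositionDemo
{
    // Hypothetical helper: prints each text run with its coordinates, page by page.
    public static void DumpPositions(string pdfPath)
    {
        var reader = new PdfReader(pdfPath);
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                var strategy = new PositionCapturingStrategy();
                PdfTextExtractor.GetTextFromPage(reader, i, strategy);

                foreach (PositionedText c in strategy.Chunks)
                {
                    Console.WriteLine("{0}: ({1}, {2})-({3}, {4})",
                        c.Text, c.Left, c.Bottom, c.Right, c.Top);
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

The coordinates are in PDF user space, so the origin is normally at the bottom-left of the page and Y grows upward. A run reported by the parser is not necessarily a whole word or table cell, so some grouping is usually still needed afterwards.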


Author: Vinay

Updated on June 04, 2022

Comments

  • Vinay
    Vinay almost 2 years

    My goal is to export data from a PDF, which may be laid out as a table, to an Excel file.

    Using LocationTextExtractionStrategy with iTextSharp, we can get the page content as a plain-text string, read left to right.

    How can I proceed so that, when calling

    PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy())

    the text retains its coordinates in the resulting string?

    For instance, if the first line in the PDF is right-aligned, then the resulting string should be padded with spaces so that the content stays right-aligned.

    Please suggest how I might achieve this.

  • Vinay
    Vinay over 12 years
    Thanks @Chris for the solution. I am going to subclass it.
  • Vinay
    Vinay over 12 years
    After subclassing it with TextChunk location = new TextChunk(info.GetText(), bottomleft, topRight, info.GetSingleSpaceWidth()); locationalResult.Add(location); and calling it as PdfTextExtractor.GetTextFromPage(reader, i, strategy), I am not getting the text in the desired manner. Can you help me find where I am going wrong?
  • Vinay
    Vinay over 12 years
    I could finally extract the text with positions from the PDF. Thanks for the help. Lately I have been trying to lay it out as a table in an Excel file, but so far I have not found a suitable DLL or solution for placing the content in the Excel file. I am thinking of creating and using an Excel template, but at present I have the text data in a DataView / DataTable with text and position information.
  • user1390375
    user1390375 almost 4 years
    EPPlus.dll and NPOI.dll (that's "npoi") are two DLLs that can read/write Excel .xlsx files. NPOI.dll can read/write Excel "BIFF" (.xls) files, too.
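
Once text and position information is available, as described in the comments above, one possible way to keep right-aligned or columnar text in place is to group chunks into lines by their Y coordinate and pad each chunk with spaces according to its X coordinate. The sketch below is plain C# with no iTextSharp dependency. Chunk, LayoutSketch, charWidth, and lineTolerance are all hypothetical names, and a fixed character width is an assumption that only approximates real PDF fonts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical input type: a piece of text and the position of its left edge.
public class Chunk
{
    public string Text;
    public float X;
    public float Y;
}

public static class LayoutSketch
{
    // Groups chunks into lines (top to bottom) and pads each chunk with spaces
    // so that it starts at the column closest to X / charWidth.
    public static List<string> ToPaddedLines(
        IEnumerable<Chunk> chunks, float charWidth, float lineTolerance)
    {
        var lines = new List<string>();
        var current = new List<Chunk>();

        // In PDF user space Y grows upward, so a larger Y is higher on the page.
        foreach (Chunk c in chunks.OrderByDescending(k => k.Y))
        {
            // Start a new line when this chunk is too far below the line's first chunk.
            if (current.Count > 0 && Math.Abs(current[0].Y - c.Y) > lineTolerance)
            {
                lines.Add(RenderLine(current, charWidth));
                current.Clear();
            }
            current.Add(c);
        }
        if (current.Count > 0)
        {
            lines.Add(RenderLine(current, charWidth));
        }
        return lines;
    }

    private static string RenderLine(List<Chunk> line, float charWidth)
    {
        var sb = new StringBuilder();
        foreach (Chunk c in line.OrderBy(k => k.X))
        {
            int column = (int)Math.Round(c.X / charWidth);
            // Pad only if the previous text has not already passed this column.
            if (sb.Length < column)
            {
                sb.Append(' ', column - sb.Length);
            }
            sb.Append(c.Text);
        }
        return sb.ToString();
    }
}
```

With charWidth set to 6 and lineTolerance set to 2, a chunk at X = 60 starts at column 10, so text that sits far to the right in the PDF keeps its leading spaces in the output string. The same line grouping could also serve as a starting point for deciding which row a chunk belongs to before writing it to a spreadsheet.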