Using LocationTextExtractionStrategy in iTextSharp for text coordinates
It's important to understand that PDFs have no support for tables. Anything that looks like a table is really just a bunch of text placed at specific locations over a background of lines. Keep this in mind as you work on this.
That said, you need to subclass TextExtractionStrategy and pass that into GetTextFromPage(). See this post for a simple example of that. Then see this post for a more complex example of subclassing. The latter isn't completely relevant to your goal but it does show some more complex things that you can do.
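As a rough illustration of the subclassing idea, here is a minimal sketch that assumes iTextSharp 5.x. It derives from LocationTextExtractionStrategy (one of the stock ITextExtractionStrategy implementations) and records each chunk's bounding box while the page is parsed. The names ChunkInfo, CoordinateCapturingStrategy and ExtractionDemo are made up for this example and are not part of the library.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical container: one piece of text plus its bounding box in PDF user-space units.
public class ChunkInfo
{
    public int Page;
    public string Text;
    public float Left, Bottom, Right, Top;
}

// Hypothetical strategy: keeps the normal text extraction and also remembers coordinates.
public class CoordinateCapturingStrategy : LocationTextExtractionStrategy
{
    public readonly List<ChunkInfo> Chunks = new List<ChunkInfo>();
    public int CurrentPage;

    public override void RenderText(TextRenderInfo renderInfo)
    {
        base.RenderText(renderInfo); // let the base class keep building its own text

        // Descent line start = bottom-left corner, ascent line end = top-right corner.
        Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        Vector topRight = renderInfo.GetAscentLine().GetEndPoint();

        Chunks.Add(new ChunkInfo
        {
            Page = CurrentPage,
            Text = renderInfo.GetText(),
            Left = bottomLeft[Vector.I1],
            Bottom = bottomLeft[Vector.I2],
            Right = topRight[Vector.I1],
            Top = topRight[Vector.I2]
        });
    }
}

public static class ExtractionDemo
{
    // Returns every chunk of every page, with coordinates.
    public static List<ChunkInfo> ExtractAll(string pdfPath)
    {
        var all = new List<ChunkInfo>();
        var reader = new PdfReader(pdfPath);
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++) // pages are 1-based
            {
                var strategy = new CoordinateCapturingStrategy { CurrentPage = i };
                PdfTextExtractor.GetTextFromPage(reader, i, strategy);
                all.AddRange(strategy.Chunks);
            }
        }
        finally
        {
            reader.Close();
        }
        return all;
    }
}
```

A fresh strategy instance is created per page so that coordinates from different pages do not mix. Coordinates are in PDF user space, where y grows upward from the bottom of the page.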
Vinay
Updated on June 04, 2022
Comments
-
Vinay almost 2 years
My goal is to export data from a PDF, which may be laid out as a table, to an Excel file.
Using LocationTextExtractionStrategy with iTextSharp, we can get the page content as plain text, read left to right.
How can I move forward so that, when calling PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy()), the text retains its coordinates in the resulting string?
For instance, if the first line in the PDF is right-aligned, the resulting string should contain enough padding spaces to keep that content right-aligned.
Please suggest how I can proceed to achieve this.
-
Vinay over 12 years
Thanks @Chris for the solution. I am going to subclass it.
-
Vinay over 12 years
After subclassing it as
TextChunk location = new TextChunk(info.GetText(), bottomleft, topRight, info.GetSingleSpaceWidth()); locationalResult.Add(location);
and calling it as PdfTextExtractor.GetTextFromPage(reader, i, strategy), I am not getting the text in the desired manner. Can you help me find where I am going wrong?
-
Vinay over 12 years
I could finally extract the text with positions from the PDF. Thanks for the help. These days I have been trying to arrange it in a table structure for an Excel file, but so far I have been unable to find a suitable DLL or solution that would help me place the content in the Excel file. I am thinking of creating and using an Excel template, but at present I have the text data in a DataView / DataTable with text and position information.
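For the step from "text plus position" to something table-like, one possible approach is to group chunks into rows by their baseline and then order each row left to right. The sketch below is plain C# with no PDF library. PositionedText, TableGuess, yTolerance and charWidth are hypothetical names and parameters chosen for illustration, and the tolerance values would need tuning per document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical record of one extracted chunk: its text and bottom-left position.
public class PositionedText
{
    public string Text;
    public float Left;
    public float Bottom;
}

public static class TableGuess
{
    // Groups chunks whose baselines lie within yTolerance of the row's first chunk.
    // Rows come out top of page first; chunks inside a row are ordered left to right.
    public static List<List<PositionedText>> GroupIntoRows(
        IEnumerable<PositionedText> chunks, float yTolerance)
    {
        var rows = new List<List<PositionedText>>();
        // PDF y grows upward, so descending Bottom means top of page first.
        foreach (var chunk in chunks.OrderByDescending(c => c.Bottom))
        {
            var last = rows.LastOrDefault();
            if (last != null && Math.Abs(last[0].Bottom - chunk.Bottom) <= yTolerance)
                last.Add(chunk);
            else
                rows.Add(new List<PositionedText> { chunk });
        }
        foreach (var row in rows)
            row.Sort((a, b) => a.Left.CompareTo(b.Left));
        return rows;
    }

    // Renders one row as text, padding with spaces so that each chunk starts near
    // the character column matching its x position (charWidth = page units per character).
    public static string RenderRow(List<PositionedText> row, float charWidth)
    {
        var sb = new StringBuilder();
        foreach (var chunk in row)
        {
            int column = (int)Math.Round(chunk.Left / charWidth);
            if (column > sb.Length)
                sb.Append(' ', column - sb.Length); // pad out to the target column
            else if (sb.Length > 0)
                sb.Append(' ');                     // overlapping chunk: keep one separator
            sb.Append(chunk.Text);
        }
        return sb.ToString();
    }
}
```

Each row returned by GroupIntoRows can then be treated as one spreadsheet row. Deciding which chunk belongs to which column (for example, by clustering the Left values across rows) is left out here because it depends heavily on the layout of the PDF.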
-
user1390375 almost 4 years
EPPlus.dll and NPOI.dll (that's "npoi") are two DLLs that can read/write Excel .xlsx files. NPOI.dll can read/write Excel "BIFF" (.xls) files, too.
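As a hedged sketch of the EPPlus route, the snippet below writes rows of strings into a new worksheet. It assumes the EPPlus 4.x API. Newer EPPlus releases additionally require a license setting to be configured before ExcelPackage is used, which is not shown here. ExcelExport, WriteRows and the sheet name "Extracted" are names invented for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using OfficeOpenXml;

public static class ExcelExport
{
    // Writes each inner sequence as one worksheet row, starting at cell A1.
    // Assumes the target file does not already contain a sheet named "Extracted".
    public static void WriteRows(string path, IEnumerable<IEnumerable<string>> rows)
    {
        using (var package = new ExcelPackage(new FileInfo(path)))
        {
            ExcelWorksheet sheet = package.Workbook.Worksheets.Add("Extracted");
            int r = 1; // EPPlus cell indexes are 1-based
            foreach (var row in rows)
            {
                int c = 1;
                foreach (var cell in row)
                {
                    sheet.Cells[r, c].Value = cell;
                    c++;
                }
                r++;
            }
            package.Save();
        }
    }
}
```

A call such as ExcelExport.WriteRows("out.xlsx", table), where table is a List<List<string>> built from the positioned text, would produce one spreadsheet row per extracted row.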