Reading PDF content with itextsharp dll in VB.NET or C#

Solution 1

using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text does not carry over between pages.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

                // GetTextFromPage already returns a .NET string. Round-tripping it through
                // Encoding.Default would replace characters outside the ANSI code page with '?'.
                text.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        finally
        {
            // Release the file handle even if extraction throws.
            pdfReader.Close();
        }
    }
    return text.ToString();
}

Solution 2

LGPL / FOSS iTextSharp 4.x

var pdfReader = new PdfReader(path); // a stream or byte array also works
byte[] pageContent = pdfReader.GetPageContent(pageNum); // page numbers start at 1, not 0
byte[] utf8 = Encoding.Convert(Encoding.Default, Encoding.UTF8, pageContent);
string textFromPage = Encoding.UTF8.GetString(utf8);

None of the other answers were useful to me; they all seem to target the AGPL version 5 of iTextSharp. I could never find any reference to SimpleTextExtractionStrategy or LocationTextExtractionStrategy in the FOSS version.

Something else that might be very useful in conjunction with this:

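using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Matches a literal string followed directly by the Tj (show text) operator.
const string PdfTableFormat = @"\(.*\)Tj";
Regex PdfTableRegex = new Regex(PdfTableFormat, RegexOptions.Compiled);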

List<string> ExtractPdfContent(string rawPdfContent)
{
    var matches = PdfTableRegex.Matches(rawPdfContent);

    var list = matches.Cast<Match>()
        .Select(m => m.Value
            .Substring(1) //remove leading (
            .Remove(m.Value.Length - 4) //remove trailing )Tj
            .Replace(@"\)", ")") //unencode parens
            .Replace(@"\(", "(")
            .Trim()
        )
        .ToList();
    return list;
}

This extracts the text-only data from the PDF. If the displayed text is Foo(bar), it is encoded in the PDF as (Foo\(bar\))Tj, and this method returns Foo(bar) as expected. It strips out additional information, such as location coordinates, from the raw PDF content.
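
The pattern above is greedy, so if a content stream puts several Tj operators on one line, a single match can run from the first ( to the last )Tj. Below is a minimal sketch of a stricter variant. It assumes the same simple PDFs (literal strings drawn with Tj only) and does not handle unescaped nested parentheses, octal escapes, or other text operators. TjTextExtractor and ExtractLiterals are hypothetical names, and the code uses only System.Text.RegularExpressions, not iTextSharp.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TjTextExtractor // hypothetical helper, not part of iTextSharp
{
    // '(' then any run of escaped characters (\x) or characters other than
    // '\' and ')', then ')', optional whitespace, and the Tj operator.
    private static readonly Regex TjLiteral =
        new Regex(@"\(((?:\\.|[^\\)])*)\)\s*Tj", RegexOptions.Compiled);

    // Turns \( \) and \\ back into ( ) and \ in a single pass.
    private static readonly Regex EscapedChar =
        new Regex(@"\\([()\\])", RegexOptions.Compiled);

    public static List<string> ExtractLiterals(string rawPdfContent)
    {
        var result = new List<string>();
        foreach (Match m in TjLiteral.Matches(rawPdfContent))
        {
            // Group 1 is the text between the outer parentheses.
            result.Add(EscapedChar.Replace(m.Groups[1].Value, "$1"));
        }
        return result;
    }
}
```

For the input BT (Foo\(bar\)) Tj (Baz)Tj ET, ExtractLiterals returns two entries, Foo(bar) and Baz, instead of one match spanning both strings.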

Solution 3

Here is a VB.NET solution based on ShravankumarKumar's solution.

This will ONLY give you the text. The images are a different story.

Public Shared Function GetTextFromPDF(PdfFileName As String) As String
    Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)
    Dim sOut As New System.Text.StringBuilder()

    Try
        ' Page numbers start at 1.
        For i = 1 To oReader.NumberOfPages
            Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy

            sOut.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its))
        Next
    Finally
        ' Release the file handle even if extraction fails.
        oReader.Close()
    End Try

    Return sOut.ToString()
End Function

Solution 4

In my case, I just wanted the text from a specific area of the PDF document so I used a rectangle around the area and extracted the text from it. In the sample below the coordinates are for the entire page. I don't have PDF authoring tools so when it came time to narrow down the rectangle to the specific location I took a few guesses at the coordinates until the area was found.

// Entire US Letter page. The PDF coordinate system has 0,0 at the bottom-left corner; 72 points per inch.
Rectangle _pdfRect = new Rectangle(0f, 0f, 612f, 792f);
RenderFilter _filter = new RegionTextRenderFilter(_pdfRect);
ITextExtractionStrategy _strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), _filter);
string _text = PdfTextExtractor.GetTextFromPage(_pdfReader, 1, _strategy); // _pdfReader is an already opened PdfReader

As noted by the above comments the resulting text doesn't maintain any of the formatting found in the PDF document, however, I was happy that it did preserve the carriage returns. In my case, there were enough constants in the text that I was able to extract the values that I required.
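
Guessing coordinates gets easier if the region is first measured in inches from the top-left corner, as with a ruler on a printout. Below is a minimal sketch of that conversion. It assumes a US Letter page (792 points high, as in the sample above) with no rotation or crop-box offset. PdfRegion and FromInches are hypothetical names. The four returned values are in the order the Rectangle constructor takes them: lower-left x, lower-left y, upper-right x, upper-right y.

```csharp
public static class PdfRegion // hypothetical helper, not part of iTextSharp
{
    private const float PointsPerInch = 72f;

    // Converts a box given in inches from the top-left corner of the page
    // into PDF points with the origin at the bottom-left corner.
    // Returns { llx, lly, urx, ury }.
    public static float[] FromInches(float left, float top, float width, float height,
                                     float pageHeightPoints = 792f)
    {
        float llx = left * PointsPerInch;
        float urx = (left + width) * PointsPerInch;
        // PDF y grows upward, so distances from the top are subtracted from the page height.
        float ury = pageHeightPoints - top * PointsPerInch;
        float lly = pageHeightPoints - (top + height) * PointsPerInch;
        return new[] { llx, lly, urx, ury };
    }
}
```

For example, a box 3 inches wide and half an inch tall that starts 1 inch from the left and 1 inch from the top gives FromInches(1f, 1f, 3f, 0.5f), which returns 72, 684, 288, 720.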

Solution 5

Here is an improved version of ShravankumarKumar's answer. I created dedicated classes for pages and rows, so you can access words in the PDF by text row and by position within that row.

using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

//create a list of pdf pages
var pages = new List<PdfPage>();

//load the pdf into the reader. NOTE: path can also be replaced with a byte array
using (PdfReader reader = new PdfReader(path))
{
    //loop all the pages and extract the text
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        pages.Add(new PdfPage()
        {
           content = PdfTextExtractor.GetTextFromPage(reader, i)
        });
    }
}

//use linq to create the rows and words by splitting on newline and space
pages.ForEach(x => x.rows = x.content.Split('\n').Select(y => 
    new PdfRow() { 
       content = y,
       words = y.Split(' ').ToList()
    }
).ToList());

The custom classes

class PdfPage
{
    public string content { get; set; }
    public List<PdfRow> rows { get; set; }
}


class PdfRow
{
    public string content { get; set; }
    public List<string> words { get; set; }
}

Now you can get a word by row and word index.

string myWord = pages[0].rows[12].words[4];

Or use Linq to find the rows containing a specific word.

//find the rows in a specific page containing a word
var myRows = pages[0].rows.Where(x => x.words.Any(y => y == "myWord1")).ToList();

//find the rows in all pages containing a word (named differently so both queries compile in one scope)
var myRowsAllPages = pages.SelectMany(r => r.rows).Where(x => x.words.Any(y => y == "myWord2")).ToList();

Author: user221185

Updated on July 05, 2022

Comments

  • user221185
    user221185 almost 2 years

    How can I read PDF content with iTextSharp using the PdfReader class? My PDF may include plain text or images of text.

    • Peter Huber
      Peter Huber over 3 years
      iTextSharp is now called "iText 7 for .NET" or "itext7-dotnet" on GitHub: link. It's recommended to add itext7 to your solution with NuGet.
  • Carter Medlin
    Carter Medlin over 12 years
    This should be marked as the solution! This works great for me.
  • Avi
    Avi over 12 years
    When I try this on my PDF, it gives me the error message, "Value cannot be null. Parameter name: value". Any idea what this is about?
  • Avi
    Avi over 12 years
    sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its). Also, I figured something out about this error. If I take it out of the loop and parse the individual pages, it works on one page and not the other. The only difference between the two that I can tell is that the problematic page has images on it (which I don't need).
  • Avi
    Avi over 12 years
    If you'd like to have a look at the PDF, I can send it to you.
  • Carter Medlin
    Carter Medlin over 12 years
    I'm using .Net 4.0 and itextsharp 5.1.2.0 (Just downloaded). Same with you?
  • Avi
    Avi over 12 years
    .Net 3.5 and itextsharp 5.1.1. I'll update and see if it's resolved.
  • Th 00 mÄ s
    Th 00 mÄ s over 11 years
    Any particular reason the pdfReader.Close(); happens inside the for loop?
  • Sebastian
    Sebastian almost 11 years
    why using .Close() at all and not using (var pdfReader = ...) {}
  • Sebastian
    Sebastian almost 11 years
    Also, ASCIIEncoding.Convert should be Encoding.Convert as it is a static method
  • mkl
    mkl over 9 years
    You are right, before 5.x.x text extraction was present in iText merely as proof-of-concept and in iTextSharp not at all. That being said, the code you present only works in very primitively built PDFs (using fonts with an ASCII'ish encoding and Tj as only text drawing operator). It may be usable in very controlled environments (in which you can ensure to only get such primitive PDFs) but not in general.
  • AaA
    AaA almost 5 years
    Question is asking to read a PDF file, your answer is creating one!
  • Vikas Lalwani
    Vikas Lalwani almost 4 years
    If anyone needs code similar to the above, with a step-by-step implementation for reading PDF text in C#, here is the link: qawithexperts.com/article/c-sharp/… Thanks.
  • Diego
    Diego over 3 years
    The correct Regex expression is: (?<=\()(.*?)(?=\) Tj)