iTextSharp PdfTextExtractor GetTextFromPage Throwing NullReferenceException


To summarize what has been found out in the comments to the question...

In short

The PDF the OP first used is invalid: it is missing required objects that the parser needs.

Since he finally got hold of a valid version, he is now able to parse it successfully.

In detail

Depending on the time and mode of the request, the web site the PDFs in question were requested from returned different versions of the same document: sometimes complete, sometimes incomplete to the point of being invalid.

The test file was stockQuotes_03232015.pdf, i.e. the PDF containing the data generated on the test day.

The complete file can already be recognized by its size: in my downloads it is 250933 bytes long, while my incomplete file is 81062 bytes long.

Inspecting the files, it looks like the incomplete file was derived from the complete one by some tool which removed duplicate image streams but forgot to replace the references to the removed streams with references to the retained stream object.
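Such dangling references can be looked for before any text extraction is attempted. The following is only a minimal sketch, assuming iTextSharp 5.x and assuming that an unresolvable reference comes back as null from `GetDirectObject` (which is what the NullReferenceException in `DisplayXObject` suggests). `PdfSanityCheck` and `FindDanglingXObjects` are hypothetical names, not part of iTextSharp, and the check only covers XObjects listed directly in each page's resources, not those nested inside form XObjects.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

static class PdfSanityCheck
{
    // Hypothetical helper (not part of iTextSharp): lists the XObject
    // resource entries of each page that do not resolve to a stream.
    public static List<string> FindDanglingXObjects(PdfReader reader)
    {
        var problems = new List<string>();

        // Page numbers in iTextSharp are 1-based.
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            PdfDictionary pageDict = reader.GetPageN(page);
            PdfDictionary resources = pageDict.GetAsDict(PdfName.RESOURCES);
            PdfDictionary xobjects =
                resources == null ? null : resources.GetAsDict(PdfName.XOBJECT);
            if (xobjects == null)
            {
                continue; // this page uses no XObjects
            }

            foreach (PdfName name in xobjects.Keys)
            {
                // GetDirectObject follows an indirect reference to its target.
                PdfObject target = xobjects.GetDirectObject(name);
                if (target == null || !target.IsStream())
                {
                    problems.Add("page " + page + ": " + name);
                }
            }
        }

        return problems;
    }
}
```

If the returned list is not empty, the file is probably one of the incomplete versions, and requesting the document again is likely a better option than trying to parse it.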

Author by

heyou

Updated on June 04, 2022

Comments

  • heyou

    I am using iTextSharp to read PDF documents, but lately I'm getting a

    {"Object reference not set to an instance of an object."}

    or NullReferenceException when getting the text from a page of the PdfReader. It used to work, but as of today it no longer does. I didn't change my code.

    Below is my code:

    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
        string currentText = PdfTextExtractor.GetTextFromPage(reader, i, its);
        if (currentText.Contains("ADVANCES"))
        {
            return i;
        }
    }

    return 0;
    

    The above code throws a NullReferenceException, even though reader is not null and i obviously cannot be null, being an int.

    I am instantiating the PdfReader from the input stream:

    PdfReader reader = new PdfReader(_stream);
    

    Below is the stack trace:

      at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayXObject(PdfName xobjectName)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
       at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
       at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
       at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
    

    To keep things simple, I created a console application that just reads all the text from the PDF file and displays it. Below is the code. The result is the same as above: it throws a NullReferenceException.

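    using System;
    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine(ExtractTextFromPdf(@"stockQuotes_03232015.pdf"));
        }

        // Concatenates the extracted text of every page of the PDF at the given path.
        public static string ExtractTextFromPdf(string path)
        {
            using (PdfReader reader = new PdfReader(path))
            {
                StringBuilder text = new StringBuilder();

                // Page numbers in iTextSharp are 1-based.
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
                }

                return text.ToString();
            }
        }
    }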
    

    Does anyone know what might be going on here or how I might work around it?