Reading PDF files line by line using Python

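import re
import PyPDF2

# Written against the PyPDF2 1.x API (PdfFileReader, numPages, getPage, extractText).
with open('E://drive-download-20171015T225604Z-001/test_case/test2/try/xyz.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num = pdfReader.numPages
    print("Number of pages: " + str(num))
    for i in range(num):
        pageObj = pdfReader.getPage(i)
        # Lowercase the page text so the search is case-insensitive.
        text = pageObj.extractText().lower()
        # splitlines() yields whole lines; iterating the string directly yields single characters.
        for line in text.splitlines():
            if re.search("abc", line):
                print(line)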

I use this to iterate through the PDF page by page, search each line for key terms, and process the matching lines further.
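The per-line search can also be separated from the PDF extraction, which makes it easy to try out on plain strings. The following is a minimal sketch, not part of the original code: the helper name `find_matching_lines` and the sample page text are hypothetical, and it assumes each page's text has already been extracted into a string.

```python
import re


def find_matching_lines(page_texts, pattern):
    """Yield (page_number, line) for every line that matches pattern.

    page_texts is any iterable of strings, one string per page,
    for example the text extracted from each page of a PDF.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for page_number, text in enumerate(page_texts, start=1):
        # splitlines() breaks the page text into whole lines.
        for line in text.splitlines():
            if regex.search(line):
                yield page_number, line.strip()


# Hypothetical page text standing in for real extracted text.
sample_pages = ["Intro\nThe ABC method\n", "no match here\nabc again"]
for page_number, line in find_matching_lines(sample_pages, "abc"):
    print(page_number, line)
```

Using `re.IGNORECASE` here replaces the explicit `lower()` call, so the matched lines keep their original casing.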

Author by

Rahul Pipalia

Updated on June 24, 2022

Comments

  • Rahul Pipalia almost 2 years

    I used the following code to read the PDF file, but it does not extract any text. What could be the reason?

    >>> import os
    >>> from PyPDF2 import PdfFileReader, PdfFileWriter
    >>> path = "/Users/Rahul/Desktop/Dfiles/"
    >>> dirs = os.listdir( path )
    >>> directory = "/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf"
    >>> f = open(directory, 'rb')
    >>> reader = PdfFileReader(f)
    >>> contents = reader.getPage(0).extractText().split('\n')
    >>> f.close()
    >>> print contents
    

    The output is [u''] instead of the page's content.

  • Kickaha over 4 years
    It's overly complex to show a directory walk. Answer the question asked as well.