Reading PDF files line by line using Python

python pypdf

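import re
import PyPDF2

# Legacy PyPDF2 (< 3.0) API, as in the original post.
# Open the PDF in binary mode; the with-block closes the file when done.
with open('E://drive-download-20171015T225604Z-001/test_case/test2/try/xyz.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num = pdfReader.numPages
    print("Number of pages: " + str(num))
    for i in range(num):
        pageObj = pdfReader.getPage(i)
        # Lowercase the page text so the search is case-insensitive.
        text = pageObj.extractText().lower()
        # Split into lines; iterating over the string itself yields single characters.
        for line in text.splitlines():
            if re.search("abc", line):
                print(line)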

I use this to go through the PDF page by page, search each page's text for key terms, and process the matching lines further.
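The per-line search can be pulled out into a small helper that works on any string, which makes it easy to try without a PDF at hand. This is only a sketch: `find_matching_lines` and `sample_page` are hypothetical names, and the sample string stands in for whatever `extractText()` returns for a page.

```python
import re

def find_matching_lines(page_text, term):
    """Return the lines of page_text that contain term, ignoring case."""
    # re.escape treats the term literally, so characters like '.' are not wildcards.
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return [line for line in page_text.splitlines() if pattern.search(line)]

# Hypothetical sample standing in for the text extracted from one page.
sample_page = "Invoice ABC-17\nTotal due: 40\nRef: abc/2\n"
print(find_matching_lines(sample_page, "abc"))
```

Compiling the pattern with `re.IGNORECASE` keeps the original casing in the returned lines, where lowercasing the whole page first would lose it.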

Author by

Rahul Pipalia

Updated on June 24, 2022

Comments

Rahul Pipalia almost 2 years

I used the following code to read a PDF file, but it does not extract any text. What could be the reason?

>>> import os
>>> from PyPDF2 import PdfFileReader, PdfFileWriter
>>> path = "/Users/Rahul/Desktop/Dfiles/"
>>> dirs = os.listdir(path)
>>> directory = "/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf"
>>> f = open(directory, 'rb')
>>> reader = PdfFileReader(f)
>>> contents = reader.getPage(0).extractText().split('\n')
>>> f.close()
>>> print contents

The output is [u''] instead of the page's content.
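One possible cause, offered as an assumption since the file itself is not available: the page has no extractable text layer (for example, it is a scanned image), in which case extraction returns an empty string. A quick way to narrow this down is to check every page rather than only page 0. The sketch below uses a hypothetical helper, `pages_without_text`, and a made-up list of extraction results.

```python
def pages_without_text(page_texts):
    """Return the 0-based indexes of pages whose extracted text is empty or whitespace."""
    return [i for i, text in enumerate(page_texts) if not text.strip()]

# Hypothetical extraction results. With the reader from the question, the list
# could be built as [reader.getPage(i).extractText() for i in range(reader.numPages)].
page_texts = ["", "  \n", "Some real text"]
print(pages_without_text(page_texts))
```

If every page comes back empty, the file is likely image-only and would need OCR rather than text extraction. If only some pages do, the reading code itself is probably fine.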