Reading PDF files line by line using Python
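import re
import PyPDF2

# Legacy PyPDF2 (< 3.0) API; newer releases rename these calls to
# PdfReader, len(reader.pages), reader.pages[i] and extract_text().
with open('E://drive-download-20171015T225604Z-001/test_case/test2/try/xyz.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num = pdfReader.numPages
    print("Number of pages:-" + str(num))
    for i in range(num):
        pageObj = pdfReader.getPage(i)
        text = pageObj.extractText().lower()
        # Split into lines; iterating over the string itself yields single characters.
        for line in text.splitlines():
            if re.search("abc", line):
                print(line)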
I use it to iterate through a PDF page by page, search each line for key terms, and process the matches further.
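The search step can be separated from the PDF handling so it is easier to reuse. Here is a minimal sketch, assuming the text of each page has already been extracted into a list of strings. The helper name find_matching_lines and the sample strings are hypothetical, and unlike the snippet above it matches case-insensitively with re.IGNORECASE rather than lowercasing the text, so matched lines keep their original case.

```python
import re

def find_matching_lines(page_texts, pattern):
    """Return (page_number, line) pairs for every line that matches pattern.

    page_texts is a list of strings, one per page (hypothetical input,
    e.g. collected from the extraction loop). Page numbers start at 1.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    matches = []
    for page_number, text in enumerate(page_texts, start=1):
        # splitlines() yields whole lines rather than single characters.
        for line in text.splitlines():
            if regex.search(line):
                matches.append((page_number, line.strip()))
    return matches

# Stand-in strings for text already extracted from each page.
sample_pages = ["Intro\nThe ABC term appears here", "Nothing relevant\nabc again"]
for page_number, line in find_matching_lines(sample_pages, "abc"):
    print(page_number, line)
```

Returning the page number with each line makes it possible to go back to the source page during further processing.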
Author: Rahul Pipalia
Updated on June 24, 2022
Comments
-
Rahul Pipalia almost 2 years
I used the following code to read a PDF file, but it does not extract any text. What could be the reason?
>>> import os
>>> from PyPDF2 import PdfFileReader, PdfFileWriter
>>> path = "/Users/Rahul/Desktop/Dfiles/"
>>> dirs = os.listdir( path )
>>> directory = "/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf"
>>> f = open(directory, 'rb')
>>> reader = PdfFileReader(f)
>>> contents = reader.getPage(0).extractText().split('\n')
>>> f.close()
>>> print contents
The output is [u''] instead of the page content.
-
Kickaha over 4 years
It's overly complex to show a directory walk. Answer the question asked as well.