Finding on which page a search string is located in a PDF document using Python
Solution 1
PyPDF2
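# import packages
import PyPDF2
import re

# open the pdf file
# (PdfFileReader, getNumPages, getPage and extractText are the older PyPDF2
# names; they were deprecated in later releases and removed in PyPDF2 3.0 in
# favor of PdfReader, len(reader.pages), reader.pages[i] and extract_text())
reader = PyPDF2.PdfFileReader(r"source_file_path")

# get number of pages
NumPages = reader.getNumPages()

# define keyterms (interpreted as a regular expression by re.search)
String = "P4F-21B"

# extract text and do the search
for i in range(0, NumPages):
    PageObj = reader.getPage(i)
    Text = PageObj.extractText()
    ResSearch = re.search(String, Text)
    if ResSearch is not None:
        print(ResSearch)
        print("Page Number" + str(i+1))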
Output:
<re.Match object; span=(57, 64), match='P4F-21B'>
Page Number1
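One caveat with the snippet above: re.search treats the search string as a regular expression. "P4F-21B" contains no special characters, so it behaves like a literal, but a term such as "1.5" would also match "105". Below is a minimal sketch of a literal search, assuming the page texts have already been extracted into a list of strings. The helper name pages_containing and the sample data are hypothetical, not part of PyPDF2.

```python
import re

def pages_containing(page_texts, needle, ignore_case=False):
    """Return the 1-based page numbers whose text contains `needle` literally."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape makes characters such as "." or "(" match themselves only
    pattern = re.compile(re.escape(needle), flags)
    return [number
            for number, text in enumerate(page_texts, start=1)
            if pattern.search(text or "")]  # extraction may yield None or ""

# Hypothetical page texts standing in for the output of the PDF library
sample_pages = ["Revision 1.5 notes", "Order 105 units", "see P4F-21B"]

print(pages_containing(sample_pages, "1.5"))  # prints [1]
```

With a plain re.search("1.5", text), page 2 would be reported as well, because "." matches any character.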
PyMuPDF
import fitz  # PyMuPDF
import re

# load document
doc = fitz.open(r"C:\Users\shraddha.shetty\Desktop\OCR-pages-deleted.pdf")

# define keyterms
String = "P4F-21B"

# get text, search for string and print count on page
for page in doc:
    text = page.get_text()
    count = len(re.findall(String, text))
    if count > 0:
        print(f'count on page {page.number + 1} is: {count}')
Solution 2
I finally figured out that pyPDF can help. I am posting it in case it can help somebody else.
(1) a function to locate the string
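def fnPDF_FindText(xFile, xString):
    # Python 2 code: file() is the Python 2 built-in for opening files
    # xFile : the PDF file in which to look
    # xString : the string to look for (a regular expression; the page text is
    #           lowercased before the search, so pass it in lowercase)
    # returns the 0-based index of the first matching page, or -1 if none matches
    import pyPdf, re
    PageFound = -1
    pdfDoc = pyPdf.PdfFileReader(file(xFile, "rb"))
    for i in range(0, pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText() + "\n"
        content1 = content.encode('ascii', 'ignore').lower()
        ResSearch = re.search(xString, content1)
        if ResSearch is not None:
            PageFound = i
            break
    return PageFound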
(2) a function to extract the pages of interest
def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
    # copies pages xPageStart .. xPageEnd - 1 (0-based, end exclusive) into a new PDF
    from pyPdf import PdfFileReader, PdfFileWriter
    output = PdfFileWriter()
    pdfOne = PdfFileReader(file(xFileNameOriginal, "rb"))
    for i in range(xPageStart, xPageEnd):
        output.addPage(pdfOne.getPage(i))
    outputStream = file(xFileNameOutput, "wb")
    output.write(outputStream)
    outputStream.close()
I hope this will be helpful to somebody else
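For the case in the question (pages between a "Begin" string and an "End" string), the page range can be worked out from the per-page text before any pages are copied. The following is only a sketch under that assumption: it takes a list of already extracted page texts, and the name find_page_range and the sample data are hypothetical.

```python
def find_page_range(page_texts, begin, end):
    """Return (start, stop) as 0-based page indexes, or None if not found.

    start is the first page containing `begin`; stop is the first page at or
    after start that contains `end`.
    """
    start = None
    for index, text in enumerate(page_texts):
        text = text or ""  # extraction may yield None or ""
        if start is None and begin in text:
            start = index
        if start is not None and end in text:
            return start, index
    return None

# Hypothetical page texts standing in for the output of extractText()
pages = ["cover", "Begin of section", "body", "the End", "appendix"]
print(find_page_range(pages, "Begin", "End"))  # prints (1, 3)
```

Because fnPDF_ExtractPages loops over range(xPageStart, xPageEnd), the end index is exclusive, so the result would be passed on as fnPDF_ExtractPages(source, target, start, stop + 1) to include the page that contains "End".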
Solution 3
In addition to what @user1043144 mentioned, to use this with Python 3.x:
Use PyPDF2:
import PyPDF2
Use open instead of file:
PdfFileReader(open(xFile, 'rb'))
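One more thing changes under Python 3: content.encode('ascii', 'ignore') returns bytes, and re.search with a str pattern on bytes raises the TypeError reported in the comments. A minimal sketch of one way around it, decoding back to str right after dropping the non-ASCII characters, is shown here. The helper name normalize and the sample text are hypothetical.

```python
import re

def normalize(page_text):
    """Drop non-ASCII characters and lowercase, returning str rather than bytes."""
    return page_text.encode("ascii", "ignore").decode("ascii").lower()

# Hypothetical stand-in for the text of one extracted page
page_text = "Café menu, part P4F-21B"

try:
    # str pattern against bytes: this is the failing combination in Python 3
    re.search("p4f-21b", page_text.encode("ascii", "ignore").lower())
except TypeError as error:
    print("TypeError:", error)

# With the decoded text the same pattern works
match = re.search("p4f-21b", normalize(page_text))
print(match is not None)  # prints True
```

Decoding with "utf-8", as suggested in the comments, gives the same result here, because the bytes contain only ASCII characters at that point.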
user1043144
Updated on June 16, 2022
Comments
-
user1043144 almost 2 years
Which Python packages can I use to find out on which page a specific “search string” is located?
I looked into several Python PDF packages but couldn't figure out which one I should use. PyPDF does not seem to have this functionality, and PDFMiner seems to be overkill for such a simple task. Any advice?
More precisely: I have several PDF documents and I would like to extract the pages that lie between a string “Begin” and a string “End”.
-
Scott B about 8 years
Thanks, this was helpful!
-
suchislife over 6 years
Hello experts, I know it's been a long time, but how could I modify this code to extract the PDF pages containing a certain string and create a new document from them?
-
Aakash Basu almost 5 years
I get this error when I try to search for a string using the above code with PyPDF2 and open instead of file in Python 3.6. Error: TypeError: cannot use a string pattern on a bytes-like object
-
Aakash Basu almost 5 years
I had to add one more line to convert bytes to a string explicitly, and it worked: content2 = content1.decode("utf-8")
-
mkl over 2 years