Finding on which page a search string is located in a PDF document using Python

python pdf pypdf

14,599

Solution 1

Finding on which page a search string is located in a pdf document using python

PyPDF2

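    # import packages
    import PyPDF2
    import re

    # open the pdf file
    # (PdfFileReader, getNumPages, getPage and extractText are the older PyPDF2
    # names; later releases deprecate them in favour of PdfReader,
    # len(reader.pages), reader.pages[i] and extract_text())
    reader = PyPDF2.PdfFileReader(r"source_file_path")

    # get number of pages
    NumPages = reader.getNumPages()

    # define keyterms
    String = "P4F-21B"

    # extract text and do the search
    for i in range(0, NumPages):
        PageObj = reader.getPage(i)
        Text = PageObj.extractText()
        # re.escape keeps the search literal if String contains regex metacharacters
        ResSearch = re.search(re.escape(String), Text)
        if ResSearch is not None:
            print(ResSearch)
            print("Page Number" + str(i+1))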

Output:

<re.Match object; span=(57, 64), match='P4F-21B'>
Page Number1

PyMuPDF

import fitz
import re

# load document
doc = fitz.open(r"C:\Users\shraddha.shetty\Desktop\OCR-pages-deleted.pdf")

# define keyterms
String = "P4F-21B"

# get text, search for string and print count on page.
for page in doc:
    text = page.get_text()
    # re.escape keeps the search literal; count once and reuse the result
    count = len(re.findall(re.escape(String), text))
    if count > 0:
        print(f'count on page {page.number + 1} is: {count}')
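
Both snippets above tie the search loop to one PDF library. As a minimal, library-agnostic sketch, the search itself can be factored out so that it works on any list of per-page text, however that text was extracted. The helper name `find_pages` and the sample page strings are hypothetical and not part of PyPDF2 or PyMuPDF.

```python
import re

def find_pages(page_texts, needle):
    """Return {1-based page number: occurrence count} for pages containing needle."""
    # re.escape makes the match literal, so "C++" or "a.b" are safe to search for
    pattern = re.compile(re.escape(needle))
    hits = {}
    for number, text in enumerate(page_texts, start=1):
        count = len(pattern.findall(text))
        if count:
            hits[number] = count
    return hits

# Hypothetical page texts standing in for extractText() / get_text() output
pages = ["intro P4F-21B and P4F-21B again", "nothing here", "P4F-21B"]
print(find_pages(pages, "P4F-21B"))  # prints {1: 2, 3: 1}
```

With either library, the list could be built by collecting the extracted text of each page before calling the helper.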

Solution 2

I finally figured out that pyPDF can help. I am posting it in case it can help somebody else.

(1) a function to locate the string

def fnPDF_FindText(xFile, xString):
    # xFile : the PDF file in which to look
    # xString : the string to look for (pass it in lowercase, because the
    #           page text is lowercased before the search)
    # returns the 0-based index of the first matching page, or -1 if none
    # note: Python 2 code (legacy pyPdf package and the file() builtin)
    import pyPdf, re
    PageFound = -1
    pdfDoc = pyPdf.PdfFileReader(file(xFile, "rb"))
    for i in range(0, pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText() + "\n"
        content1 = content.encode('ascii', 'ignore').lower()
        ResSearch = re.search(xString, content1)
        if ResSearch is not None:
            PageFound = i
            break
    return PageFound

(2) a function to extract the pages of interest

def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
    # copies pages xPageStart .. xPageEnd - 1 (0-based) into a new PDF
    from pyPdf import PdfFileReader, PdfFileWriter
    output = PdfFileWriter()
    pdfOne = PdfFileReader(file(xFileNameOriginal, "rb"))
    for i in range(xPageStart, xPageEnd):
        output.addPage(pdfOne.getPage(i))
    # write the output file once, after all pages have been added
    outputStream = file(xFileNameOutput, "wb")
    output.write(outputStream)
    outputStream.close()

I hope this will be helpful to somebody else
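
The question asked for the pages between a "Begin" string and an "End" string. As a rough sketch of how that page range could be worked out, assuming the per-page text has already been extracted into a list, the helper below returns 0-based indices with an exclusive stop. The name `find_page_range` and the sample pages are hypothetical.

```python
def find_page_range(page_texts, begin, end):
    """Return (start, stop) as 0-based page indices, stop exclusive.

    The range runs from the first page containing `begin` through the first
    page at or after it containing `end`. Returns None if either is missing.
    """
    start = None
    for index, text in enumerate(page_texts):
        if start is None and begin in text:
            start = index
        if start is not None and end in text:
            return (start, index + 1)
    return None

# Hypothetical page texts for illustration
pages = ["cover", "Begin of section", "middle", "The End", "appendix"]
print(find_page_range(pages, "Begin", "End"))  # prints (1, 4)
```

Because the stop index is exclusive, the pair lines up with the `range(xPageStart, xPageEnd)` loop in fnPDF_ExtractPages, so the page holding "End" is included in the output.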

Solution 3

In addition to what @user1043144 mentioned, to use this with Python 3.x:

Use PyPDF2

import PyPDF2

Use open instead of file

PdfFileReader(open(xFile, 'rb'))
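
One more difference appears when porting Solution 2 to Python 3: `str.encode` returns `bytes`, and `re.search` raises `TypeError: cannot use a string pattern on a bytes-like object` when a `str` pattern is applied to it. A minimal sketch of one way around this is to decode back to `str` after dropping the non-ASCII characters. The helper names `normalize` and `page_contains` are hypothetical.

```python
import re

def normalize(text):
    """Drop non-ASCII characters and lowercase, returning str rather than bytes."""
    return text.encode("ascii", "ignore").decode("ascii").lower()

def page_contains(text, needle):
    """Case-insensitive literal search on normalized page text."""
    return re.search(re.escape(normalize(needle)), normalize(text)) is not None

print(page_contains("Chapter 1: BEGIN here", "Begin"))  # prints True
```

Normalizing the search string the same way as the page text also avoids the case mismatch that arises when only the page text is lowercased.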

Author by

user1043144

Updated on June 16, 2022

Comments

user1043144 almost 2 years

Which Python packages can I use to find out on which page a specific “search string” is located?

I looked into several Python PDF packages but couldn't figure out which one I should use. PyPDF does not seem to have this functionality, and PDFMiner seems to be overkill for such a simple task. Any advice?

More precisely: I have several PDF documents and I would like to extract the pages that are between a string “Begin” and a string “End”.
Scott B about 8 years

Thanks, this was helpful!
suchislife over 6 years

Hello Experts, I know it's been a long time but, how could I modify this code to Extract PDF Pages Containing a certain string and creating a new document of them?
Aakash Basu almost 5 years

I get this error when I try to search a string using the above code and using PyPDF2 and open instead of file in python 3.6. Error: TypeError: cannot use a string pattern on a bytes-like object
Aakash Basu almost 5 years

I had to add one more line to convert byte to string explicitly and it worked. content2 = content1.decode("utf-8")
mkl over 2 years

Near answer duplicate of this and this.