Searching text in a PDF using Python?


Solution 1

This is called PDF mining, and is very hard because:

  • PDF is a document format designed to be printed, not to be parsed. Inside a PDF document, text is in no particular order (unless order is important for printing); most of the time the original text structure is lost (letters may not be grouped as words, words may not be grouped into sentences, and the order in which they are placed on the page is often random).
  • There are tons of programs that generate PDFs, and many of them are defective.

Tools like PDFMiner use heuristics to group letters and words again based on their position on the page. I agree, the interface is pretty low level, but it makes more sense when you know what problem they are trying to solve (in the end, what matters is choosing how close to its neighbors a letter/word/line has to be in order to be considered part of a paragraph).
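
For illustration, here is a minimal sketch of tuning those grouping heuristics with pdfminer.six (the maintained fork of PDFMiner); the LAParams values shown are just the library defaults, spelled out so you can see what there is to tune, and the file name is a placeholder:

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# char_margin, line_margin and word_margin control how close a
# letter/line/word has to be to its neighbors before it is grouped
# into words, lines and paragraphs (these are the default values).
laparams = LAParams(char_margin=2.0, line_margin=0.5, word_margin=0.1)
text = extract_text("some.pdf", laparams=laparams)  # "some.pdf" is a placeholder
print(text)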

An expensive alternative (in terms of time and computing power) is to generate an image of each page and feed it to OCR; it may be worth a try if you have a very good OCR engine.
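
As a rough sketch of that route, assuming the pdf2image and pytesseract packages (plus their Poppler and Tesseract system dependencies), with a placeholder file name:

import pytesseract
from pdf2image import convert_from_path

# Render each page to an image, then OCR it; "scanned.pdf" is a placeholder.
pages = convert_from_path("scanned.pdf", dpi=300)
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
print(text)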

So my answer is no, there is no such thing as a simple, effective method for extracting text from PDF files - if your documents have a known structure, you can fine-tune the rules and get good results, but it is always a gamble.

I would really like to be proven wrong.

[update]

The answer has not changed, but recently I was involved in two projects: one of them uses computer vision to extract data from scanned hospital forms; the other extracts data from court records. What I learned is:

  1. Computer vision is within reach of mere mortals in 2018. If you have a good sample of already classified documents, you can use OpenCV or scikit-image to extract features and train a machine learning classifier to determine what type a document is.

  2. If the PDF you are analyzing is "searchable", you can get very far extracting all the text using a tool like pdftotext and running a Bayesian filter over it (the same kind of algorithm used to classify spam); see the sketch after this list.

So there is no reliable and effective method for extracting text from PDF files, but you may not need one in order to solve the problem at hand (document type classification).
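
For what it's worth, here is a sketch of that pdftotext-plus-Bayesian-filter idea using scikit-learn; the sample texts and labels below are made up:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# In practice `texts` would hold the pdftotext output of each training
# document; these snippets and labels are invented for illustration.
texts = ["plaintiff hereby moves the court for an order",
         "dear counsel, please find the enclosed documents"]
labels = ["pleading", "correspondence"]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(texts, labels)
print(classifier.predict(["the plaintiff moves for summary judgment"]))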

Solution 2

I am a total novice, but this script works for me:

# import packages
import PyPDF2
import re

# open the pdf file
reader = PyPDF2.PdfFileReader("test.pdf")

# get number of pages
num_pages = reader.getNumPages()

# define key term
search_term = "Social"

# extract text and do the search
for i in range(num_pages):
    page = reader.getPage(i)
    print("this is page " + str(i))
    text = page.extractText()
    # print(text)
    res_search = re.search(search_term, text)
    print(res_search)
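
Note that this uses the old PyPDF2 1.x API. If you are on the current pypdf (PyPDF2's successor), the equivalent would be something like:

import re
from pypdf import PdfReader

reader = PdfReader("test.pdf")
for i, page in enumerate(reader.pages):
    print("this is page " + str(i))
    print(re.search("Social", page.extract_text()))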

Solution 3

I've written extensive systems for the company I work for to convert PDFs into data for processing (invoices, settlements, scanned tickets, etc.), and @Paulo Scardine is correct: there is no completely reliable and easy way to do this. That said, the fastest, most reliable, and least-intensive way is to use pdftotext, part of the xpdf set of tools. This tool will quickly convert searchable PDFs to a text file, which you can read and parse with Python. Hint: use the -layout argument. And by the way, not all PDFs are searchable, only those that contain text; some PDFs contain only images with no text at all.
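
A minimal sketch of calling pdftotext from Python (the file name is a placeholder); per the pdftotext documentation, passing "-" as the output file sends the text to stdout, so no intermediate file is written:

import subprocess

# Run pdftotext with -layout and capture the text from stdout.
result = subprocess.run(
    ["pdftotext", "-layout", "some.pdf", "-"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)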

Solution 4

I recently started using ScraperWiki to do what you described.

Here's an example of using ScraperWiki to extract PDF data.

The scraperwiki.pdftoxml() function returns an XML structure.

You can then use BeautifulSoup to parse that into a navigable tree.

Here's my code:

import urllib.request

import scraperwiki
from bs4 import BeautifulSoup

def send_request(url):
    # Get the content, regardless of whether it is an HTML, XML or PDF file
    return urllib.request.urlopen(url)

def process_pdf(file_location):
    # Fetch the PDF and convert it to XML
    pdf_to_process = send_request(file_location)
    return scraperwiki.pdftoxml(pdf_to_process.read())

def parse_html_tree(content_to_parse):
    # Returns a navigable tree which you can iterate through
    return BeautifulSoup(content_to_parse, "lxml")

pdf = process_pdf('http://greenteapress.com/thinkstats/thinkstats.pdf')
pdf_to_soup = parse_html_tree(pdf)
soup_to_array = pdf_to_soup.find_all('text')
for line in soup_to_array:
    print(line)

This code is going to print a whole, big ugly pile of <text> tags. Each page is separated by a </page> tag, if that's any consolation.

If you want the content inside the <text> tags, which might include headings wrapped in <b> for example, use line.contents.

If you only want each line of text, not including tags, use line.get_text().

It's messy and painful, but it will work for searchable PDF docs. So far I've found it to be accurate, if tedious.

Solution 5

Here is the solution that I found convenient for this issue. In the text variable you get the text from the PDF, so that you can search in it. I have also kept the idea of splitting the text into keywords, as I found on this website, from where I took this solution: https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f. Although setting up nltk was not very straightforward, it might be useful for further purposes:

import PyPDF2
import textract

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def searchInPDF(filename, key):
    occurrences = 0
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    count = 0
    text = ""
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count += 1
        text += pageObj.extractText()
    # fall back to OCR if the PDF has no extractable text;
    # textract returns bytes, so decode them to a string
    if text == "":
        text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')
    tokens = word_tokenize(text)
    punctuation = ['(', ')', ';', ':', '[', ']', ',']
    stop_words = stopwords.words('english')
    keywords = [word for word in tokens if word not in stop_words and word not in punctuation]
    for k in keywords:
        if key == k:
            occurrences += 1
    pdfFileObj.close()
    return occurrences

pdf_filename = '/home/florin/Downloads/python.pdf'
search_for = 'string'
print(searchInPDF(pdf_filename, search_for))
Comments

  • Insarov almost 2 years

    Problem
    I'm trying to determine what type a document is (e.g. pleading, correspondence, subpoena, etc.) by searching through its text, preferably using Python. All PDFs are searchable, but I haven't found a solution to parsing them with Python and applying a script to search them (short of converting them to a text file first, but that could be resource-intensive for n documents).

    What I've done so far
    I've looked into pypdf, PDFMiner, Adobe's PDF documentation, and any questions here I could find (though none seemed to directly solve this issue). PDFMiner seems to have the most potential, but after reading through the documentation I'm not even sure where to begin.

    Is there a simple, effective method for reading PDF text, either by page, line, or the entire document? Or any other workarounds?

  • Hugo Moreno almost 11 years
    Why is it the fastest and most reliable way? Any proof?
  • Insarov almost 11 years
    If there's a way to convert a PDF to a text file, is there a way to do it without writing an actual new file? Something like reading a document into memory? (At least, in a way that's as straightforward as converting it?)
  • Insarov almost 11 years
    All of the documents were scanned in as PDFs and OCR'ed to become searchable - is that different than what you're describing?
  • MikeHunter almost 11 years
    @Insarov, I don't think so, not with pdftotext. But I may be wrong on this; you'll have to check the docs. You can do that with pyPdf and pdfminer, but they are a lot slower than pdftotext, even with pdftotext writing to a file.
  • Paulo Scardine almost 11 years
    @Insarov: Exactly what I'm talking about; any OCR worth its salt will have the option to output a pure text file along with the PDF file.
  • user1211 over 7 years
    I tried using scraperwiki and I am getting a "The system cannot find the path specified" error. @JasTonAChair any help appreciated.
  • venkat over 6 years
    @JasTonAChair I am getting a warning telling me to change BeautifulSoup([your markup]) to BeautifulSoup([your markup], "lxml").
  • Emma Yu about 5 years
    Hi Amey, just change "Social" to any text you want to search for!
  • Amey P Naik about 5 years
    Hi Emma, searching is not a problem, but I need to replace this word with some other word, e.g. replace the word "Social" with "friend".
  • xappppp almost 5 years
    If there is a sizable sample of documents with relatively consistent narrative content (not necessarily consistent format), can we train an AI to understand it, so it can be used to read the text of PDF files outside the sample?
  • Paulo Scardine almost 5 years
    @xappppp given enough time and resources, almost anything is possible.
  • dragon788 almost 4 years
    @Insarov From the pdftotext docs, "If text-file is '-', the text is sent to stdout." So you could pipe this to a search with grep or similar.
  • Mast about 3 years
    @AmeyPNaik That would be modifying a PDF, not just reading/searching one. Modifying an existing PDF programmatically without layout and formatting problems is even more complicated.
  • Mast about 3 years
    Note: on plenty of PDF pages this will only read the header and footer, not the rest of the page.