How do I use pdfminer as a library

Solution 1

Here is a cleaned-up version that finally worked for me. The function below returns the text of a PDF as a string, given its filename. I hope this saves someone time.

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

def convert_pdf(path):

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    # avoid naming the result "str", which would shadow the built-in
    text = retstr.getvalue()
    retstr.close()
    return text

This solution was valid until the API changes in November 2013, after which process_pdf can no longer be imported.
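
If you are not sure which side of that API change your installed pdfminer is on, a quick check along these lines may help. This is only a sketch: detect_pdfminer_api is a made-up helper name, and it assumes the two import paths used in Solutions 1 and 2 are the only ones you care about.

```python
def detect_pdfminer_api():
    """Report which page-processing entry point the installed pdfminer offers.

    Returns "process_pdf" (the older API used in Solution 1), "PDFPage"
    (the newer API used in Solution 2), or None if neither can be imported.
    """
    try:
        from pdfminer.pdfinterp import process_pdf  # noqa: F401  (older API)
        return "process_pdf"
    except ImportError:
        pass
    try:
        from pdfminer.pdfpage import PDFPage  # noqa: F401  (newer API)
        return "PDFPage"
    except ImportError:
        # Also reached when pdfminer is not installed at all.
        return None


print(detect_pdfminer_api())
```

If it prints "PDFPage", use Solution 2 or later; if it prints "process_pdf", Solution 1 should still apply.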

Solution 2

Here is a new solution that works with the latest version:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    # avoid naming the result "str", which would shadow the built-in
    text = retstr.getvalue()
    retstr.close()
    return text
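
Two things come up in the comments: cStringIO is gone on Python 3, and people ask how to convert only certain pages. Here is a rough sketch of the same approach for Python 3, assuming pdfminer.six is installed. The names parse_page_numbers and convert_pdf_pages_to_txt are my own, not part of pdfminer, and the "1,3" page format is just borrowed from the -p option of pdf2txt.py.

```python
from io import StringIO


def parse_page_numbers(spec):
    """Turn a 1-based spec such as "1,3" into the 0-based set get_pages expects."""
    return {int(part) - 1 for part in spec.split(",") if part.strip()}


def convert_pdf_pages_to_txt(path, pages=""):
    # Imported here so parse_page_numbers() stays usable without pdfminer installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    pagenos = parse_page_numbers(pages)  # an empty set means all pages
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp, pagenos, check_extractable=True):
            interpreter.process_page(page)
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text
```

Calling convert_pdf_pages_to_txt("some.pdf", "1,3") should then return the text of the first and third pages only, and leaving pages empty converts the whole document.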

Solution 3

I know it is poor taste to answer your own question, but I think I may have figured this out and I don't want anyone else to waste their time looking for a solution to my problem.

I followed the suggestion in one of the links posted in my question and re-purposed the current pdf2txt.py script included with pdfminer. Here is the function in case it is useful to anyone else. Thanks to the user skyl for posting that answer; all I had to do was make a couple of changes to make it work with the current version of pdfminer.

This function takes a PDF and creates a .txt file in the same directory with the same name.

def convert_pdf(path, outtype='txt', opts=()):
    # opts is a sequence of (flag, value) pairs, as returned by getopt.getopt()
    import sys
    from pdfminer.pdfparser import PDFDocument, PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
    from pdfminer.pdfdevice import PDFDevice, TagExtractor
    from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
    from pdfminer.cmapdb import CMapDB
    from pdfminer.layout import LAParams
    import getopt

    # e.g. "report.pdf" -> "report.txt" (assumes a three-character extension)
    outfile = path[:-3] + outtype

    # debug option
    debug = 0
    # input option
    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    # ?outfile = None
    # ?outtype = None
    outdir = None
    #layoutmode = 'normal'
    codec = 'utf-8'
    pageno = 1
    scale = 1
    showpageno = True
    laparams = LAParams()
    for (k, v) in opts:
        if k == '-d': debug += 1
        elif k == '-p': pagenos.update( int(x)-1 for x in v.split(',') )
        elif k == '-m': maxpages = int(v)
        elif k == '-P': password = v
        elif k == '-o': outfile = v
        elif k == '-n': laparams = None
        elif k == '-A': laparams.all_texts = True
        elif k == '-V': laparams.detect_vertical = True
        elif k == '-M': laparams.char_margin = float(v)
        elif k == '-L': laparams.line_margin = float(v)
        elif k == '-W': laparams.word_margin = float(v)
        elif k == '-F': laparams.boxes_flow = float(v)
        elif k == '-Y': layoutmode = v
        elif k == '-O': outdir = v
        elif k == '-t': outtype = v
        elif k == '-c': codec = v
        elif k == '-s': scale = float(v)
    #
    #PDFDocument.debug = debug
    #PDFParser.debug = debug
    CMapDB.debug = debug
    PDFResourceManager.debug = debug
    PDFPageInterpreter.debug = debug
    PDFDevice.debug = debug
    #
    rsrcmgr = PDFResourceManager()

    # only plain-text output is produced, whatever outtype was passed
    outtype = 'text'

    if outfile:
        outfp = open(outfile, 'w')
    else:
        outfp = sys.stdout
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)

    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password,
                check_extractable=True)
    fp.close()
    device.close()
    # do not close sys.stdout if that is where the text went
    if outfp is not sys.stdout:
        outfp.close()
    return
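
The opts argument is the easy part to get wrong: the loop unpacks (flag, value) pairs, which is what getopt produces, not a dict. Below is a small sketch of how one might build those pairs and derive the output filename more safely. build_opts, output_path and SHORT_OPTS are names I made up for illustration; the option string simply lists the flags the function above checks for.

```python
import getopt
import os

# Flags handled by convert_pdf above; a trailing ":" means the flag takes a value.
SHORT_OPTS = "dnAVp:m:P:o:M:L:W:F:Y:O:t:c:s:"


def build_opts(args):
    """Parse pdf2txt.py-style flags into the (flag, value) pairs convert_pdf loops over."""
    opts, _remaining = getopt.getopt(args, SHORT_OPTS)
    return opts


def output_path(path, outtype="txt"):
    """Swap the extension without assuming it is exactly three characters long."""
    root, _ext = os.path.splitext(path)
    return root + "." + outtype


print(build_opts(["-p", "1,2", "-m", "5"]))  # [('-p', '1,2'), ('-m', '5')]
print(output_path("reports/summary.pdf"))    # reports/summary.txt
```

With that, something like convert_pdf(path, opts=build_opts(["-p", "1,2"])) would restrict the conversion to the first two pages.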

Solution 4

This worked for me using the most recent version of pdfminer (as of September 2014):

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
import unicodedata, codecs
from io import StringIO

def getPDFText(pdfFilenamePath):
    retstr = StringIO()
    # PDFs are binary files; keep the file open while the pages are processed
    with open(pdfFilenamePath, 'rb') as fp:
        parser = PDFParser(fp)
        try:
            document = PDFDocument(parser)
        except Exception as e:
            print(pdfFilenamePath, 'is not a readable pdf:', e)
            return ''
        if document.is_extractable:
            rsrcmgr = PDFResourceManager()
            device = TextConverter(rsrcmgr, retstr, codec='ascii', laparams=LAParams())
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            for page in PDFPage.create_pages(document):
                interpreter.process_page(page)
            device.close()
            return retstr.getvalue()
        else:
            print(pdfFilenamePath, "Warning: could not extract text from pdf file.")
            return ''

if __name__ == '__main__':
    import sys
    # pass the path to a PDF on the command line
    words = getPDFText(sys.argv[1])
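
If you hit the "is not a readable pdf" message, one cheap thing to rule out first is that the file is not a PDF at all (an HTML error page saved with a .pdf name, for example). The check below is only a sketch: looks_like_pdf is a hypothetical helper, and it assumes the "%PDF-" marker sits at the very start of the file, which is the usual case but not something every PDF in the wild honours.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the "%PDF-" marker, False otherwise."""
    try:
        with open(path, "rb") as fp:
            return fp.read(5) == b"%PDF-"
    except OSError:
        # Missing or unreadable file: treat it as "not a PDF".
        return False
```

If this returns True and pdfminer still refuses the file, the PDF may be damaged or password protected, and the exception message is the next place to look.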

Solution 5

Here's an answer that works with pdfminer.six running on Python 3.6. It uses the pdfminer.high_level module, which abstracts away a lot of the underlying detail if you just want to get the raw text out of a simple PDF file.

import io
# import the submodules explicitly; a bare "import pdfminer" does not load them
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

def extract_raw_text(pdf_filename):
    output = io.StringIO()
    laparams = LAParams() # Using the defaults seems to work fine

    with open(pdf_filename, "rb") as pdffile:
        extract_text_to_fp(pdffile, output, laparams=laparams)

    return output.getvalue()
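
Since the original goal was to extract the text and clean it up in one script, here is a rough sketch of how the two steps might be combined. It assumes a pdfminer.six release recent enough to provide pdfminer.high_level.extract_text; clean_text and pdf_to_clean_text are hypothetical names, and the clean-up rules are just an example (pdfminer's text output typically puts a form feed character between pages).

```python
import re


def clean_text(raw):
    """Replace form feeds, trim trailing spaces, and collapse runs of blank lines."""
    text = raw.replace("\x0c", "\n")
    lines = [line.rstrip() for line in text.splitlines()]
    text = "\n".join(lines)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def pdf_to_clean_text(pdf_filename, page_numbers=None):
    # Imported here so clean_text() stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    # page_numbers is a list of zero-based page indexes; None means all pages.
    return clean_text(extract_text(pdf_filename, page_numbers=page_numbers))
```

Replace the body of clean_text with whatever your existing clean-up script does; the point is only that no intermediate .txt file is needed.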

Author: jmeich

Updated on July 05, 2022

Comments

  • jmeich
    jmeich almost 2 years

    I am trying to get text data from a pdf using pdfminer. I am able to extract this data to a .txt file successfully with the pdfminer command line tool pdf2txt.py. I currently do this and then use a python script to clean up the .txt file. I would like to incorporate the pdf extract process into the script and save myself a step.

    I thought I was on to something when I found this link, but I didn't have success with any of the solutions. Perhaps the function listed there needs to be updated again because I am using a newer version of pdfminer.

    I also tried the function shown here, but it also did not work.

    Another approach I tried was to call the script within a script using os.system. This was also unsuccessful.

    I am using Python version 2.7.1 and pdfminer version 20110227.

  • oers
    oers about 13 years
    it is okay to answer your own questions, don't worry.
  • Jesvin Jose
    Jesvin Jose almost 13 years
    I also repurposed that code. It just could be somewhat slimmer
  • disc0dancer
    disc0dancer about 12 years
    It's quite desirable to answer your own question, because of the reason you mentioned
  • scubbo
    scubbo over 10 years
    Just FYI, this no longer works on the current version of pdfminer - I get "ImportError: cannot import name process_pdf"
  • Michael
    Michael over 10 years
    great! finally a running solution without using process_pdf :) I tried that with the simple1.pdf from the official git repository
  • AfromanJ
    AfromanJ over 10 years
    This is working for version pdfminer 20131113. Thanks.
  • jason
    jason almost 10 years
    @czw. How do you use this with the built in options? Is there a way to specify which page number to convert?
  • OWADVL
    OWADVL almost 10 years
    if you want to get a html output just add HTMLConverter to imports and in the function, instead of TextConverter put HTMLConverter. Simple as that.
  • Abdul Majeed
    Abdul Majeed over 9 years
    You rock man it is also working with pdfminer latest version(20140328): import pdfminer print pdfminer.__version__ 20140328
  • polkattt
    polkattt about 9 years
    have you given this guy a try? [link]pypi.python.org/pypi/pdfminer3k ... doesnt look like its really supported anymore, but it gave me a good jump off point. in the end, i used their pdf2txt module mixed a few methods from pypdf2 for my specific use case.
  • ViennaMike
    ViennaMike almost 9 years
    Great with one minor edit: DON'T use str as a variable name. str is a python function.
  • benzkji
    benzkji over 8 years
    over here is a working solution: stackoverflow.com/questions/26494211/… (as of mid november '15)
  • Andi Giga
    Andi Giga almost 8 years
    I get ImportError: No module named pdfpage
  • Ando Jurai
    Ando Jurai over 5 years
    process_pdf was merely replaced by PDFPageInterpreter
  • moli
    moli about 4 years
    StringIO is also gone and you can use "from io import StringIO" instead
  • Harsh Vardhan
    Harsh Vardhan over 3 years
    what to do when it says the pdf is not readable?
  • MTALY
    MTALY almost 3 years
    Hi @Pieter, could you please look at this question: stackoverflow.com/questions/68614884/…