How do I use pdfminer as a library

python pdf pdfminer

84,155

Solution 1

Here is a cleaned up version I finally produced that worked for me. The following just simply returns the string in a PDF, given its filename. I hope this saves someone time.

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO
def convert_pdf(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str

This solution was valid until API changes in November 2013.

Solution 2

Here is a new solution that works with the latest version:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str

Solution 3

I know it is poor taste to answer your own question, but I think I may have figured this out and I don't want anyone else to waste their time looking for a solution to my problem.

I followed the suggestion in a one of the links posted in my question and re-purposed the current pdf2txt.py script included with pdfminer. Here is the function in case it is useful to anyone else. Thanks to the user skyl for posting that answer, all I had to to was make a couple of changes to make it work with the current version of pdfminer.

This function take a pdf and creates a .txt file in the same directory with the same name.

def convert_pdf(path, outtype='txt', opts={}):
import sys
from pdfminer.pdfparser import PDFDocument, PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams
import getopt
outfile = path[:-3] + outtype
outdir = '/'.join(path.split('/')[:-1])
# debug option
debug = 0
# input option
password = ''
pagenos = set()
maxpages = 0
# output option
# ?outfile = None
# ?outtype = None
outdir = None
#layoutmode = 'normal'
codec = 'utf-8'
pageno = 1
scale = 1
showpageno = True
laparams = LAParams()
for (k, v) in opts:
    if k == '-d': debug += 1
    elif k == '-p': pagenos.update( int(x)-1 for x in v.split(',') )
    elif k == '-m': maxpages = int(v)
    elif k == '-P': password = v
    elif k == '-o': outfile = v
    elif k == '-n': laparams = None
    elif k == '-A': laparams.all_texts = True
    elif k == '-V': laparams.detect_vertical = True
    elif k == '-M': laparams.char_margin = float(v)
    elif k == '-L': laparams.line_margin = float(v)
    elif k == '-W': laparams.word_margin = float(v)
    elif k == '-F': laparams.boxes_flow = float(v)
    elif k == '-Y': layoutmode = v
    elif k == '-O': outdir = v
    elif k == '-t': outtype = v
    elif k == '-c': codec = v
    elif k == '-s': scale = float(v)
#
#PDFDocument.debug = debug
#PDFParser.debug = debug
CMapDB.debug = debug
PDFResourceManager.debug = debug
PDFPageInterpreter.debug = debug
PDFDevice.debug = debug
#
rsrcmgr = PDFResourceManager()
outtype = 'text'
if outfile:
    outfp = file(outfile, 'w')
else:
    outfp = sys.stdout
device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
fp = file(path, 'rb')
process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password,
                check_extractable=True)
fp.close()
device.close()
outfp.close()
return

Solution 4

This worked for me using the most recent version of pdfminer (as of September 2014):

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
import unicodedata, codecs
from io import StringIO
def getPDFText(pdfFilenamePath):
    retstr = StringIO()
    parser = PDFParser(open(pdfFilenamePath,'r'))
    try:
        document = PDFDocument(parser)
    except Exception as e:
        print(pdfFilenamePath,'is not a readable pdf')
        return ''
    if document.is_extractable:
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr,retstr, codec='ascii' , laparams = LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
        return retstr.getvalue()
    else:
        print(pdfFilenamePath,"Warning: could not extract text from pdf file.")
        return ''
if __name__ == '__main__':
    words = getPDFText(path)

Solution 5

Here's an answer that works with pdfminer.six running python 3.6. It uses the pdfminer.high_level module that abstracts away a lot of the underlying detail if you just want to get out the raw text from a simple PDF file.

import pdfminer
import io
def extract_raw_text(pdf_filename):
    output = io.StringIO()
    laparams = pdfminer.layout.LAParams() # Using the defaults seems to work fine
    with open(pdf_filename, "rb") as pdffile:
        pdfminer.high_level.extract_text_to_fp(pdffile, output, laparams=laparams)
    return output.getvalue()

View more solutions

84,155

Author by

jmeich

Updated on July 05, 2022

Comments

jmeich 4 months

I am trying to get text data from a pdf using pdfminer. I am able to extract this data to a .txt file successfully with the pdfminer command line tool pdf2txt.py. I currently do this and then use a python script to clean up the .txt file. I would like to incorporate the pdf extract process into the script and save myself a step.

I thought I was on to something when I found this link, but I didn't have success with any of the solutions. Perhaps the function listed there needs to be updated again because I am using a newer version of pdfminer.

I also tried the function shown here, but it also did not work.

Another approach I tried was to call the script within a script using os.system. This was also unsuccessful.

I am using Python version 2.7.1 and pdfminer version 20110227.
oers over 11 years

it is okay to answer your own questions, don't worry.
Jesvin Jose over 11 years

I also repurposed that code. It just could be somewhat slimmer
disc0dancer over 10 years

It's quite desirable to answer your own question, because of the reason you mentioned
scubbo almost 9 years

Just FYI, this no longer works on the current version of pdfminer - I get "ImportError: cannot import name process_pdf"
Michael almost 9 years

great! finally a running solution without using process_pdf :) I tried that with the simple1.pdf from the official git repository
AfromanJ almost 9 years

This is working for version pdfminer 20131113. Thanks.
jason over 8 years

@czw. How do you use this with the built in options? Is there a way to specify which page number to convert?
OWADVL over 8 years

if you want to get a html output just add HTMLConverter to imports and in the function, instead of TextConverter put HTMLConverter. Simple as that.
Abdul Majeed about 8 years

You rock man it is also working with pdfminer latest version(20140328): import pdfminer print pdfminer.__version__ 20140328
polkattt over 7 years

have you given this guy a try? [link]pypi.python.org/pypi/pdfminer3k ... doesnt look like its really supported anymore, but it gave me a good jump off point. in the end, i used their pdf2txt module mixed a few methods from pypdf2 for my specific use case.
ViennaMike over 7 years

Great with one minor edit: DON'T use str as a variable name. str is a python function.
benzkji almost 7 years

over here is a working solution: stackoverflow.com/questions/26494211/… (as of mid november '15)
Andi Giga over 6 years

I get ImportError: No module named pdfpage
Ando Jurai almost 4 years

process_pdf was merely replaced by PDFPageInterpreter
moli over 2 years

StringIO is also gone and you can use "from io import StringIO" instead
Harsh Vardhan over 1 year

what to do when it says the pdf is not readable?
MTALY over 1 year

Hi @Pieter, could you please look at this question: stackoverflow.com/questions/68614884/…