How do I use pdfminer as a library
Solution 1
Here is a cleaned up version I finally produced that worked for me. The following just simply returns the string in a PDF, given its filename. I hope this saves someone time.
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO
def convert_pdf(path):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = file(path, 'rb')
process_pdf(rsrcmgr, device, fp)
fp.close()
device.close()
str = retstr.getvalue()
retstr.close()
return str
This solution was valid until API changes in November 2013.
Solution 2
Here is a new solution that works with the latest version:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = file(path, 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
interpreter.process_page(page)
fp.close()
device.close()
str = retstr.getvalue()
retstr.close()
return str
Solution 3
I know it is poor taste to answer your own question, but I think I may have figured this out and I don't want anyone else to waste their time looking for a solution to my problem.
I followed the suggestion in a one of the links posted in my question and re-purposed the current pdf2txt.py script included with pdfminer. Here is the function in case it is useful to anyone else. Thanks to the user skyl for posting that answer, all I had to to was make a couple of changes to make it work with the current version of pdfminer.
This function take a pdf and creates a .txt file in the same directory with the same name.
def convert_pdf(path, outtype='txt', opts={}):
import sys
from pdfminer.pdfparser import PDFDocument, PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams
import getopt
outfile = path[:-3] + outtype
outdir = '/'.join(path.split('/')[:-1])
# debug option
debug = 0
# input option
password = ''
pagenos = set()
maxpages = 0
# output option
# ?outfile = None
# ?outtype = None
outdir = None
#layoutmode = 'normal'
codec = 'utf-8'
pageno = 1
scale = 1
showpageno = True
laparams = LAParams()
for (k, v) in opts:
if k == '-d': debug += 1
elif k == '-p': pagenos.update( int(x)-1 for x in v.split(',') )
elif k == '-m': maxpages = int(v)
elif k == '-P': password = v
elif k == '-o': outfile = v
elif k == '-n': laparams = None
elif k == '-A': laparams.all_texts = True
elif k == '-V': laparams.detect_vertical = True
elif k == '-M': laparams.char_margin = float(v)
elif k == '-L': laparams.line_margin = float(v)
elif k == '-W': laparams.word_margin = float(v)
elif k == '-F': laparams.boxes_flow = float(v)
elif k == '-Y': layoutmode = v
elif k == '-O': outdir = v
elif k == '-t': outtype = v
elif k == '-c': codec = v
elif k == '-s': scale = float(v)
#
#PDFDocument.debug = debug
#PDFParser.debug = debug
CMapDB.debug = debug
PDFResourceManager.debug = debug
PDFPageInterpreter.debug = debug
PDFDevice.debug = debug
#
rsrcmgr = PDFResourceManager()
outtype = 'text'
if outfile:
outfp = file(outfile, 'w')
else:
outfp = sys.stdout
device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
fp = file(path, 'rb')
process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password,
check_extractable=True)
fp.close()
device.close()
outfp.close()
return
Solution 4
This worked for me using the most recent version of pdfminer (as of September 2014):
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
import unicodedata, codecs
from io import StringIO
def getPDFText(pdfFilenamePath):
retstr = StringIO()
parser = PDFParser(open(pdfFilenamePath,'r'))
try:
document = PDFDocument(parser)
except Exception as e:
print(pdfFilenamePath,'is not a readable pdf')
return ''
if document.is_extractable:
rsrcmgr = PDFResourceManager()
device = TextConverter(rsrcmgr,retstr, codec='ascii' , laparams = LAParams())
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
interpreter.process_page(page)
return retstr.getvalue()
else:
print(pdfFilenamePath,"Warning: could not extract text from pdf file.")
return ''
if __name__ == '__main__':
words = getPDFText(path)
Solution 5
Here's an answer that works with pdfminer.six
running python 3.6. It uses the pdfminer.high_level
module that abstracts away a lot of the underlying detail if you just want to get out the raw text from a simple PDF file.
import pdfminer
import io
def extract_raw_text(pdf_filename):
output = io.StringIO()
laparams = pdfminer.layout.LAParams() # Using the defaults seems to work fine
with open(pdf_filename, "rb") as pdffile:
pdfminer.high_level.extract_text_to_fp(pdffile, output, laparams=laparams)
return output.getvalue()
jmeich
Updated on July 05, 2022Comments
-
jmeich almost 2 years
I am trying to get text data from a pdf using pdfminer. I am able to extract this data to a .txt file successfully with the pdfminer command line tool pdf2txt.py. I currently do this and then use a python script to clean up the .txt file. I would like to incorporate the pdf extract process into the script and save myself a step.
I thought I was on to something when I found this link, but I didn't have success with any of the solutions. Perhaps the function listed there needs to be updated again because I am using a newer version of pdfminer.
I also tried the function shown here, but it also did not work.
Another approach I tried was to call the script within a script using
os.system
. This was also unsuccessful.I am using Python version 2.7.1 and pdfminer version 20110227.
-
oers about 13 yearsit is okay to answer your own questions, don't worry.
-
Jesvin Jose almost 13 yearsI also repurposed that code. It just could be somewhat slimmer
-
disc0dancer about 12 yearsIt's quite desirable to answer your own question, because of the reason you mentioned
-
scubbo over 10 yearsJust FYI, this no longer works on the current version of pdfminer - I get "ImportError: cannot import name process_pdf"
-
Michael over 10 yearsgreat! finally a running solution without using
process_pdf
:) I tried that with thesimple1.pdf
from the official git repository -
AfromanJ over 10 yearsThis is working for version
pdfminer 20131113
. Thanks. -
jason almost 10 years@czw. How do you use this with the built in options? Is there a way to specify which page number to convert?
-
OWADVL almost 10 yearsif you want to get a html output just add HTMLConverter to imports and in the function, instead of TextConverter put HTMLConverter. Simple as that.
-
Abdul Majeed over 9 yearsYou rock man it is also working with pdfminer latest version(20140328): import pdfminer print pdfminer.__version__ 20140328
-
polkattt about 9 yearshave you given this guy a try? [link]pypi.python.org/pypi/pdfminer3k ... doesnt look like its really supported anymore, but it gave me a good jump off point. in the end, i used their pdf2txt module mixed a few methods from pypdf2 for my specific use case.
-
ViennaMike almost 9 yearsGreat with one minor edit: DON'T use str as a variable name. str is a python function.
-
benzkji over 8 yearsover here is a working solution: stackoverflow.com/questions/26494211/… (as of mid november '15)
-
Andi Giga almost 8 yearsI get
ImportError: No module named pdfpage
-
Ando Jurai over 5 yearsprocess_pdf was merely replaced by PDFPageInterpreter
-
moli about 4 yearsStringIO is also gone and you can use "from io import StringIO" instead
-
Harsh Vardhan over 3 yearswhat to do when it says the pdf is not readable?
-
MTALY almost 3 yearsHi @Pieter, could you please look at this question: stackoverflow.com/questions/68614884/…