Text Scraping a PDF with Python (pdfquery)
For extracting text from a PDF file, my favorite tool is pdftotext
.
Using the -layout
option, you basically get a plain text back, which is relatively easy to manipulate using Python.
Example below:
"""Extract text from PDF files.
Requires pdftotext from the poppler utilities.
On unix/linux install them using your favorite package manager.
Binaries for ms-windows can be found at;
1) http://blog.alivate.com.au/poppler-windows/
2) https://sourceforge.net/projects/poppler-win32/
"""
import subprocess
def pdftotext(pdf, page=None):
"""Retrieve all text from a PDF file.
Arguments:
pdf Path of the file to read.
page: Number of the page to read. If None, read all the pages.
Returns:
A list of lines of text.
"""
if page is None:
args = ['pdftotext', '-layout', '-q', pdf, '-']
else:
args = ['pdftotext', '-f', str(page), '-l', str(page), '-layout',
'-q', pdf, '-']
try:
txt = subprocess.check_output(args, universal_newlines=True)
lines = txt.splitlines()
except subprocess.CalledProcessError:
lines = []
return lines
Related videos on Youtube
Freya
Updated on June 13, 2022Comments
-
Freya almost 2 years
I need to scrape some PDF files to extract the following text information:
I have attempted to do this using pdfquery, by working off an example I found on Reddit (see first post): https://www.reddit.com/r/Python/comments/4bnjha/scraping_pdf_files_with_python/
I wanted to test it out by getting the license numbers to start off with. I went into the generated "xmltree" file, found the first license number and got the x0,y0,x1,y1 co-ordinates in the LTTextLineHorizontal element.
import pdfquery from lxml import etree PDF_FILE = 'C:\\TEMP\\ad-4070-20-september-2018.pdf' pdf = pdfquery.PDFQuery(PDF_FILE) pdf.load(4,5) with open('xmltree.xml','wb') as f: f.write(etree.tostring(pdf.tree, pretty_print=True)) product_info = [] page_count = len(pdf._pages) for pg in range(page_count): data = pdf.extract([ ('with_parent', 'LTPage[pageid="{}"]'.format(pg+1)), ('with_formatter', None), ('product_name', 'LTTextLineHorizontal:in_bbox("89.904, 757.502, 265.7, 770.83")'), ('product_details', 'LTTextLineHorizontal:in_bbox("223, 100, 737, 1114")'), ]) for ix, pn in enumerate(sorted([d for d in data['product_name'] if d.text.strip()], key=lambda x: x.get('y0'), reverse=True)): product_info.append({'Manufacturer': pn.text.strip(), 'page': pg, 'y_start': float(pn.get('y1')), 'y_end': float(pn.get('y1'))-150}) # if this is not the first product on the page, update the previous product's y_end with a # value slightly greater than this product's y coordinate start if ix > 0: product_info[-2]['y_end'] = float(pn.get('y0')) # for every product found on this page, find the detail information that falls between the # y coordinates belonging to the product for product in [p for p in product_info if p['page'] == pg]: details = [] for d in sorted([d for d in data['product_details'] if d.text.strip()], key=lambda x: x.get('y0'), reverse=True): if product['y_start'] > float(d.get('y0')) > product['y_end']: details.append(d.text.strip()) product['Details'] = ' '.join(details) pdf.file.close() for p in product_info: print('Manufacturer: {}\r\nDetail Info:{}...\r\n\r\n'.format(p['Manufacturer'], p['Details'][0:100]))
However, when I run it, it doesn't print anything. There are no errors, the XML file generates fine, and I'm getting the co-ordinates straight from the XML file so there should be no issue. What am I doing wrong?