Text Scraping a PDF with Python (pdfquery)

For extracting text from a PDF file, my favorite tool is pdftotext.

With the -layout option, you get plain text back that keeps the original physical layout of the page as closely as possible, which makes it relatively easy to manipulate using Python.

Example below:

"""Extract text from PDF files.

Requires pdftotext from the poppler utilities.
On unix/linux install them using your favorite package manager.

Binaries for ms-windows can be found at:
1) http://blog.alivate.com.au/poppler-windows/
2) https://sourceforge.net/projects/poppler-win32/
"""

import subprocess


def pdftotext(pdf, page=None):
    """Retrieve all text from a PDF file.

    Arguments:
        pdf: Path of the file to read.
        page: Number of the page to read. If None, read all the pages.

    Returns:
        A list of lines of text.
    """
    if page is None:
        args = ['pdftotext', '-layout', '-q', pdf, '-']
    else:
        args = ['pdftotext', '-f', str(page), '-l', str(page), '-layout',
                '-q', pdf, '-']
    try:
        txt = subprocess.check_output(args, universal_newlines=True)
        lines = txt.splitlines()
    except subprocess.CalledProcessError:
        lines = []
    return lines
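
As a minimal sketch of the kind of manipulation meant above: with -layout, table columns usually end up separated by runs of several spaces, so each line can be split on those runs. The helper names `split_columns` and `parse_table`, the `min_gap` parameter, and the sample lines are all assumptions made for illustration; they are not part of pdftotext or of the code above.

```python
import re


def split_columns(line, min_gap=2):
    """Split one line of layout-preserving text into column cells.

    Hypothetical helper: treats any run of at least `min_gap` whitespace
    characters as a column separator, so single spaces inside a cell
    (e.g. "Acme Pty Ltd") are kept.
    """
    pattern = r'\s{%d,}' % min_gap
    return [cell for cell in re.split(pattern, line.strip()) if cell]


def parse_table(lines, min_gap=2):
    """Turn a list of text lines into a list of rows, skipping blank lines."""
    rows = []
    for line in lines:
        cells = split_columns(line, min_gap)
        if cells:
            rows.append(cells)
    return rows


# Made-up sample input standing in for the list returned by pdftotext().
sample = [
    "Licence No.     Holder name        Expiry",
    "",
    "  12345         Acme Pty Ltd       2019-01-01",
]
table = parse_table(sample)
```

In real use the input would be the list returned by `pdftotext(pdf, page)`. Note that this approach cannot tell an empty cell from a wide gap, so rows with missing values may need position-based slicing instead.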

Author: Freya

Updated on June 13, 2022

Comments

  • Freya almost 2 years

    I need to scrape some PDF files to extract the following text information:

    [image not included]

    I have attempted to do this using pdfquery, by working off an example I found on Reddit (see first post): https://www.reddit.com/r/Python/comments/4bnjha/scraping_pdf_files_with_python/

    I wanted to test it out by getting the license numbers to start off with. I went into the generated "xmltree" file, found the first license number and got the x0,y0,x1,y1 co-ordinates in the LTTextLineHorizontal element.

    import pdfquery
    from lxml import etree
    
    
    PDF_FILE = 'C:\\TEMP\\ad-4070-20-september-2018.pdf'
    
    pdf = pdfquery.PDFQuery(PDF_FILE)
    pdf.load(4,5)
    
    with open('xmltree.xml','wb') as f:
        f.write(etree.tostring(pdf.tree, pretty_print=True))
    
    product_info = []
    page_count = len(pdf._pages)
    for pg in range(page_count):
        data = pdf.extract([
            ('with_parent', 'LTPage[pageid="{}"]'.format(pg+1)),
            ('with_formatter', None),
            ('product_name', 'LTTextLineHorizontal:in_bbox("89.904, 757.502, 265.7, 770.83")'),
            ('product_details', 'LTTextLineHorizontal:in_bbox("223, 100, 737, 1114")'),
        ])
        for ix, pn in enumerate(sorted([d for d in data['product_name'] if d.text.strip()], key=lambda x: x.get('y0'), reverse=True)):
            product_info.append({'Manufacturer': pn.text.strip(), 'page': pg, 'y_start': float(pn.get('y1')), 'y_end': float(pn.get('y1'))-150})
            # if this is not the first product on the page, update the previous product's y_end with a
            # value slightly greater than this product's y coordinate start
            if ix > 0:
                product_info[-2]['y_end'] = float(pn.get('y0'))
        # for every product found on this page, find the detail information that falls between the
        # y coordinates belonging to the product
        for product in [p for p in product_info if p['page'] == pg]:
            details = []
            for d in sorted([d for d in data['product_details'] if d.text.strip()], key=lambda x: x.get('y0'), reverse=True):
                if  product['y_start'] > float(d.get('y0')) > product['y_end']:
                    details.append(d.text.strip())
            product['Details'] = ' '.join(details)
    pdf.file.close()
    
    for p in product_info:
        print('Manufacturer: {}\r\nDetail Info:{}...\r\n\r\n'.format(p['Manufacturer'], p['Details'][0:100]))
    

    However, when I run it, it doesn't print anything. There are no errors, the XML file generates fine, and I'm getting the co-ordinates straight from the XML file so there should be no issue. What am I doing wrong?
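
One detail in the code above can be checked without pdfquery at all, and is not necessarily the reason for the empty output: the sort keys use `x.get('y0')`, which returns the attribute as a string (the same code later wraps it in `float(...)`), so the elements are ordered as text rather than as numbers. A minimal sketch of the difference, using plain dicts with made-up coordinates as stand-ins for the XML elements (`sort_top_down` is a hypothetical helper name):

```python
def sort_top_down(elements):
    """Order elements from the top of the page to the bottom.

    Converts the 'y0' value to float before comparing, so that
    '1000.0' sorts above '99.5' instead of below it.
    """
    return sorted(elements, key=lambda e: float(e['y0']), reverse=True)


# Stand-ins for LTTextLineHorizontal elements; the values are invented.
lines = [
    {'text': 'a', 'y0': '99.5'},
    {'text': 'b', 'y0': '757.502'},
    {'text': 'c', 'y0': '1000.0'},
]

# Sorting on the raw strings compares character by character.
as_strings = [e['text'] for e in sorted(lines, key=lambda e: e['y0'], reverse=True)]
# Sorting on floats gives the actual top-to-bottom order.
as_numbers = [e['text'] for e in sort_top_down(lines)]

print(as_strings)  # ['a', 'b', 'c']
print(as_numbers)  # ['c', 'b', 'a']
```

With an lxml element the equivalent key would be `lambda x: float(x.get('y0'))`.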