PDF miner - extract font size?


Solution 1

This task puzzled me for a long time. Besides extracting the font information, I also wanted to run everything from a Python script.

However, today I was able to solve it. The script below calls pdf2txt.py from the command line and then extracts the font information from the newly created HTML file.

import os

pathToScript = r'path\to\script\pdf2txt.py'
pathPDFinput = os.path.join(r'path\to\file', 'test.pdf')
pathHTMLoutput = os.path.join(r'path\to\file', 'test.html')

# call pdf2txt.py from the command line; the paths are quoted in case they
# contain spaces, and the options are placed before the input file
os.system('python "{}" -o "{}" -S -t html "{}"'.format(pathToScript, pathHTMLoutput, pathPDFinput))
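
The same call can also be made with subprocess.run and an argument list, which avoids quoting problems when a path contains spaces. This is a minimal sketch, not part of the original answer: the helper names build_pdf2txt_command and run_pdf2txt are hypothetical, and the flags are the ones used in the call above.

```python
import subprocess
import sys


def build_pdf2txt_command(script, pdf_in, html_out):
    """Build the pdf2txt.py call as an argument list (hypothetical helper)."""
    # sys.executable is the interpreter running this script, so the command
    # does not depend on what 'python' resolves to on the PATH.
    return [sys.executable, str(script), '-o', str(html_out),
            '-S', '-t', 'html', str(pdf_in)]


def run_pdf2txt(script, pdf_in, html_out):
    """Run pdf2txt.py; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_pdf2txt_command(script, pdf_in, html_out), check=True)
```

With check=True a failed conversion raises an exception instead of silently leaving a missing or empty HTML file.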
  

Extract the font size and font family for every span in the HTML:

# credits to akash karothiya: 
# https://stackoverflow.com/questions/39012739/need-to-extract-all-the-font-sizes-and-the-text-using-beautifulsoup/39015419#39015419

import re
import pandas as pd
from bs4 import BeautifulSoup

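# open the html file (pdf2txt.py writes UTF-8 by default) and name the
# parser explicitly so BeautifulSoup does not have to guess one
with open(pathHTMLoutput, 'r', encoding='utf-8') as html:
    soup = BeautifulSoup(html, 'html.parser')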

font_spans = [data for data in soup.select('span') if 'font-size' in str(data)]
output = []
for span in font_spans:
    fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)', str(span.get('style'))).group(2)
    fonts_family = re.search(r'(?is)(font-family:)(.*?)(;)', str(span.get('style'))).group(2)

    # split fonts_family into fonts-type and fonts-style
    try:
        fonts_type = fonts_family.strip().split(',')[0]
        fonts_style = fonts_family.strip().split(',')[1]
    except IndexError:
        fonts_type = fonts_family.strip()
        fonts_style = None

    output.append(
        (str(span.text).strip(), fonts_size.strip(), fonts_type, fonts_style)
    )

# create dataframe
df = pd.DataFrame(output, columns = ['text', 'fonts-size', 'fonts-type', 'fonts-style'])
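
Since the goal in the question was to separate sections by font size, here is a minimal sketch of how rows like the ones collected above could be split into headings and body text. It assumes that the most frequent font size is the body size and that anything larger is a heading. That heuristic, the function name split_by_font_size, and the sample rows are illustrative assumptions, not something pdfminer provides.

```python
from collections import Counter


def split_by_font_size(rows):
    """Split (text, size, type, style) rows into headings and body text.

    Assumption: the most common font size is the body size, and any
    larger size marks a heading.
    """
    if not rows:
        return [], []
    sizes = [float(size) for _, size, _, _ in rows]
    body_size = Counter(sizes).most_common(1)[0][0]
    headings, body = [], []
    for (text, _, _, _), size in zip(rows, sizes):
        (headings if size > body_size else body).append(text)
    return headings, body


# Made-up rows in the same shape as the `output` list above.
rows = [
    ('Section A', '14', 'Helvetica', 'Bold'),
    ('first paragraph', '10', 'Helvetica', None),
    ('Section B', '14', 'Helvetica', 'Bold'),
    ('second paragraph', '10', 'Helvetica', None),
    ('third paragraph', '10', 'Helvetica', None),
]
headings, body = split_by_font_size(rows)
print(headings)
```

Real documents often need a tolerance around the body size, because the sizes reported for a PDF are not always whole numbers.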

Solution 2

Try specifying the file output type with the -t flag:

pdf2txt.py -o output.html -t html samples/CentolaCV.pdf

That should return an HTML file with the style attributes font-family and font-size.

EDIT: Actually, it looks like the extension of the output file determines the output type even without the -t flag. Can you link to the PDF file that you're trying to extract font styles from?

Solution 3

Full disclosure: I am one of the maintainers of pdfminer.six, a community-maintained version of pdfminer for Python 3.

Nowadays, pdfminer.six has multiple APIs to extract text and information from a PDF. For extracting information programmatically, I would advise using extract_pages(). This allows you to inspect all of the elements on a page, ordered in a meaningful hierarchy created by the layout algorithm.

The following example is a Pythonic way of showing all the elements in the hierarchy. It uses simple1.pdf from the samples directory of pdfminer.six.

from pathlib import Path
from typing import Iterable, Any

from pdfminer.high_level import extract_pages


def show_ltitem_hierarchy(o: Any, depth=0):
    """Show location and text of LTItem and all its descendants"""
    if depth == 0:
        print('element                        fontname             text')
        print('------------------------------ -------------------- -----')

    print(
        f'{get_indented_name(o, depth):<30.30s} '
        f'{get_optional_fontinfo(o):<20.20s} '
        f'{get_optional_text(o)}'
    )

    if isinstance(o, Iterable):
        for i in o:
            show_ltitem_hierarchy(i, depth=depth + 1)


def get_indented_name(o: Any, depth: int) -> str:
    """Indented name of class"""
    return '  ' * depth + o.__class__.__name__


def get_optional_fontinfo(o: Any) -> str:
    """Font info of LTChar if available, otherwise empty string"""
    if hasattr(o, 'fontname') and hasattr(o, 'size'):
        return f'{o.fontname} {round(o.size)}pt'
    return ''


def get_optional_text(o: Any) -> str:
    """Text of LTItem if available, otherwise empty string"""
    if hasattr(o, 'get_text'):
        return o.get_text().strip()
    return ''


path = Path('~/Downloads/simple1.pdf').expanduser()
pages = extract_pages(path)
show_ltitem_hierarchy(pages)

The output shows the different elements in the hierarchy, the font name and size if available and the text that this element contains.

element                        fontname             text
------------------------------ -------------------- -----
generator                                           
  LTPage                                            
    LTTextBoxHorizontal                             Hello
      LTTextLineHorizontal                          Hello
        LTChar                 Helvetica 24pt       H
        LTChar                 Helvetica 24pt       e
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       o
        LTChar                 Helvetica 24pt       
        LTAnno                                      
    LTTextBoxHorizontal                             World
      LTTextLineHorizontal                          World
        LTChar                 Helvetica 24pt       W
        LTChar                 Helvetica 24pt       o
        LTChar                 Helvetica 24pt       r
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       d
        LTAnno                                      
    LTTextBoxHorizontal                             Hello
      LTTextLineHorizontal                          Hello
        LTChar                 Helvetica 24pt       H
        LTChar                 Helvetica 24pt       e
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       o
        LTChar                 Helvetica 24pt       
        LTAnno                                      
    LTTextBoxHorizontal                             World
      LTTextLineHorizontal                          World
        LTChar                 Helvetica 24pt       W
        LTChar                 Helvetica 24pt       o
        LTChar                 Helvetica 24pt       r
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       d
        LTAnno                                      
    LTTextBoxHorizontal                             H e l l o
      LTTextLineHorizontal                          H e l l o
        LTChar                 Helvetica 24pt       H
        LTAnno                                      
        LTChar                 Helvetica 24pt       e
        LTAnno                                      
        LTChar                 Helvetica 24pt       l
        LTAnno                                      
        LTChar                 Helvetica 24pt       l
        LTAnno                                      
        LTChar                 Helvetica 24pt       o
        LTAnno                                      
        LTChar                 Helvetica 24pt       
        LTAnno                                      
    LTTextBoxHorizontal                             W o r l d
      LTTextLineHorizontal                          W o r l d
        LTChar                 Helvetica 24pt       W
        LTAnno                                      
        LTChar                 Helvetica 24pt       o
        LTAnno                                      
        LTChar                 Helvetica 24pt       r
        LTAnno                                      
        LTChar                 Helvetica 24pt       l
        LTAnno                                      
        LTChar                 Helvetica 24pt       d
        LTAnno                                      
    LTTextBoxHorizontal                             H e l l o
      LTTextLineHorizontal                          H e l l o
        LTChar                 Helvetica 24pt       H
        LTAnno                                      
        LTChar                 Helvetica 24pt       e
        LTAnno                                      
        LTChar                 Helvetica 24pt       l
        LTAnno                                      
        LTChar                 Helvetica 24pt       l
        LTAnno                                      
        LTChar                 Helvetica 24pt       o
        LTAnno                                      
        LTChar                 Helvetica 24pt       
        LTAnno                                      
    LTTextBoxHorizontal                             W o r l d
      LTTextLineHorizontal                          W o r l d
        LTChar                 Helvetica 24pt       W
        LTAnno                                      
        LTChar                 Helvetica 24pt       o
        LTAnno                                      
        LTChar                 Helvetica 24pt       r
        LTAnno                                      
        LTChar                 Helvetica 24pt       l
        LTAnno                                      
        LTChar                 Helvetica 24pt       d
        LTAnno                                      
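
To get one font name and size per line rather than per character, the characters of each line can be tallied. The sketch below is duck-typed like the functions above, so it accepts any iterable of objects with fontname and size attributes (LTChar has both; LTAnno has neither and is skipped). The names dominant_font and looks_bold are hypothetical. Treating "Bold" in the font name as a sign of bold text is only a heuristic, since it depends on how the fonts in a given PDF are named.

```python
from collections import Counter
from types import SimpleNamespace


def dominant_font(line):
    """Return the most common (fontname, rounded size) among a line's characters.

    Returns None when no child carries font information.
    """
    fonts = Counter(
        (char.fontname, round(char.size))
        for char in line
        if hasattr(char, 'fontname') and hasattr(char, 'size')
    )
    return fonts.most_common(1)[0][0] if fonts else None


def looks_bold(fontname):
    """Heuristic: the weight is often part of the font name, e.g. 'Helvetica-Bold'."""
    return 'bold' in fontname.lower()


# Stand-in objects so the sketch runs without a PDF. With pdfminer.six, the
# line would be an LTTextLineHorizontal reached while walking extract_pages().
line = [
    SimpleNamespace(fontname='Helvetica-Bold', size=23.9),
    SimpleNamespace(fontname='Helvetica-Bold', size=24.0),
    SimpleNamespace(),  # plays the role of LTAnno: no font attributes
]
font = dominant_font(line)
print(font, looks_bold(font[0]))
```

Counting per line keeps a single stray character in another font from changing the result for the whole line.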

(Similar answers here, here and here; I'll try to keep them in sync.)

Author by

user3314418

python beginner. thanks for your help! I apologize for any 'stupid' questions in advance.

Updated on June 04, 2022

Comments

  • user3314418
    user3314418 almost 2 years

    I'm curious whether it's possible to use pdfminer to extract font size. I think this would be helpful for separating out different sections. I know there's the discussion below, but I'd like to know whether pdfminer itself can do it:

    Extract text from PDF in respect to formatting (font size, type etc)

    The pdfminer documentation says it's possible: http://www.unixuser.org/~euske/python/pdfminer/

    But when I type the following into the command line, I just get a plain text document. I don't see any font information.

    pdf2txt.py -o output.html samples/CentolaCV.pdf
    

    e.g...

    2008-13  Assistant Professor, Sloan School of Management, M.I.T.  
    
    2006-08   Robert Wood Johnson Scholar in Health Policy, Harvard University 
    
    2001-02   Visiting Scholar, The Brookings Institution
    
  • Imane E.
    Imane E. over 6 years
    Is it possible to get font-weight too? I need the text in bold.