PDF miner - extract font size?
Solution 1
This task puzzled me for a long time. Besides extracting the font information, I also wanted to run everything from a Python script.
However, today I was able to solve it. The script below calls the pdf2txt.py script from the command line and then extracts the font information from the newly created HTML file.
import os
pathToScript = r'path\to\script\pdf2txt.py'
pathPDFinput = os.path.join(r'path\to\file', 'test.pdf')
pathHTMLoutput = os.path.join(r'path\to\file', 'test.html')
# call the pdf2txt.py from the command line
os.system('python {} -o {} -S {} -t html'.format(pathToScript, pathHTMLoutput, pathPDFinput))
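A possible variation on the call above: passing an argument list to `subprocess.run` avoids quoting problems when a path contains spaces, and `check=True` raises an error if the script fails. This is only a sketch, not part of the original answer. The helper names `build_pdf2txt_command` and `convert_pdf_to_html` are hypothetical, and the flags are simply the ones used in the `os.system` call.

```python
import subprocess
import sys


def build_pdf2txt_command(script_path, pdf_path, html_path):
    """Return the argument list for running pdf2txt.py with HTML output."""
    # Same flags as the os.system call above (-o, -S, -t html),
    # with the options first and the input PDF last.
    return [
        sys.executable, str(script_path),
        '-o', str(html_path),
        '-S',
        '-t', 'html',
        str(pdf_path),
    ]


def convert_pdf_to_html(script_path, pdf_path, html_path):
    """Run pdf2txt.py; raises CalledProcessError on a non-zero exit code."""
    command = build_pdf2txt_command(script_path, pdf_path, html_path)
    subprocess.run(command, check=True)
```

With the variables defined earlier, the call would be `convert_pdf_to_html(pathToScript, pathPDFinput, pathHTMLoutput)`.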
Extract the font-size for every html tag:
# credits to akash karothiya:
# https://stackoverflow.com/questions/39012739/need-to-extract-all-the-font-sizes-and-the-text-using-beautifulsoup/39015419#39015419
import re
import pandas as pd
from bs4 import BeautifulSoup
# open the html file
with open(pathHTMLoutput, 'r') as html:
    # name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(html, 'html.parser')
font_spans = [data for data in soup.select('span') if 'font-size' in str(data)]
output = []
for span in font_spans:
    style = str(span.get('style'))
    fonts_size = re.search(r'(?is)(font-size:)(.*?)(px)', style).group(2)
    fonts_family = re.search(r'(?is)(font-family:)(.*?)(;)', style).group(2)
    # split fonts_family into fonts-type and fonts-style
    try:
        fonts_type = fonts_family.strip().split(',')[0]
        fonts_style = fonts_family.strip().split(',')[1]
    except IndexError:
        fonts_type = fonts_family.strip()
        fonts_style = None
    output.append(
        (str(span.text).strip(), fonts_size.strip(), fonts_type, fonts_style)
    )
# create dataframe
df = pd.DataFrame(output, columns=['text', 'fonts-size', 'fonts-type', 'fonts-style'])
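One caveat with the regular expressions above: `re.search` returns `None` when a span has no matching property, so `.group(2)` raises an `AttributeError`. A more tolerant option is to split the style attribute into a dictionary first. The following is a minimal sketch using only the standard library, assuming style strings of the form `font-family: Helvetica; font-size:24px`. The helper `parse_style` is hypothetical and not part of BeautifulSoup or pdfminer.

```python
def parse_style(style):
    """Split an inline CSS style string into a {property: value} dict."""
    properties = {}
    for declaration in (style or '').split(';'):
        name, sep, value = declaration.partition(':')
        if sep:  # skip empty or malformed declarations
            properties[name.strip().lower()] = value.strip()
    return properties


style = parse_style('font-family: Helvetica,Bold; font-size:24px')
font_size = style.get('font-size')      # '24px'
font_family = style.get('font-family')  # 'Helvetica,Bold'

# A span without font information yields None instead of an exception.
missing = parse_style('position:absolute').get('font-size')
```

Inside the loop this could be used as `parse_style(span.get('style'))`, skipping spans where the lookup returns `None`.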
Solution 2
Try specifying the output file type with the -t flag:
pdf2txt.py -o output.html -t html samples/CentolaCV.pdf
That should return an html file with the style attributes font-family and font-size.
EDIT: Actually, it looks like the extension of the output file can determine the output type without the -t flag. Can you link to the PDF file that you're trying to extract the font style from?
Solution 3
Full disclosure: I am one of the maintainers of pdfminer.six, a community-maintained version of pdfminer for Python 3.
Nowadays, pdfminer.six has multiple APIs to extract text and information from a PDF. For programmatically extracting information, I would advise using extract_pages(). This allows you to inspect all of the elements on a page, ordered in a meaningful hierarchy created by the layout algorithm.
The following example is a Pythonic way of showing all the elements in the hierarchy. It uses the simple1.pdf from the samples directory of pdfminer.six.
from pathlib import Path
from typing import Iterable, Any

from pdfminer.high_level import extract_pages


def show_ltitem_hierarchy(o: Any, depth=0):
    """Show location and text of LTItem and all its descendants"""
    if depth == 0:
        print('element                        fontname             text')
        print('------------------------------ -------------------- -----')

    print(
        f'{get_indented_name(o, depth):<30.30s} '
        f'{get_optional_fontinfo(o):<20.20s} '
        f'{get_optional_text(o)}'
    )

    if isinstance(o, Iterable):
        for i in o:
            show_ltitem_hierarchy(i, depth=depth + 1)


def get_indented_name(o: Any, depth: int) -> str:
    """Indented name of class"""
    return '  ' * depth + o.__class__.__name__


def get_optional_fontinfo(o: Any) -> str:
    """Font info of LTChar if available, otherwise empty string"""
    if hasattr(o, 'fontname') and hasattr(o, 'size'):
        return f'{o.fontname} {round(o.size)}pt'
    return ''


def get_optional_text(o: Any) -> str:
    """Text of LTItem if available, otherwise empty string"""
    if hasattr(o, 'get_text'):
        return o.get_text().strip()
    return ''


path = Path('~/Downloads/simple1.pdf').expanduser()
pages = extract_pages(path)
show_ltitem_hierarchy(pages)
The output shows the different elements in the hierarchy, the font name and size if available and the text that this element contains.
element                        fontname             text
------------------------------ -------------------- -----
generator
  LTPage
    LTTextBoxHorizontal                             Hello
      LTTextLineHorizontal                          Hello
        LTChar                 Helvetica 24pt       H
        LTChar                 Helvetica 24pt       e
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       o
        LTChar                 Helvetica 24pt
        LTAnno
    LTTextBoxHorizontal                             World
      LTTextLineHorizontal                          World
        LTChar                 Helvetica 24pt       W
        LTChar                 Helvetica 24pt       o
        LTChar                 Helvetica 24pt       r
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       d
        LTAnno
    LTTextBoxHorizontal                             Hello
      LTTextLineHorizontal                          Hello
        LTChar                 Helvetica 24pt       H
        LTChar                 Helvetica 24pt       e
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       o
        LTChar                 Helvetica 24pt
        LTAnno
    LTTextBoxHorizontal                             World
      LTTextLineHorizontal                          World
        LTChar                 Helvetica 24pt       W
        LTChar                 Helvetica 24pt       o
        LTChar                 Helvetica 24pt       r
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       d
        LTAnno
    LTTextBoxHorizontal                             H e l l o
      LTTextLineHorizontal                          H e l l o
        LTChar                 Helvetica 24pt       H
        LTAnno
        LTChar                 Helvetica 24pt       e
        LTAnno
        LTChar                 Helvetica 24pt       l
        LTAnno
        LTChar                 Helvetica 24pt       l
        LTAnno
        LTChar                 Helvetica 24pt       o
        LTAnno
        LTChar                 Helvetica 24pt
        LTAnno
    LTTextBoxHorizontal                             W o r l d
      LTTextLineHorizontal                          W o r l d
        LTChar                 Helvetica 24pt       W
        LTAnno
        LTChar                 Helvetica 24pt       o
        LTAnno
        LTChar                 Helvetica 24pt       r
        LTAnno
        LTChar                 Helvetica 24pt       l
        LTAnno
        LTChar                 Helvetica 24pt       d
        LTAnno
    LTTextBoxHorizontal                             H e l l o
      LTTextLineHorizontal                          H e l l o
        LTChar                 Helvetica 24pt       H
        LTAnno
        LTChar                 Helvetica 24pt       e
        LTAnno
        LTChar                 Helvetica 24pt       l
        LTAnno
        LTChar                 Helvetica 24pt       l
        LTAnno
        LTChar                 Helvetica 24pt       o
        LTAnno
        LTChar                 Helvetica 24pt
        LTAnno
    LTTextBoxHorizontal                             W o r l d
      LTTextLineHorizontal                          W o r l d
        LTChar                 Helvetica 24pt       W
        LTAnno
        LTChar                 Helvetica 24pt       o
        LTAnno
        LTChar                 Helvetica 24pt       r
        LTAnno
        LTChar                 Helvetica 24pt       l
        LTAnno
        LTChar                 Helvetica 24pt       d
        LTAnno
(Similar answers here, here and here; I'll try to keep them in sync.)
Updated on June 04, 2022
Comments
-
user3314418 almost 2 years
I'm curious whether it's possible to use pdfminer to extract font size. I think this would be helpful for separating out different sections. I know there's the discussion below, but I'm curious whether it's possible with pdfminer:
Extract text from PDF in respect to formatting (font size, type etc)
The pdfminer documentation says it's possible: http://www.unixuser.org/~euske/python/pdfminer/
But when I type the following into the command line, I just get a plain text document. I don't see any font information.
pdf2txt.py -o output.html samples/CentolaCV.pdf
e.g.
2008-13 Assistant Professor, Sloan School of Management, M.I.T.
2006-08 Robert Wood Johnson Scholar in Health Policy, Harvard University
2001-02 Visiting Scholar, The Brookings Institution
-
Imane E. over 6 years
Is it possible to get font-weight too? I need the text in bold.