Cleaning up pdftotext font issues
Solution 1
By default, pdftotext outputs Unicode (UTF-8) data. If your terminal or text editor doesn't support UTF-8, ligatures such as "fi" and "fl" (which can be represented as a single character in Unicode) will appear strangely, as you have noticed.
The simple fix is to tell pdftotext to output ASCII instead of Unicode:
pdftotext -enc ASCII7 input.pdf output.txt
This should produce clean ASCII output, eliminating the need to clean it up manually afterwards.
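A quick way to confirm the conversion actually produced pure ASCII is to scan for any byte outside the 7-bit range. This is a sketch, not part of the original answer; the sample `printf` line stands in for real pdftotext output, and `output.txt` is an assumed file name:

```shell
# In the C locale, any byte outside 7-bit ASCII fails the
# [:print:]/[:space:] character classes, so grep finds leftovers.
printf 'clean converted text\n' > output.txt   # stand-in for pdftotext output
if LC_ALL=C grep -q '[^[:print:][:space:]]' output.txt; then
    echo "non-ASCII bytes remain"
else
    echo "clean ASCII"
fi
```

If any ligature characters survived, the first branch fires and you know Solution 2's cleanup pass is still needed.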
Solution 2
Assuming you're on some kind of Unix-based system, you could run this on the output of pdftotext:
sed -i -e 's/ﬃ/ffi/g' -e 's/ﬁ/fi/g' -e 's/ﬀ/ff/g' -e 's/ﬂ/fl/g' -e 's/ﬄ/ffl/g' output.txt
That should replace the ligatures with the individual letters they break into. (See my comments above for what ligatures have to do with this.)
I tested that on a text file generated through pdftotext from a LaTeX-generated PDF, and it worked fine. But if the LaTeX used a nonstandard encoding or a font with additional ligatures, there may be more to do.
You'll probably want to make sure the font you're using in your terminal has characters for the f-series ligatures. DejaVu Sans Mono is a good choice.
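If you'd rather not enumerate every ligature by hand, Unicode NFKC normalization expands all the compatibility ligatures (U+FB00 through U+FB06) in one pass. This is a sketch, not from the original answer; it assumes python3 is on your PATH, and the sample file stands in for pdftotext output:

```shell
# NFKC normalization decomposes compatibility characters, including
# the f-ligatures, into plain letters.
printf 'ﬁnd the ﬂow in the oﬃce\n' > sample.txt   # sample with ligatures
python3 -c 'import sys, unicodedata; sys.stdout.write(unicodedata.normalize("NFKC", sys.stdin.read()))' \
    < sample.txt > expanded.txt
cat expanded.txt   # -> find the flow in the office
```

Note that NFKC also rewrites other compatibility characters (superscripts, fractions, and so on), which may or may not be what you want.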
Blaizz
Updated on September 17, 2022

Comments
- Blaizz, over 1 year: I'm using pdftotext to make an ASCII version of a PDF document (made with LaTeX), because collaborators prefer a simple document in MS Word. The plain text version I see looks good, but upon closer inspection the f character seems to be frequently mis-converted depending on what characters follow. For example, fi and fl often seem to become one special character, which I will try to paste here: ﬁ and ﬂ. What is the best way to clean up the output of pdftotext? I am thinking sed might be the right tool, but I'm not sure how to detect these special characters.
- frabjous, over 13 years: fl, fi, ff, ffl, and ffi are common typographic ligatures, commonly replaced by a single character (and definitely with TeX): en.wikipedia.org/wiki/Typographic_ligature#Computer_typesetting - perhaps you just need to check that the font you're viewing the output in has them, and that the encoding is right.
- frabjous, over 13 years: Oh, and you mean pdftotext from poppler, right, not pdftotex?
- frabjous, over 13 years: Do you have the original TeX source? Why not use, e.g., latex2rtf or oolatex (from TeX4ht) to generate a word-processor file for the Word junkies? Compiling to PDF and then converting to plain text seems like a very weird route for conversion.
- frabjous, over 13 years: Oh, and if you DO want to convert PDF to plain text, consider using ebook-convert from calibre (calibre-ebook.com) rather than pdftotext. It allows plain-text output (and a variety of other formats) and handles ligatures for you.
- Admin, over 13 years: I did mean pdftotext. Typo fixed. I have the original TeX source, but latex2rtf and oolatex do not work as well as pdftotext. I use additional packages like siunitx and glossaries, and therefore it seems like going via the PDF is the best solution. I wish there were a better way.
- Admin, over 13 years: Thanks for the ebook-convert suggestion; that seems to work better than pdftotext.
- Liquidizer, over 13 years: Thanks. I found the ebook-convert suggestion above to be the best. Your advice might improve the default behavior of pdftotext, but I think my terminal does support UTF-8, and ebook-convert seems to handle superscripts and other things better.
- amenthes, over 5 years: This solution will also not work if you actually need Unicode characters in your output.
- amenthes, over 5 years: In case your terminal is not UTF-8 (for example Windows cmd.exe), you can also do this with the byte representation: sed -e 's/\xEF\xAC\x80/ff/g' -e 's/\xEF\xAC\x81/fi/g' -e 's/\xEF\xAC\x82/fl/g' -e 's/\xEF\xAC\x83/ffi/g' -e 's/\xEF\xAC\x84/ffl/g' -e 's/\xEF\xAC\x85/ft/g' -e 's/\xEF\xAC\x86/st/g'
- GDP2, almost 3 years: I did a similar process by converting to UTF-8 with pdftotext -enc UTF-8, and then I simply copied the UTF-8 characters and replaced them with their correct ASCII counterparts. I eyeballed the original PDF to make sure it made sense for characters which had no direct ASCII counterpart.
- GDP2, almost 3 years: The problem with this is, you can end up with words like dene for define or suer for suffer. Converting to ASCII directly will automatically strip all ligatures and won't bother trying to convert them to their ASCII alternatives.