Cleaning up pdftotext font issues


Solution 1

By default, pdftotext outputs Unicode (UTF-8) data. If your terminal or text editor doesn't support UTF-8, ligatures such as "fi" and "fl" (each of which can be represented by a single Unicode character, e.g. ﬁ or ﬂ) will not display correctly, as you have noticed.

The simple fix is to tell pdftotext to output ASCII instead of unicode:

pdftotext -enc ASCII7 input.pdf output.txt

This should produce clean ASCII output, eliminating the need to clean it up manually afterwards.
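
If you're not sure which encoding names your build of pdftotext accepts, it can list them for you. A quick check (input.pdf and output.txt are placeholder names):

pdftotext -listenc
pdftotext -enc ASCII7 input.pdf output.txt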

Solution 2

Assuming you're on some kind of Unix-based system, you could run this on the output of pdftotext:

sed -i -e 's/ﬃ/ffi/g' -e 's/ﬁ/fi/g' -e 's/ﬀ/ff/g' -e 's/ﬂ/fl/g' -e 's/ﬄ/ffl/g' output.txt

That should replace the ligatures with the individual letters they break into. (See my comments above for what ligatures have to do with this.)

I tested that on a text file generated by pdftotext from a LaTeX-produced PDF, and it worked fine. But if the LaTeX source used a nonstandard encoding, or a font with additional ligatures, there may be more to do.
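
If you want to see whether any ligature characters remain after the substitution, one rough check, assuming GNU grep and a UTF-8 locale, is to count them:

grep -o '[ﬀﬁﬂﬃﬄﬅﬆ]' output.txt | sort | uniq -c

An empty result means the common f- and st-ligatures are gone; any counts tell you which characters still need their own substitution.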

You'll probably want to make sure the font you're using in your terminal has characters for the f-series ligatures. DejaVu Sans Mono is a good choice.
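
If fontconfig is installed, you can also query which fonts actually contain a given ligature glyph (here U+FB01, ﬁ); this assumes your fc-list supports charset patterns:

fc-list ':charset=fb01' family | sort -u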


Author by

Blaizz

Updated on September 17, 2022

Comments

  • Blaizz
    Blaizz over 1 year

    I'm using pdftotext to make an ASCII version of a PDF document (made with LaTeX), because collaborators prefer a simple document in MS Word.

    The plain text version I see looks good, but upon closer inspection the f character seems to be frequently mis-converted depending on what characters follow. For example, fi and fl often seem to become one special character, which I will try to paste here: ﬁ and ﬂ.

    What is the best way to clean up the output of pdftotext? I am thinking sed might be the right tool, but I'm not sure how to detect these special characters.

    • frabjous
      frabjous over 13 years
      fl, fi, ff, ffl, and ffi are common typographic ligatures, commonly replaced by a single character (and definitely with TeX): en.wikipedia.org/wiki/Typographic_ligature#Computer_typesetting - perhaps you just need to check that the font you're viewing the output in has them, and that the encoding is right.
    • frabjous
      frabjous over 13 years
      oh, and you mean pdftotext from poppler, right, not pdftotex?
    • frabjous
      frabjous over 13 years
      Do you have the original TeX source? Why not use, e.g., latex2rtf or oolatex (from TeX4ht) to generate a Word Processor file for the Word junkies? Compiling to PDF and then converting to plain text seems like a very weird route for conversion.
    • frabjous
      frabjous over 13 years
      Oh, and if you DO want to convert PDF to plain text, consider using ebook-convert from calibre (calibre-ebook.com) rather than pdftotext. It allows plain text output (and a variety of other formats), and handles ligatures for you.
    • Admin
      Admin over 13 years
      I did mean pdftotext. Typo fixed. I have original TeX source, but latex2rtf and oolatex do not work as well as pdftotext. I use additional packages like siunitx and glossaries, and therefore it seems like going via the PDF is the best solution. I wish there were a better way.
    • Admin
      Admin over 13 years
      Thanks for the ebook-convert suggestion, that seems to work better than pdftotext.
  • Liquidizer
    Liquidizer over 13 years
    Thanks. I found the ebook-convert suggestion above to be the best. Your advice might improve the default behavior of pdftotext, but I think my terminal does support UTF-8, and ebook-convert seems to handle superscripts and other things better.
  • amenthes
    amenthes over 5 years
    this solution will also not work if you actually need unicode characters in your output.
  • amenthes
    amenthes over 5 years
    In case your terminal is not UTF-8 (for example Windows cmd.exe), you can also do this with the byte representation: sed -e 's/\xEF\xAC\x80/ff/g' -e 's/\xEF\xAC\x81/fi/g' -e 's/\xEF\xAC\x82/fl/g' -e 's/\xEF\xAC\x83/ffi/g' -e 's/\xEF\xAC\x84/ffl/g' -e 's/\xEF\xAC\x85/ft/g' -e 's/\xEF\xAC\x86/st/g'.
  • GDP2
    GDP2 almost 3 years
    I did a similar process by converting to UTF-8 with pdftotext -enc UTF-8, and then I simply copied the UTF-8 characters and replaced them with their correct ASCII counterparts. I eyeballed the original PDF to make sure it made sense for characters which had no direct ASCII counterpart.
  • GDP2
    GDP2 almost 3 years
    The problem with this is that you can end up with words like dene for define or suer for suffer. Converting to ASCII directly will automatically strip all ligatures and won't bother trying to convert them to their ASCII alternatives.