How to extract formatted text content from a PDF

24,205

Solution 1

To extract the text from a PDF and get its position, you can use PDFMiner. PDFMiner can also export the PDF directly to HTML, keeping the text in the right position.
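
As a rough illustration of the PDFMiner approach, here is a minimal sketch that assumes the maintained pdfminer.six fork is installed and uses its `extract_pages` helper. The function names `iter_text_boxes` and `describe` are made up for this example and are not part of PDFMiner.

```python
from typing import Iterator, Tuple

# (x0, y0, x1, y1) in PDF points; the origin is the bottom-left corner of the page.
BBox = Tuple[float, float, float, float]


def iter_text_boxes(pdf_path: str) -> Iterator[Tuple[int, BBox, str]]:
    """Yield (page_number, bbox, text) for every text box PDFMiner finds."""
    # Imported lazily so this module still loads when pdfminer.six is missing.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Only text containers carry text; figures, lines, etc. are skipped.
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text().strip()


def describe(page_number: int, bbox: BBox, text: str) -> str:
    """Format one text box as a single human-readable line (hypothetical helper)."""
    x0, y0, x1, y1 = bbox
    return f"p{page_number} ({x0:.1f}, {y0:.1f})-({x1:.1f}, {y1:.1f}): {text!r}"
```

Calling `for item in iter_text_boxes("input.pdf"): print(describe(*item))` would print one line per text box, where `input.pdf` is a placeholder path.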

I don't know your use case, but you can run into a lot of problems when doing this, because PDF is presentation-oriented rather than content-oriented: the text flow is not continuous. So if you want the text to be editable, it will not be an easy task.
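
To make the "text flow is not continuous" problem concrete, here is a small sketch of one possible way to rebuild reading order from positioned fragments. It assumes each fragment is an `(x0, y, text)` tuple in PDF coordinates (origin at the bottom-left) and a simple single-column layout; the `tolerance` value is an arbitrary guess, and real documents with columns or tables need more than this.

```python
from typing import Iterable, List, Tuple

# (x0, y, text): left edge and vertical position in PDF points, origin bottom-left.
Fragment = Tuple[float, float, str]


def group_into_lines(fragments: Iterable[Fragment], tolerance: float = 2.0) -> List[str]:
    """Group fragments with nearly equal y into lines, ordered top to bottom."""
    # Larger y means higher on the page, so sort by descending y first.
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    lines: List[str] = []
    current: List[Tuple[float, str]] = []
    current_y = 0.0
    for x0, y, text in ordered:
        # A vertical jump larger than the tolerance starts a new line.
        if current and abs(y - current_y) > tolerance:
            lines.append(" ".join(t for _, t in sorted(current, key=lambda p: p[0])))
            current = []
        if not current:
            current_y = y
        current.append((x0, text))
    if current:
        lines.append(" ".join(t for _, t in sorted(current, key=lambda p: p[0])))
    return lines
```

Fragments can arrive in any order, which is common in PDFs; the function sorts them again by horizontal position inside each line.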

Solution 2

Have you tried the pyPDF or ReportLab PDF libraries? I have not used them personally, but you can give them a try.

Solution 3

Xpdf has a utility called pdftotext that does a great job. http://foolabs.com/xpdf/download.html
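
If you want to call pdftotext from Python, a minimal sketch could look like the following. It assumes the `pdftotext` binary is on your PATH and relies on its `-layout` and `-enc` options, with `-` as the output file to write to stdout; the wrapper function names are invented for this example.

```python
import shutil
import subprocess
from typing import List


def build_pdftotext_command(pdf_path: str, keep_layout: bool = True) -> List[str]:
    """Build the argument list for pdftotext, writing UTF-8 text to stdout."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        command.append("-layout")
    # "-" as the output file name means standard output.
    command += ["-enc", "UTF-8", pdf_path, "-"]
    return command


def pdf_to_text(pdf_path: str, keep_layout: bool = True) -> str:
    """Run pdftotext and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Note that this only gives you plain text with approximate spacing, not fonts or styles.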

Solution 4

If you want to do it just like Google:

Google converts the PDF to an image and then overlays the image, where the text used to be, with JavaScript highlightable areas (which is pretty much voodoo magic). The areas appear to be text when you move your cursor over them, but they're not. This might not help you, but that's how they do it.

If you want to reverse engineer it, you might start with https://www.mercurial-scm.org/ On the home page, they do the same thing with JavaScript to make the text highlightable and copyable.

You can extract the text from the PDF and find its location on the page with one of the libraries mentioned in the other answers. Then you can overlay an extracted image of the file with the same style of JavaScript areas.
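
The overlay idea can be sketched roughly as follows. This is not how Google does it; it is a simplified stand-in that uses plain absolutely positioned HTML spans with transparent text instead of JavaScript areas. It assumes you already have text boxes as `(x0, y0, x1, y1, text)` in PDF points and a page image rendered at one pixel per point; `overlay_html` and the image file name are hypothetical.

```python
from html import escape
from typing import Iterable, Tuple

# (x0, y0, x1, y1, text) in PDF points, with the origin at the bottom-left.
Box = Tuple[float, float, float, float, str]


def overlay_html(image_url: str, page_width: float, page_height: float,
                 boxes: Iterable[Box]) -> str:
    """Return HTML that lays selectable, invisible text over a page image."""
    spans = []
    for x0, y0, x1, y1, text in boxes:
        # CSS measures from the top of the page, PDF from the bottom: flip y.
        top = page_height - y1
        spans.append(
            f'<span style="position:absolute;left:{x0:.1f}px;top:{top:.1f}px;'
            f'width:{x1 - x0:.1f}px;height:{y1 - y0:.1f}px;color:transparent">'
            f'{escape(text)}</span>'
        )
    return (
        f'<div style="position:relative;width:{page_width:.1f}px;'
        f'height:{page_height:.1f}px;'
        f'background:url({escape(image_url, quote=True)}) no-repeat">'
        + "".join(spans)
        + "</div>"
    )
```

For example, `overlay_html("page1.png", 612, 792, boxes)` would produce one container for a US Letter page, with the text selectable on top of the image.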

Solution 5

If you don't have your heart set on doing this with Python, Ghostscript can do this for you. Check out pdf2ascii (a script that comes with GS) to get the plain text. Styles are more complicated, as they can be specified in a few different ways.
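
If you still want to drive Ghostscript from a script, here is a hedged sketch. Instead of the helper script mentioned above, it uses Ghostscript's `txtwrite` output device, which is a different route to plain text. It assumes the executable is named `gs` (on Windows it is usually `gswin64c`), and the function name is made up for this example.

```python
import subprocess
from typing import List


def build_gs_text_command(pdf_path: str, output_path: str = "-") -> List[str]:
    """Build a Ghostscript command that writes the PDF's text to output_path.

    An output_path of "-" sends the text to standard output.
    """
    return [
        "gs",
        "-q",                  # suppress startup messages
        "-dNOPAUSE",           # do not pause between pages
        "-dBATCH",             # exit when the input has been processed
        "-sDEVICE=txtwrite",   # text extraction device
        f"-sOutputFile={output_path}",
        pdf_path,
    ]


def gs_pdf_to_text(pdf_path: str) -> str:
    """Run Ghostscript and return the extracted text."""
    result = subprocess.run(build_gs_text_command(pdf_path),
                            check=True, capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")
```

As with the other command-line tools, this returns text only; style information is not included.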

Author by

hoju


Updated on June 20, 2020

Comments

  • hoju
    hoju almost 4 years

    How can I extract the text content (not images) from a PDF while (roughly) maintaining the style and layout like Google Docs can?

  • hoju
    hoju over 14 years
    Ah, you're right: they are using images, which is not what I want because I need to manipulate the text.
  • naught101
    naught101 over 9 years
    This package is available in Ubuntu under the name python-pdfminer, and the command is pdf2txt.