Is there some sort of PDF-to-text converter?

Solution 1

You have a lot of options!

pdftotext from poppler has already been mentioned.

There's a Haskell program called pdf2line which works well.

calibre's ebook-convert command-line program (or calibre itself) is another option. It can convert PDF to plain text or to other e-book formats (RTF, ePub). In my opinion it generates better results than pdftotext, although it is considerably slower.

ebook-convert file.pdf file.txt

AbiWord can convert between any formats it knows from the command line, and it has a PDF import plugin, at least as an optional component:

abiword --to=txt file.pdf

Yet another option is podofotxtextract from the PoDoFo PDF tools library. I haven't really tried it.

If you combine the two Ghostscript tools, pdf2ps and ps2ascii, you have yet another option.
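The Ghostscript combination can be sketched as a small shell function. This is only an illustration: pdf_to_text_gs is a made-up helper name, and it assumes that pdf2ps and ps2ascii are both installed and on your PATH.

```shell
# Sketch: chain Ghostscript's pdf2ps and ps2ascii to get plain text from a PDF.
# pdf_to_text_gs is a hypothetical helper name, not part of Ghostscript.
pdf_to_text_gs() {
    in_pdf=$1
    out_txt=$2
    # Temporary file for the intermediate PostScript output.
    tmp_ps=$(mktemp) || return 1
    # Step 1: PDF -> PostScript. Step 2: PostScript -> ASCII text.
    pdf2ps "$in_pdf" "$tmp_ps" && ps2ascii "$tmp_ps" "$out_txt"
    status=$?
    rm -f "$tmp_ps"
    return $status
}
```

Usage would look like: pdf_to_text_gs file.pdf file.txt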

I can actually think of a few more methods, but I'll leave it at that for now. ;)

Solution 2

You can convert PDFs to text on the command line with pdftotext (Ubuntu: poppler-utils; OpenBSD: xpdf-utils package).
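For searching many PDFs at once, one possible approach is to pipe pdftotext's output straight into grep instead of keeping the text files around. The sketch below assumes pdftotext from poppler is installed; search_pdfs is a hypothetical helper name chosen for this example.

```shell
# Sketch: list the PDFs in a directory whose text matches a pattern.
# search_pdfs is a hypothetical helper name; pdftotext must be on PATH.
search_pdfs() {
    pattern=$1
    dir=${2:-.}                      # default to the current directory
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue      # skip when the glob matched nothing
        # "-" makes pdftotext write to stdout; -layout keeps the page layout.
        if pdftotext -layout "$f" - 2>/dev/null | grep -q -e "$pattern"; then
            printf '%s\n' "$f"       # print the name of each matching file
        fi
    done
}
```

Usage would look like: search_pdfs 'some phrase' ~/Documents. This re-extracts every file on each search, so for large collections an indexer is likely the better choice.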

You can use Recoll (Ubuntu: recoll; OpenBSD: no port, but there's one for FreeBSD) to search inside many document formats, including PDF. It has a GUI and builds an index automatically under the hood. It uses pdftotext to convert PDF to text.

Acrobat Reader (at least version 9 under Linux) has a limited multiple-file search capability (you can search in all the files in a directory).

Solution 3

pdftotext is likely what you are looking for: http://en.wikipedia.org/wiki/Pdftotext. The exception is when the text you want to extract is actually stored as images (for example, scanned pages), which is not that common with PDF documents.
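A rough way to tell whether a PDF falls into the image-only case is to check whether pdftotext produces any visible characters at all. This is a heuristic sketch, not a reliable test; has_text_layer is a hypothetical helper name, and pdftotext is assumed to be installed.

```shell
# Sketch: succeed if pdftotext extracts at least one non-whitespace character.
# has_text_layer is a hypothetical helper name.
has_text_layer() {
    # "-" sends the extracted text to stdout; grep -q only reports
    # whether something other than blanks and page breaks came out.
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

Usage would look like: has_text_layer file.pdf || echo "file.pdf probably needs OCR"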

Author by

otto

Updated on September 17, 2022

Comments

  • otto
    otto over 1 year

    I need to convert PDF files to text so I can search them in bulk from the command line. Is there a converter for Ubuntu, OpenBSD or a similar system?

    Possibly related: the post on OCR with Ubuntu.

  • Matthew
    Matthew about 12 years
    calibre's ebook-convert... have you seen what it does to ligatures? bleargh. let's put it this way: it's not a very e ective program. pdftotext is much more faithful. i have never discovered any errors in its output.
  • Daniel Näslund
    Daniel Näslund about 12 years
    You can use less for viewing pdf-files as text. It invokes a preprocessor, i.e. lesspipe, for invoking pdftotext or similar tools.
  • terdon
    terdon almost 10 years
    Hi and welcome to the site. We like answers to be a bit more comprehensive here. For example, you could add where gPDFText can be obtained, how it can be installed and how it would be used to answer the OP's question.
  • Amit Patel
    Amit Patel almost 9 years
    pdftotext gives more accurate results than ebook-convert and it is very fast. ebook-convert is sluggish.
  • Stalinko
    Stalinko over 5 years
    pdftotext with the -layout option rocks! calibre requires more than 600 MB to install! That's crazy )