Is there some sort of PDF-to-text converter?

search pdf ocr text

20,465

Solution 1

You have a lot of options!

pdftotext from poppler has already been mentioned.

There's a Haskell program called pdf2line which works well.

calibre's ebook-convert command-line program (or calibre itself) is another option; it can convert PDF to plain text or to other e-book formats (RTF, ePub). In my opinion it generates better results than pdftotext, although it is considerably slower.

ebook-convert file.pdf file.txt

AbiWord can convert between any formats it knows from the command line, and it has an optional PDF import plugin:

abiword --to=txt file.pdf

Yet another option is podofotxtextract from the PoDoFo PDF tools library. I haven't really tried that.

If you combine the two Ghostscript tools, pdf2ps and ps2ascii, you have yet another option.
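A minimal sketch of that Ghostscript route, assuming both pdf2ps and ps2ascii are installed with your Ghostscript package. The function name pdf_to_ascii is made up for illustration, and I have not compared its output quality against the other tools:

```shell
#!/bin/sh
# pdf_to_ascii (hypothetical helper): PDF -> PostScript -> plain text.
# Assumes pdf2ps and ps2ascii from Ghostscript are on PATH.
pdf_to_ascii() {
    pdf=$1
    base="${pdf%.pdf}"                    # strip the .pdf suffix
    pdf2ps "$pdf" "$base.ps" &&           # step 1: PDF to PostScript
        ps2ascii "$base.ps" "$base.txt" &&  # step 2: PostScript to text
        rm -f "$base.ps"                  # drop the intermediate file
}
```

Calling `pdf_to_ascii file.pdf` should leave a file.txt next to the original. The intermediate .ps file is only removed when both steps succeed, so a failed run leaves something to inspect.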

I can actually think of a few more methods, but I'll leave it at that for now. ;)

Solution 2

You can convert PDFs to text on the command line with pdftotext (Ubuntu: poppler-utils; OpenBSD: xpdf-utils package).
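For the bulk-search use case in the question, something along these lines might do. It is only a sketch: it assumes pdftotext from poppler is on PATH, and the function names, directory and search pattern are placeholders I made up:

```shell
#!/bin/sh
# Convert every PDF in a directory to a sibling .txt file, then grep the text.
# Assumes pdftotext (poppler) is on PATH; names below are illustrative only.

pdf_dir_to_text() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue         # glob matched nothing
        txt="${pdf%.pdf}.txt"
        # -layout tries to keep the original physical page layout
        pdftotext -layout "$pdf" "$txt" || echo "failed: $pdf" >&2
    done
}

search_text() {
    pattern=$1
    dir=${2:-.}
    # -i: ignore case, -l: print only the names of matching files
    grep -il -- "$pattern" "$dir"/*.txt
}
```

Usage would look like `pdf_dir_to_text ~/papers` followed by `search_text 'fourier' ~/papers`. Converting once and grepping the .txt files afterwards avoids re-running pdftotext for every search.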

You can use Recoll (Ubuntu: recoll; OpenBSD: no port, but there's one for FreeBSD) to search inside various formatted document types, including PDF. It has a GUI and builds an index automatically under the hood, using pdftotext to convert PDF to text.

Acrobat Reader (at least version 9 under Linux) has a limited multiple-file search capability (you can search in all the files in a directory).

Solution 3

pdftotext is likely what you are looking for (http://en.wikipedia.org/wiki/Pdftotext), unless the text you want to extract is actually stored as images, which is not that common with PDF documents.
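One rough way to tell the two cases apart is to check whether pdftotext produces any non-whitespace output at all. This is a heuristic sketch, not a reliable detector; has_text_layer is a made-up name, and it assumes pdftotext is on PATH:

```shell
#!/bin/sh
# has_text_layer (hypothetical helper): succeeds if pdftotext extracts
# any non-whitespace characters from the given PDF.
has_text_layer() {
    # An output name of "-" makes pdftotext write to stdout.
    text=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')
    [ -n "$text" ]
}
```

For example, `has_text_layer file.pdf || echo "file.pdf probably needs OCR"`. A scanned document that already carries an OCR text layer will pass this check, so treat the result as a hint only.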


otto

Updated on September 17, 2022

Comments

otto over 1 year

I need to convert PDF files to text so I can search them in bulk from the command line. Is there a converter for Ubuntu, OpenBSD or a similar system?

Possibly related: the post on OCR with Ubuntu.
- Gilles 'SO- stop being evil' over 13 years
  
  Similar question at Super User
- vonbrand over 11 years
  
  If it is a "real" PDF (made from text, etc) pdftotext is your best bet. If it is an image, your best bet is some OCR stuff.
- isomorphismes about 9 years
  
  I always use pdftotext = pdfcat.
- Trevor Boyd Smith about 6 years
  
  similar question at askubuntu
- rogerdpack about 3 years
  
  You can uncompress them; see unix.stackexchange.com/a/17713/8337
Matthew about 12 years

calibre's ebook-convert... have you seen what it does to ligatures? bleargh. let's put it this way: it's not a very e ective program. pdftotext is much more faithful. i have never discovered any errors in its output.
Daniel Näslund about 12 years

You can use less for viewing pdf-files as text. It invokes a preprocessor, i.e. lesspipe, for invoking pdftotext or similar tools.
kenorb almost 10 years

Find pdftotext examples at PDF to TEXT open source command line tool & How to convert all pdf files to text (within a folder) with one command?.
terdon almost 10 years

Hi and welcome to the site. We like answers to be a bit more comprehensive here. For example, you could add where gPDFText can be obtained, how it can be installed and how it would be used to answer the OP's question.
Amit Patel almost 9 years

pdftotext gives more accurate results than ebook-convert and it is very fast. ebook-convert is sluggish.
Stalinko over 5 years

pdftotext with the -layout option rocks! calibre requires more than 600 MB to install! That's crazy )