Is there some sort of PDF-to-text converter?
Solution 1
You have a lot of options!
pdftotext from poppler has already been mentioned.
There's a Haskell program called pdf2line which works well.
calibre's ebook-convert command-line program (or calibre itself) is another option; it can convert PDF to plain text or to other ebook formats (RTF, ePub). In my opinion it generates better results than pdftotext, although it is considerably slower.
ebook-convert file.pdf file.txt
AbiWord can convert between any formats it knows from the command-line, and at least optionally has a PDF import plugin:
abiword --to=txt file.pdf
Yet another option is podofotxtextract from the podofo PDF tools library. I haven't really tried that.
If you combine the two Ghostscript tools, pdf2ps and ps2ascii, you have yet another option.
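As a rough sketch of how the two Ghostscript tools might be chained, the function below converts the PDF to a temporary PostScript file and then to text. The function name gs_pdf_to_text and the default output name are made up for illustration, and it assumes both pdf2ps and ps2ascii are installed and accept the usual "input output" argument order.

```shell
# Hypothetical helper: PDF -> PostScript -> plain text via Ghostscript tools.
# Usage: gs_pdf_to_text input.pdf [output.txt]
gs_pdf_to_text() {
    pdf=$1
    # Default output name: same as the input, with .pdf replaced by .txt
    txt=${2:-${pdf%.pdf}.txt}
    # Intermediate PostScript file, removed again at the end
    ps=$(mktemp) || return 1
    pdf2ps "$pdf" "$ps" && ps2ascii "$ps" "$txt"
    status=$?
    rm -f "$ps"
    return "$status"
}
```

Calling gs_pdf_to_text file.pdf would then be expected to leave the result in file.txt.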
I can actually think of a few more methods, but I'll leave it at that for now. ;)
Solution 2
You can convert PDFs to text on the command line with pdftotext (Ubuntu: poppler-utils package; OpenBSD: xpdf-utils package).
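For the bulk-search use case in the question, something along these lines might do: convert every PDF in a directory with pdftotext, then grep the resulting text files. This is only a sketch; the function name pdf_search is invented here, and it assumes pdftotext is on the PATH and that writing .txt files next to the PDFs is acceptable.

```shell
# Hypothetical helper: convert all PDFs in a directory to text, then list
# the text files that contain the search term (case-insensitive).
# Usage: pdf_search DIRECTORY TERM
pdf_search() {
    dir=$1
    term=$2
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # glob matched nothing
        txt="${pdf%.pdf}.txt"
        pdftotext "$pdf" "$txt" || echo "conversion failed: $pdf" >&2
    done
    # -l prints only the names of matching files
    grep -l -i -- "$term" "$dir"/*.txt 2>/dev/null
}
```

For example, pdf_search ~/papers "fourier" would print the names of the converted files that mention the term.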
You can use Recoll (Ubuntu: recoll; OpenBSD: no port, but there's one for FreeBSD) to search inside various formatted text document types, including PDF. There's a GUI, and it builds an index automatically under the hood. It uses pdftotext to convert PDF to text.
Acrobat Reader (at least version 9 under Linux) has a limited multiple-file search capability (you can search in all the files in a directory).
Solution 3
pdftotext is likely what you are looking for: http://en.wikipedia.org/wiki/Pdftotext, unless the text you want to extract is actually stored as an image, which is not that common with PDF documents.
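One way to guess whether a PDF holds real text or only images is to check whether pdftotext extracts anything besides whitespace. The sketch below does that; the function name pdf_has_text is made up, and an empty result is only a hint that OCR may be needed, not proof.

```shell
# Hypothetical check: succeeds (exit status 0) if pdftotext extracts at
# least one non-whitespace character from the given PDF.
# "-" as the output file tells pdftotext to write to standard output.
pdf_has_text() {
    [ -n "$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')" ]
}
```

It could be used as: pdf_has_text file.pdf || echo "file.pdf probably needs OCR"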
otto
Updated on September 17, 2022
Comments
-
otto over 1 year
I need to convert PDF files to text so I can search them in bulk from the command line. Is there a converter for Ubuntu, OpenBSD, or a similar system?
A possibly related post: OCR with Ubuntu.
-
Gilles 'SO- stop being evil' over 13 years
-
vonbrand over 11 years
If it is a "real" PDF (made from text, etc.), pdftotext is your best bet. If it is an image, your best bet is some OCR stuff.
-
isomorphismes about 9 years
I always use pdftotext = pdfcat.
-
Trevor Boyd Smith about 6 years
-
rogerdpack about 3 years
You can uncompress them; see unix.stackexchange.com/a/17713/8337
-
Matthew about 12 years
calibre's ebook-convert... have you seen what it does to ligatures? bleargh. let's put it this way: it's not a very e ective program. pdftotext is much more faithful. i have never discovered any errors in its output.
-
Daniel Näslund about 12 years
You can use less to view PDF files as text. It invokes a preprocessor, i.e. lesspipe, which in turn invokes pdftotext or similar tools.
-
kenorb almost 10 years
Find pdftotext examples at "PDF to TEXT open source command line tool" and "How to convert all pdf files to text (within a folder) with one command?".
-
terdon almost 10 years
Hi and welcome to the site. We like answers to be a bit more comprehensive here. For example, you could add where gPDFText can be obtained, how it can be installed and how it would be used to answer the OP's question.
-
Amit Patel almost 9 years
pdftotext gives more accurate results than ebook-convert and it is very fast. ebook-convert is sluggish.
-
Stalinko over 5 years
pdftotext with the -layout option rocks! calibre requires more than 600 MB to install! That's crazy )