Ruby: Reading PDF files



Solution 1

You might find Docsplit useful:

Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
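For reading text only, a minimal sketch along these lines might be enough. It assumes the docsplit gem and its system dependencies are installed, and that Docsplit writes the text of a whole document to "<output dir>/<PDF basename>.txt". The helper names docsplit_text_path and docsplit_read_text are made up for this example and are not part of Docsplit.

```ruby
require 'tmpdir'

# Hypothetical helper: where Docsplit is assumed to put the extracted text
# for a whole document ("<output dir>/<PDF basename>.txt").
def docsplit_text_path(pdf_path, output_dir)
  basename = File.basename(pdf_path, File.extname(pdf_path))
  File.join(output_dir, "#{basename}.txt")
end

# Hypothetical helper: extract the text layer of one PDF and return it as a string.
def docsplit_read_text(pdf_path)
  require 'docsplit' # loaded lazily; needs the docsplit gem to be installed
  Dir.mktmpdir do |dir|
    # ocr: false skips OCR, which is only needed for scanned pages
    Docsplit.extract_text([pdf_path], ocr: false, output: dir)
    File.read(docsplit_text_path(pdf_path, dir), encoding: 'UTF-8')
  end
end
```

Usage would then be something like text = docsplit_read_text("large_report.pdf"), where the file name is just a placeholder.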

Solution 2

After trying different methods, I'm using PDF-Toolkit now. It's quite old, but it's fast, stable and reliable. Besides, it doesn't really need to be new, because it just wraps the xpdf command-line utilities.
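To show what "just wraps the command-line utilities" amounts to, here is a rough sketch that calls pdftotext directly through Open3. This is not PDF-Toolkit's own API. The method names pdftotext_command and pdf_text are invented for this example, and it assumes a pdftotext binary (from xpdf or Poppler) is on the PATH.

```ruby
require 'open3'

# Build the argument list for pdftotext. A trailing "-" sends the text to stdout.
def pdftotext_command(pdf_path, first_page: nil, last_page: nil, layout: false)
  cmd = ['pdftotext', '-enc', 'UTF-8']
  cmd << '-layout' if layout                      # try to keep the physical layout
  cmd += ['-f', first_page.to_s] if first_page    # first page to convert
  cmd += ['-l', last_page.to_s] if last_page      # last page to convert
  cmd + [pdf_path, '-']
end

# Run pdftotext and return the extracted text, raising if the tool reports an error.
def pdf_text(pdf_path, **options)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path, **options))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
end
```

For large files, reading a page range at a time, for example pdf_text("big.pdf", first_page: 1, last_page: 10), keeps the amount of text held in memory small.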

Solution 3

You could use JRuby and a Java PDF parsing library such as Apache PDFBox (https://www.ohloh.net/p/pdfbox). See also http://java-source.net/open-source/pdf-libraries.

Solution 4

Did you have a look at the CombinePDF library?

It's a pure Ruby solution that allows some PDF manipulation, such as extracting pages, overlaying one PDF page over another, page numbering, writing basic text and tables, etc.

Here's an example for stamping an existing PDF file with a logo. The example reads a PDF file, extracts one page to use as a stamp and stamps another PDF file.

require 'combine_pdf'
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf = CombinePDF.load "content_file.pdf"
pdf.pages.each {|page| page << company_logo}
pdf.save "content_with_logo.pdf"

You can also stamp text, number pages or write simple tables:

require 'combine_pdf'

pdf = CombinePDF.load "content_file.pdf"

pdf.number_pages # adds page numbers; you can pass formatting and placement options

pdf.pages.each {|page| page.textbox "One Way To Stamp"}

# you can use a shortcut method to stamp pages
pdf.stamp_pages "Another way to stamp"

# the shortcut method works for both text and PDF stamps
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf.stamp_pages company_logo

# you can write simple tables
pdf.pages[0].write_table headers: ['first name', 'surname'], table_data: [['John', 'Doe'], ['Mr.', 'Smith']]

pdf.save "content_with_logo.pdf"

It's not meant for complex operations, but it complements most PDF authoring libraries and allows you to use PDF templates instead of writing the whole thing from scratch.



Author by

Javier

Javier manages products and sometimes wishes he had more time to play with Ruby, Rails and his kids.

Updated on June 27, 2020

Comments

Javier almost 4 years

I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX).

Until now I've found the rather old and simple PDF-toolkit (a pdftotext wrapper) and PDF-reader, which was unable to read most of my files, even though the two libraries provide exactly the functionality I was looking for.

My question: Have I missed something? Is there a tool that is better suited (faster and more reliable) to solve my problem?
Javier about 15 years

Why would I need OCR ("optical character recognition") to read a PDF that doesn't consist of scanned text? Wouldn't that needlessly slow down the whole process?
Javier about 15 years

That sounds like an interesting alternative. Have you seen an implementation or an example somewhere?
Terry about 15 years

No. OCR is the process of converting images to text. PDF readers and PDF toolkits utilize this concept to convert an image (the same that is output from, say, a scanner) to text.
Javier about 15 years

So basically you're saying that all text inside a PDF consists of an image that needs to be recognized as text first?
Terry about 15 years

I'd be lying if I said yes or no. I just know I've had success with OCRs.
jashkenas almost 14 years

Javier: do take a look at Docsplit. It wraps the Apache PDFBox library for text extraction -- because we've had better quality results with PDFBox than pdftotext.
Cris Stringfellow about 12 years

Some PDFs are scanned; others encode the text in ultra weird formats like (t)(h)(i)(s) (s)(o)(m)(e) (t)(e)((x)(t)); some also use deflate.
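As a rough illustration of those last two cases, here is a toy sketch. It inflates a deflate-compressed content stream with Ruby's Zlib and glues together text that was split into one-character strings inside a TJ array. The method name naive_stream_text and the sample stream are made up for this example. Real PDFs also involve font encodings, escaped parentheses and kerning-based spacing, all of which this ignores.

```ruby
require 'zlib'

# Inflate a content stream and join the literal strings found in "[...] TJ" arrays.
# Deliberately naive: no escaped parentheses, no font encodings, no spacing logic.
def naive_stream_text(deflated_stream)
  content = Zlib::Inflate.inflate(deflated_stream)
  content.scan(/\[(.*?)\]\s*TJ/m).flat_map do |(array_body)|
    array_body.scan(/\((.*?)\)/).flatten
  end.join
end

# Made-up sample: the words "this text" stored letter by letter, then deflated.
sample = Zlib::Deflate.deflate('BT /F1 12 Tf [(t)(h)(i)(s) (t)(e)(x)(t)] TJ ET')
puts naive_stream_text(sample) # prints "thistext"
```

Note that the space between the words is lost, which is one reason dedicated extractors tend to give better results than ad-hoc parsing.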
Sean Larkin almost 11 years

@pw. I installed all the libraries and followed all the documentation for this, but I was having a hard time. Do you have any pointers to tutorials or documentation that go beyond 2 lines of code?
Javier about 9 years

@Trejkaz I guess I meant to say that I "only" wanted to read them and therefore wasn't planning to buy software for it? But who knows what I was thinking 6yrs ago? ;-)