Ruby: Reading PDF files

26,452

Solution 1

You might find Docsplit useful:

Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
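As a rough sketch of the library side, assuming the docsplit gem and its system dependencies are installed: `Docsplit.extract_text` writes the plain text into an output directory, and passing `ocr: false` skips OCR for PDFs that already contain real text. The helper names `docsplit_text_path` and `extract_text_with_docsplit` are made up for this example, and the `<basename>.txt` output name is my understanding of Docsplit's behaviour rather than something stated above.

```ruby
# Path where Docsplit is assumed to write the extracted text:
# "<output_dir>/<basename of the PDF>.txt" (an assumption, not a documented guarantee).
def docsplit_text_path(pdf_path, output_dir)
  File.join(output_dir, "#{File.basename(pdf_path, '.*')}.txt")
end

# Hypothetical helper: extracts the text of one PDF and returns it as a String.
def extract_text_with_docsplit(pdf_path, output_dir)
  # Loaded lazily so this file can be loaded even where the gem is missing.
  require 'docsplit'
  # ocr: false avoids the OCR pass for PDFs that are not scanned images.
  Docsplit.extract_text(pdf_path, ocr: false, output: output_dir)
  File.read(docsplit_text_path(pdf_path, output_dir))
end

# Usage (file and directory names are placeholders):
#   text = extract_text_with_docsplit('large_report.pdf', 'tmp/text')
```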

Solution 2

After trying different methods, I'm using PDF-Toolkit now. It's quite old, but it's fast, stable and reliable. Besides, it doesn't really need to be new, because it just wraps the xpdf command-line utilities.
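To illustrate the wrapping idea (this is not PDF-Toolkit's own API), here is a minimal sketch that calls `pdftotext` directly through Ruby's standard `Open3`. It assumes `pdftotext` from xpdf or poppler is on the PATH; the helper names `pdftotext_command` and `extract_pdf_text` are invented for the example.

```ruby
require 'open3'

# Builds the argument list for pdftotext.
# -f / -l select the first and last page, -layout keeps the physical layout,
# and a trailing "-" makes pdftotext write the text to stdout.
def pdftotext_command(pdf_path, first_page: nil, last_page: nil, layout: false)
  cmd = ['pdftotext']
  cmd += ['-f', first_page.to_s] if first_page
  cmd += ['-l', last_page.to_s] if last_page
  cmd << '-layout' if layout
  cmd + [pdf_path, '-']
end

# Runs pdftotext and returns the extracted text; raises if the tool reports failure.
def extract_pdf_text(pdf_path, **options)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path, **options))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
end

# Usage (the file name is a placeholder):
#   text = extract_pdf_text('large_report.pdf', first_page: 1, last_page: 10)
```

Passing the command as an array avoids shell quoting problems with file names that contain spaces.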

Solution 3

You could use JRuby and a Java PDF parsing library such as Apache PDFBox (https://www.ohloh.net/p/pdfbox). See also http://java-source.net/open-source/pdf-libraries.

Solution 4

Did you have a look at the CombinePDF library?

It's a pure Ruby solution that allows some PDF manipulation, such as extracting pages, overlaying one PDF page over another, page numbering, writing basic text and tables, etc.

Here's an example of stamping an existing PDF file with a logo. The example reads a PDF file, extracts one page to use as a stamp and stamps another PDF file.

require 'combine_pdf'
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf = CombinePDF.load "content_file.pdf"
pdf.pages.each {|page| page << company_logo}
pdf.save "content_with_logo.pdf"

You can also stamp text, number pages or write simple tables:

require 'combine_pdf'

pdf = CombinePDF.load "content_file.pdf"

pdf.number_pages #adds page numbers. you can add formatting and placement options.

pdf.pages.each {|page| page.textbox "One Way To Stamp"}

# you can also use a shortcut method to stamp pages
pdf.stamp_pages "Another way to stamp"

#you can use the shortcut method for both text and PDF stamps
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf.stamp_pages company_logo

# you can write simple tables
pdf.pages[0].write_table headers: ['first name', 'surname'], table_data: [['John', 'Doe'], ['Mr.', 'Smith']]

pdf.save "content_with_logo.pdf"

It's not meant for complex operations, but it complements most PDF authoring libraries and allows you to use PDF templates instead of writing the whole thing from scratch.

Author: Javier

Javier manages products and sometimes wishes he had more time to play with Ruby, Rails and his kids.

Updated on June 27, 2020

Comments

  • Javier
    Javier almost 4 years

    I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX).

    So far I've found the rather old and simple PDF-toolkit (a pdftotext wrapper) and PDF-reader, which was unable to read most of my files. In principle, though, both libraries provide exactly the functionality I'm looking for.

    My question: Have I missed something? Is there a tool that is better suited (faster and more reliable) to solve my problem?

  • Javier
    Javier about 15 years
    Why would I need OCR ("optical character recognition") to read a PDF that doesn't consist of scanned text? Wouldn't that needlessly slow down the whole process?
  • Javier
    Javier about 15 years
    That sounds like an interesting alternative. Have you seen an implementation or an example somewhere?
  • Terry
    Terry about 15 years
    No. OCR is the process of converting images to text. PDF readers and PDF toolkits utilize this concept to convert an image (the same that is output from, say, a scanner) to text.
  • Javier
    Javier about 15 years
    So basically you're saying that all text inside a PDF consists of an image that needs to be recognized as text first?
  • Terry
    Terry about 15 years
    I'd be lying if I said yes or no. I just know I've had success with OCRs.
  • jashkenas
    jashkenas almost 14 years
    Javier: do take a look at Docsplit. It wraps the Apache PDFBox library for text extraction -- because we've had better quality results with PDFBox than pdftotext.
  • Cris Stringfellow
    Cris Stringfellow about 12 years
    Some PDFs are scanned, others encode the text in ultra weird formats like (t)(h)(i)(s) (s)(o)(m)(e) (t)(e)((x)(t)), and some also use deflate compression.
  • Sean Larkin
    Sean Larkin almost 11 years
    @pw. Installed all the libraries and followed all the documentation for this, however I was having a hard time, do you have any referrals for tutorials or documentation that goes beyond 2 lines of code?
  • Javier
    Javier about 9 years
    @Trejkaz I guess I meant to say that I "only" wanted to read them and therefore wasn't planning to buy software for it? But who knows what I was thinking 6yrs ago? ;-)