Java - Text Extraction from PDF using OCR


I tried PDFBox, and it produced satisfactory results.

Here is the code to extract text from PDF using PDFBox:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFTest {

    public static void main(String[] args) {
        File input = new File("C:/BillOCR/data/bill.pdf");  // The PDF file to extract text from
        File output = new File("D:/SampleText.txt");        // The text file that receives the extracted text

        // try-with-resources closes (and flushes) the document and the writer, even if extraction fails
        try (PDDocument pd = PDDocument.load(input);
             BufferedWriter wr = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream(output), StandardCharsets.UTF_8))) {
            System.out.println(pd.getNumberOfPages());
            System.out.println(pd.isEncrypted());
            pd.save("CopyOfBill.pdf"); // Writes a copy called "CopyOfBill.pdf" to the working directory
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(1); // Start extracting at page 1 (page numbers are 1-based)
            stripper.setEndPage(1);   // Stop after page 1, so only the first page is extracted
            stripper.writeText(pd, wr);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Author: Dax Amin

Updated on July 25, 2022

Comments

  • Dax Amin almost 2 years

    I have a PDF file (part of it is shown below) and want to extract text from it. I have used PDFTextStream, but it doesn't work with this file. (However, it worked with another file that contains simple text.)

    What other OCR libraries are capable of doing it?

    Please help. Thank you.

    [Image: glimpses of the PDF file]

  • Amedee Van Gasse about 8 years
    So you didn't need OCR at all.
  • Nitin almost 4 years
    This works if you have a well-formed PDF. It will not produce any result if someone takes a picture and saves it as a PDF; for that you need OCR.
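
The distinction in the last comment can be checked in code before reaching for an OCR library. The sketch below is only a heuristic and uses no PDFBox calls itself: it assumes you already have the extracted text as a String (for example from PDFTextStripper) and the page count from the document. The class name OcrCheck, the method needsOcr, and the threshold of 20 visible characters per page are all hypothetical choices for illustration, not part of any library.

```java
final class OcrCheck {

    // Hypothetical threshold: fewer visible characters per page than this
    // suggests the pages are scanned images without a text layer.
    static final int MIN_CHARS_PER_PAGE = 20;

    // Returns true when the extracted text is too sparse to be a real text layer.
    static boolean needsOcr(String extractedText, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        if (extractedText == null) {
            return true;
        }
        // Count only non-whitespace characters; image-only pages often yield just line breaks.
        long visible = extractedText.chars()
                .filter(c -> !Character.isWhitespace(c))
                .count();
        return visible < (long) MIN_CHARS_PER_PAGE * pageCount;
    }

    public static void main(String[] args) {
        // Whitespace only, as an image-only page might produce
        System.out.println(needsOcr("\n \n", 1)); // prints "true"
        // A page with a real text layer
        System.out.println(needsOcr("Invoice 42: total due 118.50 EUR, payable within 30 days", 1)); // prints "false"
    }
}
```

If needsOcr returns true, plain text extraction has nothing to work with and the pages would have to be rendered to images and passed to an OCR engine. The threshold is a guess and should be tuned against your own documents.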