Java - Text Extraction from PDF using OCR

java pdf pdfbox text-extraction pdftextstream

11,054

I tried with PDFBox and it produced satisfactory results.

Here is the code to extract text from PDF using PDFBox:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFTest {

    public static void main(String[] args) {
        File input = new File("C:/BillOCR/data/bill.pdf");  // The PDF file to extract text from
        File output = new File("D:/SampleText.txt");        // The text file that receives the extracted text

        // try-with-resources closes the document and flushes/closes the writer, even if extraction fails
        try (PDDocument pd = PDDocument.load(input);
             BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)))) {
            System.out.println(pd.getNumberOfPages());
            System.out.println(pd.isEncrypted());
            pd.save("CopyOfBill.pdf"); // Creates a copy called "CopyOfBill.pdf"
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(1); // Start extracting from page 1
            stripper.setEndPage(1);   // Stop after page 1
            stripper.writeText(pd, wr);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Author by

Dax Amin

Updated on July 25, 2022

Comments

Dax Amin almost 2 years

I have a PDF file (part of it is shown below) and want to extract text from it. I have used PDFTextStream, but it doesn't work with this file. (However, it worked with another file that contains simple text.)

Which other OCR libraries are capable of doing this?

Please help. Thank you.

Amedee Van Gasse about 8 years

So you didn't need OCR at all.

Nitin almost 4 years

This will work if you have a well-formed PDF. It will not give a result if someone takes a picture and saves it as a PDF; for that you need OCR.
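
One way to act on that last point is to check how much text the extraction actually returned before reaching for OCR. The sketch below is only a heuristic and uses plain Java: the class OcrHeuristic, its methods, and the 20-character threshold are all made up for illustration and are not part of PDFBox. It assumes the page text was already obtained elsewhere, for example as the string returned by PDFTextStripper.getText(pd) for a single page.

```java
/** Hypothetical helper, not part of PDFBox: guesses whether a page lacks a usable text layer. */
final class OcrHeuristic {

    // Assumed threshold: fewer visible characters than this suggests a scanned image.
    static final int MIN_CHARS_PER_PAGE = 20;

    /** Counts the non-whitespace characters in the text extracted from one page. */
    static int visibleChars(String pageText) {
        if (pageText == null) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i < pageText.length(); i++) {
            if (!Character.isWhitespace(pageText.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    /** Returns true when the extracted text is too sparse to trust, so OCR is worth trying. */
    static boolean needsOcr(String pageText) {
        return visibleChars(pageText) < MIN_CHARS_PER_PAGE;
    }
}
```

If needsOcr returns true for a page, that page is probably just an image and would have to go through an OCR engine instead. The threshold is a guess and likely needs tuning, since a nearly blank page with a real text layer would also be flagged.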