Extract text from PDF files

16,116

Solution 1

In iText, PdfTextExtractor contains only static methods and its constructor is private, so you cannot create an instance with new.

Call the static method directly on the class instead:
String myLine = PdfTextExtractor.getTextFromPage(reader, pageNumber);
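
To illustrate why the constructor call is rejected while the static call works, here is a small stand-alone sketch. The class PageText is hypothetical and not part of iText; it only mimics the shape Solution 1 describes (private constructor, static methods), and the page data is made up for the demo.

```java
// Hypothetical stand-in that mimics a utility class with a private
// constructor and static methods. It is NOT the iText class.
final class PageText {

    // Private constructor: code outside this class cannot call `new PageText()`.
    private PageText() {
    }

    // Static method: invoked on the class itself, no instance required.
    // Page numbers are 1-based, matching the loop in the question.
    static String getTextFromPage(String[] pages, int pageNumber) {
        return pages[pageNumber - 1];
    }
}

final class StaticCallDemo {

    public static void main(String[] args) {
        // Made-up page contents, for illustration only.
        String[] pages = {"first page", "second page"};

        // PageText parser = new PageText();  // would not compile

        for (int i = 1; i <= pages.length; i++) {
            // Call through the class name rather than through an object.
            System.out.println(PageText.getTextFromPage(pages, i));
        }
    }
}
```

The same change applies to the code in the question: drop the `new PdfTextExtractor()` line and call `PdfTextExtractor.getTextFromPage(reader, i)` on the class.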

Solution 2

If you want to get all the text from the PDF file and save it to a text file, you can use the code below.

It uses the pdfutil.jar library.

import java.io.IOException;
import java.io.PrintWriter;

import com.testautomationguru.utility.PDFUtil;

public class PDFToText {

    public static void main(String[] args) {
        String pdfFilePath = "C:\\abc.pdf";
        PDFUtil pdfUtil = new PDFUtil();

        try {
            // Extract all the text from the PDF first
            String content = pdfUtil.getText(pdfFilePath);

            // try-with-resources closes the writer even if writing fails
            try (PrintWriter out = new PrintWriter("C:\\abc.txt")) {
                out.println(content);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
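
Since the question asks for the text word by word, the extracted string can be split on whitespace afterwards. This is a minimal sketch, assuming the text is already available as a String; the class name WordSplitter and the sample string are illustrative only and do not come from either library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits already-extracted text into words.
final class WordSplitter {

    private WordSplitter() {
    }

    // Splits on any run of whitespace (spaces, tabs, line breaks).
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // An empty input yields one empty token; skip it.
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample text standing in for the extracted PDF content.
        String content = "extract text\nfrom   pdf files";
        for (String word : words(content)) {
            System.out.println(word);
        }
    }
}
```

Splitting on whitespace keeps punctuation attached to the neighbouring word; if that matters, a stricter pattern would be needed.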
Author by

Rim

Updated on June 04, 2022

Comments

  • Rim
    Rim almost 2 years

    I need to extract text (word by word) from a pdf file.

    import java.io.*;
    import com.itextpdf.text.*;
    import com.itextpdf.text.pdf.*;
    import com.itextpdf.text.pdf.parser.*;

    public class pdf {

        private static String INPUTFILE = "http://ontology.buffalo.edu/ontology%28PIC%29.pdf";
        private static String OUTPUTFILE = "c:/new3.pdf";

        public static void main(String[] args) throws DocumentException, IOException {
            Document document = new Document();
            PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(OUTPUTFILE));
            document.open();

            PdfReader reader = new PdfReader(INPUTFILE);
            int n = reader.getNumberOfPages();
            PdfImportedPage page;

            // Go through all pages
            for (int i = 1; i <= n; i++) {
                page = writer.getImportedPage(reader, i);
                System.out.println(i);
                Image instance = Image.getInstance(page);
                document.add(instance);
            }
            document.close();

            PdfReader readerN = new PdfReader(OUTPUTFILE);
            PdfTextExtractor parse = new PdfTextExtractor();

            for (int i = 1; i <= n; i++)
                System.out.println(parser.getTextFromPage(reader, i));
        }
    }

    When I compile the code, I get this error:

    the constructor PdfTextExtractor is undefined

    How do I fix this?