Extract text from a PDF file using JavaScript


Solution 1

Here is a nice example of how to use pdf.js to extract the text: http://git.macropus.org/2011/11/pdftotext/example/

Of course, you will have to remove a lot of code for your purpose, but it should do the job.

Solution 2

I've made an easier approach, using the latest version of the same library (pdf.js), that doesn't need to post messages between iframes.

The following example extracts the text of only the first page of the PDF:

/**
 * Retrieves the text of a specific page within a PDF document obtained through pdf.js
 *
 * @param {Integer} pageNum Specifies the number of the page (1-based)
 * @param {PDFDocument} PDFDocumentInstance The PDF document obtained
 * @returns {Promise<string>} Resolves with the text of the page
 **/
function getPageText(pageNum, PDFDocumentInstance) {
    // Return the promise chain directly, so that it resolves once the text of the page
    // is retrieved and any error from getPage or getTextContent rejects it
    return PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
        // The main trick to obtain the text of the PDF page: use the getTextContent method
        return pdfPage.getTextContent();
    }).then(function (textContent) {
        var textItems = textContent.items;
        var finalString = "";

        // Concatenate the string of each item to the final string
        for (var i = 0; i < textItems.length; i++) {
            var item = textItems[i];

            finalString += item.str + " ";
        }

        // Resolve with the text retrieved from the page
        return finalString;
    });
}

/**
 * Extract the text from the PDF
 */

var PDF_URL  = '/path/to/example.pdf';
PDFJS.getDocument(PDF_URL).then(function (PDFDocumentInstance) {

    // Newer pdf.js releases expose the page count as PDFDocumentInstance.numPages
    var totalPages = PDFDocumentInstance.pdfInfo.numPages;
    var pageNumber = 1;

    // Extract the text
    getPageText(pageNumber, PDFDocumentInstance).then(function (textPage) {
        // Show the text of the page in the console
        console.log(textPage);
    });

}, function (reason) {
    // PDF loading error
    console.error(reason);
});

Read the article about this solution here. As @xarxziux mentioned, the library has changed since the first solution was posted, so that example should no longer work with the latest version of pdf.js. The approach in this answer should work in most cases.
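
The same idea extends to every page of a document. The following is only a sketch, not part of the original answer: the function name `extractAllText` is made up for illustration, and it assumes a document object that exposes `numPages` and a 1-based `getPage(n)`, as recent pdf.js releases do. Loading the document is left to the caller, since the entry point has changed between versions.

```javascript
// Sketch: collect the text of every page of an already loaded pdf.js document.
// Assumption: `pdfDocument` exposes `numPages` and a 1-based `getPage(n)`.
// `extractAllText` is a hypothetical helper name, not a pdf.js function.
async function extractAllText(pdfDocument) {
    const pages = [];

    for (let pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
        const page = await pdfDocument.getPage(pageNum);
        const textContent = await page.getTextContent();

        // Join the string of each text item with a single space
        pages.push(textContent.items.map((item) => item.str).join(" "));
    }

    // One string per page, in page order
    return pages;
}
```

Pages are read one after another to keep the example simple. For large documents, the per-page promises could instead be collected and awaited together with `Promise.all`.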

Author: Coccinelle

Updated on July 09, 2022

Comments

  • Coccinelle
    Coccinelle almost 2 years

    I want to extract text from a PDF file using only JavaScript on the client side, without using a server. I've already found some JavaScript code at the following link: extract text from pdf in Javascript

    and then in

    http://hublog.hubmed.org/archives/001948.html

    and in:

    https://github.com/hubgit/hubgit.github.com/tree/master/2011/11/pdftotext

    1) I would like to know which of the files from the links above are necessary for this extraction. 2) I don't know exactly how to adapt this code for use in an application rather than on the web.

    Any answer is welcome. Thank you.

  • xarxziux
    xarxziux over 7 years
    Note for future Googlers: the official pdf.js project appears to have changed hands several times since the links above were posted, but it currently resides in Mozilla's GitHub page - github.com/mozilla/pdf.js
  • Jun711
    Jun711 over 5 years
    @Allanon Do you know any way to extract the text and keep its semantics? The example just grabs all the text without considering line breaks, paragraphs, titles, etc.
  • Rishabh Garg
    Rishabh Garg over 5 years
    This method does not return the data in the right format. We can't tell where the line breaks and paragraphs are.
  • Admin
    Admin over 5 years
    @Jun711 How did you get the line breaks? How can I achieve it?
  • Sancarn
    Sancarn over 5 years
    @RishabhGarg Bear in mind that PDFs don't know about the format or even the order of the text. You are lucky that you can get the text at all. The exported format may even be inconsistent. This is why the original demo replaced all whitespace with a single space. This at least kinda keeps the format consistent.
  • MrMartin
    MrMartin about 5 years
    PDFDocumentInstance.pdfInfo.numPages should now be PDFDocumentInstance.numPages
  • Carlos Delgado
    Carlos Delgado about 5 years
    @Sancarn you're right. For better results use OCR (optical character recognition) instead.
  • Sancarn
    Sancarn about 5 years
    @CarlosDelgado Personally, I think a combined approach is best. OCRs tend to get many characters incorrect in my experience. Combining the two techniques would likely produce far better results. I'm not sure whether there are any libraries out there for this, however.
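
One way to approximate the line breaks asked about in the comments is to group text items by their vertical position. This is a rough sketch under stated assumptions, not a pdf.js feature: `itemsToLines` and its `tolerance` parameter are hypothetical names, and each item is assumed to carry `str` plus a six-element `transform` array whose last two entries are the x and y position. As noted above, PDFs do not store text order or layout, so rotated pages and multi-column layouts will not come out correctly.

```javascript
// Sketch: rebuild approximate lines from pdf.js text items.
// Assumption: each item has `str` and a 6-element `transform` whose
// entries [4] and [5] are the x and y position of the text.
// `itemsToLines` and `tolerance` are hypothetical names for illustration.
function itemsToLines(items, tolerance = 2) {
    const lines = [];

    for (const item of items) {
        const x = item.transform[4];
        const y = item.transform[5];

        // Reuse an existing line if its vertical position is close enough
        let line = lines.find((candidate) => Math.abs(candidate.y - y) <= tolerance);
        if (!line) {
            line = { y: y, parts: [] };
            lines.push(line);
        }
        line.parts.push({ x: x, str: item.str });
    }

    // PDF y coordinates grow upwards, so sort lines from top to bottom
    lines.sort((a, b) => b.y - a.y);

    // Within a line, order the pieces from left to right
    return lines.map((line) =>
        line.parts
            .sort((a, b) => a.x - b.x)
            .map((part) => part.str)
            .join(" ")
    );
}
```

The result of `getTextContent()` can be passed in as `itemsToLines(textContent.items)`. The tolerance decides how far apart two items may sit vertically and still count as the same line, so it may need tuning per document.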