Get docx file contents using JavaScript/jQuery

Solution 1

With docxtemplater, you can get the full text of a Word document (.docx files only) by calling the doc.getFullText() method.

HTML code:

<body>
    <button onclick="gettext()">Get document text</button>
</body>
<script src="https://cdnjs.cloudflare.com/ajax/libs/docxtemplater/3.26.2/docxtemplater.js"></script>
<script src="https://unpkg.com/pizzip/dist/pizzip.js"></script>
<script src="https://unpkg.com/pizzip/dist/pizzip-utils.js"></script>
<script>
    function loadFile(url, callback) {
        PizZipUtils.getBinaryContent(url, callback);
    }
    function gettext() {
        loadFile(
            "https://docxtemplater.com/tag-example.docx",
            function (error, content) {
                if (error) {
                    throw error;
                }
                var zip = new PizZip(content);
                var doc = new window.docxtemplater(zip);
                var text = doc.getFullText();
                console.log(text);
                alert("Text is " + text);
            }
        );
    }
</script>
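
As the comments note, getFullText() returns one string with no paragraph breaks. One possible workaround is to read the raw XML of word/document.xml and split it into paragraphs yourself. The sketch below is a minimal, regex-based illustration, not docxtemplater's own API: the helper names `extractParagraphs` and `decodeXmlEntities` are hypothetical. It assumes simple documents, skips empty paragraphs, and ignores tabs, manual line breaks, and paragraphs nested inside text boxes.

```javascript
// Hypothetical helper: decode the five predefined XML entities.
function decodeXmlEntities(s) {
  return s
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&"); // runs last so "&amp;lt;" becomes "&lt;", not "<"
}

// Hypothetical helper: return one string per non-empty <w:p> paragraph.
function extractParagraphs(documentXml) {
  // Matches <w:p> or <w:p attr="...">, but not <w:pPr> or self-closing <w:p/>.
  const paragraphRe = /<w:p(?:\s[^>]*)?(?<!\/)>([\s\S]*?)<\/w:p>/g;
  // Matches <w:t> or <w:t xml:space="preserve">, but not <w:tbl> or <w:tab/>.
  const textRe = /<w:t(?:\s[^>]*)?(?<!\/)>([\s\S]*?)<\/w:t>/g;

  const paragraphs = [];
  for (const p of documentXml.matchAll(paragraphRe)) {
    let text = "";
    // Concatenate every text run inside this paragraph.
    for (const t of p[1].matchAll(textRe)) {
      text += t[1];
    }
    paragraphs.push(decodeXmlEntities(text));
  }
  return paragraphs;
}

// Small inline sample standing in for the contents of word/document.xml.
const sampleXml =
  '<w:body><w:p><w:pPr><w:pStyle w:val="Title"/></w:pPr>' +
  '<w:r><w:t>Hello </w:t></w:r><w:r><w:t xml:space="preserve">world</w:t></w:r></w:p>' +
  '<w:p/><w:p w:rsidR="1"><w:r><w:t>A &amp; B</w:t></w:r></w:p></w:body>';

console.log(extractParagraphs(sampleXml).join("\n"));
```

In the browser, the XML string could probably be taken from the PizZip instance created above, for example with zip.file("word/document.xml").asText(); treat that call as an assumption to verify against the PizZip version you load. A real XML parser such as the browser's DOMParser would be more robust than regular expressions for complex documents.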

Solution 2

I know this is an old post, but docxtemplater has moved on and the accepted answer no longer works. This Node.js approach worked for me:

// Node.js only: requires the adm-zip and cheerio packages.
const AdmZip = require("adm-zip");
const cheerio = require("cheerio");

function loadDocx(filename) {
  // Read word/document.xml from the docx archive
  const zip = new AdmZip(filename);
  const xml = zip.readAsText("word/document.xml");
  // Parse the XML into a cheerio DOM
  const $ = cheerio.load(xml, {
    normalizeWhitespace: true,
    xmlMode: true
  });
  // Collect the text of every <w:t> (text run) element
  const out = [];
  $('w\\:t').each((i, el) => {
    out.push($(el).text());
  });
  return out;
}
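
Since the stated goal is a search tool, note that loadDocx() returns an array of text-run fragments, and Word may split a single word across several runs. A search therefore works better on the joined text than on individual fragments. The following is a minimal sketch under that assumption; `searchFragments` and its `contextChars` parameter are hypothetical names, not part of adm-zip or cheerio.

```javascript
// Hypothetical helper: case-insensitive search over the fragments
// returned by loadDocx(). Returns the offset and a context snippet per hit.
function searchFragments(fragments, query, contextChars = 20) {
  // Join first, because a word can be split across adjacent fragments.
  const text = fragments.join("");
  // Assumes lowercasing keeps string length unchanged (true for ASCII text).
  const haystack = text.toLowerCase();
  const needle = query.toLowerCase();

  const hits = [];
  if (needle.length === 0) {
    return hits; // an empty query matches nothing
  }

  let index = haystack.indexOf(needle);
  while (index !== -1) {
    hits.push({
      index,
      snippet: text.slice(
        Math.max(0, index - contextChars),
        index + needle.length + contextChars
      ),
    });
    // Continue after this hit so matches do not overlap.
    index = haystack.indexOf(needle, index + needle.length);
  }
  return hits;
}

// Example with fragments that split "Hello" and "world" across runs.
const sampleFragments = ["Hel", "lo wor", "ld, hello again"];
console.log(searchFragments(sampleFragments, "hello", 3));
```

Because the fragments are joined without separators, text from adjacent paragraphs runs together; if paragraph boundaries matter for the search results, the fragments would need to be grouped per paragraph before joining.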
Author: Abdul Ali

Updated on November 06, 2021

Comments

  • Abdul Ali
    Abdul Ali over 2 years

    I wish to open and read a docx file using client-side technologies (HTML/JS).

    Kindly assist if this is possible. I have found a JavaScript library named docx.js but cannot locate any documentation for it. (http://blog.innovatejs.com/?p=184)

    The goal is to make a browser-based search tool for docx and txt files.

    Any help appreciated.

  • Abdul Ali
    Abdul Ali about 9 years
    Thank you for the reply. I will look into it; it seems to solve the issue.
  • Bit_hunter
    Bit_hunter almost 8 years
    Your code is not working with jszip version 3.0.0. Would you please update it?
  • edi9999
    edi9999 almost 8 years
    Docxtemplater still depends on an earlier JSZip release (before 3.0); you can still install that version, so it should work. In future versions, docxtemplater will work with JSZip 3.x
  • Tyler B. Wear
    Tyler B. Wear almost 7 years
    Why does that API squash all the newlines?
  • edi9999
    edi9999 almost 7 years
    That is how it works: it returns a single string. Otherwise we would have to return formatted output (an array of strings or HTML).
  • edi9999
    edi9999 about 5 years
    You could use pandoc for that, for example to convert docx to HTML: github.com/jgm/pandoc
  • fdrv
    fdrv over 2 years
    Use DocxGen() instead
  • fdrv
    fdrv over 2 years
    Uncaught Error: The constructor with parameters has been removed in JSZip 3.0, please check the upgrade guide. Docxgen is old
  • Udi
    Udi over 2 years
    Life saver, thanks for this!
  • garek007
    garek007 about 2 years
    Is this node JS? What is cheerio?
  • James
    James almost 2 years
    Hi, thanks for the answer. Is there a way to get the line breaks as well? getFullText seems to return no line breaks. Thanks
  • edi9999
    edi9999 almost 2 years
    Hello @James, I've published an enhanced code snippet here that returns the separate paragraphs. docxtemplater.com/faq/…
  • James
    James almost 2 years
    @edi9999, thanks for the link, but the problem is that it is the Node.js version, which seems to run on the server side. Any idea how to do this client-side, using only the user's browser? Thanks