Is it possible to parse MS Word using Apache POI and convert it into XML?


Solution 1

I'd say you have two options, both powered by Apache POI:

One is to use Apache Tika. Tika is a text and metadata extraction toolkit that extracts fairly rich text from Word documents by making the appropriate calls to POI. As a result, Tika will give you XHTML-style XML for the contents of your Word document.
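As a rough sketch of the Tika route, assuming the Tika parser modules are on the classpath: `AutoDetectParser` picks the Word parser, and `ToXMLContentHandler` collects the SAX events as an XHTML string. The class name `TikaWordToXhtml` and the `sample.doc` path are placeholders of mine, not something from Tika or from your project.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;

class TikaWordToXhtml {

    // Lets Tika detect the format and returns the extracted content as XHTML text.
    static String toXhtml(InputStream in) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ToXMLContentHandler handler = new ToXMLContentHandler();
        parser.parse(in, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // "sample.doc" is a placeholder path; pass your own file as the first argument.
        String path = args.length > 0 ? args[0] : "sample.doc";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            System.out.println(toXhtml(in));
        }
    }
}
```

The same call should work for both .doc and .docx input, since Tika does the format detection for you.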

The other option is to use WordToHtmlConverter, a class that was added fairly recently to POI. It turns your Word document into HTML for you, and generally preserves slightly more of the structure and formatting than Tika does.
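A minimal sketch of the WordToHtmlConverter route, assuming a binary .doc file and the POI scratchpad jar (where HWPF lives) on the classpath. The converter fills a DOM document, which you can then serialise with the standard JAXP transformer. The class name `DocToHtml`, the helper names and the `sample.doc` path are my own placeholders.

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.w3c.dom.Document;

class DocToHtml {

    // Reads a .doc stream and returns a DOM tree holding the generated HTML.
    static Document convert(InputStream in) throws Exception {
        HWPFDocument word = new HWPFDocument(in);
        Document target = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        WordToHtmlConverter converter = new WordToHtmlConverter(target);
        converter.processDocument(word);
        return converter.getDocument();
    }

    // Serialises the DOM as indented XML text, so the output stays well-formed.
    static String serialize(Document dom) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // "sample.doc" is a placeholder path; pass your own file as the first argument.
        String path = args.length > 0 ? args[0] : "sample.doc";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            System.out.println(serialize(convert(in)));
        }
    }
}
```

Setting the output method to "xml" rather than "html" is a choice on my part: it keeps every tag closed, which is probably what you want if the result feeds an XML pipeline.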

Depending on the kind of XML you're hoping to get out, one of these should be a good bet for you. I'd suggest you try both against some of your sample files, and see which one is the best fit for your problem domain and needs.

Solution 2

The purpose of the HWPF subproject is exactly that: processing Word files.

http://poi.apache.org/hwpf/index.html

Then, to convert the data to XML, you have to build the XML in one of the usual ways: StAX, JDOM, XStream...
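To give an idea of the StAX variant, here is a small sketch that only uses the JDK. It assumes you already have the paragraph texts as strings (with HWPF these could come from `WordExtractor.getParagraphText()`), and the element names `document` and `paragraph` are made up for the example, so use whatever your target schema needs.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

class ParagraphsToXml {

    // Wraps each paragraph text in a <paragraph> element under a <document> root.
    // Both element names are illustrative, not part of any POI or Word schema.
    static String toXml(List<String> paragraphs) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("document");
        for (String paragraph : paragraphs) {
            xml.writeStartElement("paragraph");
            // trim() drops the trailing paragraph mark; writeCharacters escapes &, < and >.
            xml.writeCharacters(paragraph.trim());
            xml.writeEndElement();
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml(Arrays.asList("First paragraph\r\n", "Fish & chips")));
    }
}
```

This only carries plain text across. If you need formatting (bold, tables, styles), you would have to walk the HWPF ranges and runs yourself and decide how to represent each of them in your XML.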

Apache offers a Quick Guide:

http://poi.apache.org/hwpf/quick-guide.html

I also found this tutorial:

http://sanjaal.com/java/tag/simple-java-tutorial-to-read-microsoft-document-in-java/

If you want to process docx files, you might want to look at XWPF, POI's component for that format, which builds on the OpenXML4J subproject:

http://poi.apache.org/oxml4j/index.html
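For .docx input, a hedged sketch using POI's XWPF classes (which sit on top of OpenXML4J) could look like the following. It assumes the poi-ooxml jar is on the classpath; `DocxParagraphs` and `sample.docx` are placeholder names of mine. It only collects the paragraph texts, which you can then feed into whatever XML writer you prefer.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

class DocxParagraphs {

    // Collects the text of each body-level paragraph; text inside tables is not included.
    static List<String> paragraphs(XWPFDocument docx) {
        List<String> result = new ArrayList<>();
        for (XWPFParagraph paragraph : docx.getParagraphs()) {
            result.add(paragraph.getText());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // "sample.docx" is a placeholder path; pass your own file as the first argument.
        String path = args.length > 0 ? args[0] : "sample.docx";
        try (InputStream in = Files.newInputStream(Paths.get(path));
             XWPFDocument docx = new XWPFDocument(in)) {
            for (String text : paragraphs(docx)) {
                System.out.println(text);
            }
        }
    }
}
```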

Author by

user2434

Updated on June 05, 2022

Comments

  • user2434
    user2434 almost 2 years

    Is it possible to convert an MS Word file to XML using Apache POI?

    If it is, can you point me to any tutorials for doing that?