Creating PDF from Word (DOC) using Apache POI and iText in Java

Solution 1

docx4j includes code for creating a PDF from a docx using iText. It can also use POI to convert a doc to a docx.

There was a time when we supported both methods equally (as well as PDF via XHTML), but we decided to focus on XSL-FO.

If it's an option, you'd be much better off using docx4j to convert a docx to PDF via XSL-FO and FOP.

Use it like so:

        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));

        // Set up font mapper
        Mapper fontMapper = new IdentityPlusMapper();
        wordMLPackage.setFontMapper(fontMapper);

        // Example of mapping missing font Algerian to installed font Comic Sans MS
        PhysicalFont font 
                = PhysicalFonts.getPhysicalFonts().get("Comic Sans MS");
        fontMapper.getFontMappings().put("Algerian", font);             

        org.docx4j.convert.out.pdf.PdfConversion c 
            = new org.docx4j.convert.out.pdf.viaXSLFO.Conversion(wordMLPackage);
        //  = new org.docx4j.convert.out.pdf.viaIText.Conversion(wordMLPackage);

        OutputStream os = new java.io.FileOutputStream(inputfilepath + ".pdf");         
        c.output(os);

Update July 2016

As of docx4j 3.3.0, Plutext's commercial PDF renderer is docx4j's default option for docx to PDF conversion. You can try an online demo at converter-eval.plutext.com

If you want to use the existing docx to XSL-FO to PDF (or other target supported by Apache FOP) approach, then just add the docx4j-export-FO jar to your classpath.

Either way, to convert docx to PDF, you can use the Docx4J facade's toPDF method.

The old docx to PDF via iText code can be found at https://github.com/plutext/docx4j-export-FO/.../docx4j-extras/PdfViaIText/

Solution 2

WordExtractor just grabs the plain text, nothing else. That's why all you're seeing is the plain text.

What you'll need to do is get each paragraph individually, then grab each run, fetch the formatting, and generate the equivalent in PDF.

One option may be to find some code that turns XHTML into a PDF. Then, use Apache Tika to turn your Word document into XHTML (it uses POI under the hood, and handles all the formatting stuff for you), and go from the XHTML on to PDF.

Otherwise, if you're going to do it yourself, take a look at the code in Apache Tika for parsing Word files. It's a really great example of how to get at the images, the formatting, the styles, etc.

Solution 3

Use OpenOffice/LibreOffice and JODConverter. This also mostly works for .doc to .docx. There are problems with graphics that I have not yet worked out, though.

private static void transformDocXToPDFUsingJOD(File in, File out) {
    OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
    DocumentFormat pdf = converter.getFormatRegistry().getFormatByExtension("pdf");
    converter.convert(in, out, pdf);
}

private static OfficeManager officeManager;

@BeforeClass
public static void setupStatic() throws IOException {

    /*officeManager = new DefaultOfficeManagerConfiguration()
      .setOfficeHome("C:/Program Files/LibreOffice 3.6")
      .buildOfficeManager();
      */
    officeManager = new ExternalOfficeManagerConfiguration().setConnectOnStart(true).setPortNumber(8100).buildOfficeManager();


    officeManager.start();
}

@AfterClass
public static void shutdownStatic() throws IOException {

    officeManager.stop();
}

You need to be running LibreOffice as a server to make this work. From the command line you can do this using:

"C:\Program Files\LibreOffice 3.6\program\soffice.exe" -accept="socket,host=0.0.0.0,port=8100;urp;LibreOffice.ServiceManager" -headless -nodefault -nofirststartwizard -nolockcheck -nologo -norestore
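If you would rather start the listener from Java than from a shell, a minimal sketch along these lines might work. It assumes the same install path and port as the command above; the class name SofficeLauncher and the buildCommand helper are made up for illustration and are not part of JODConverter. ProcessBuilder passes each argument verbatim, so the quotes around the -accept value are not needed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds and (optionally) starts the soffice listener.
class SofficeLauncher {

    // Returns the same arguments as the command line above, one per list entry.
    static List<String> buildCommand(String sofficePath, int port) {
        List<String> cmd = new ArrayList<>();
        cmd.add(sofficePath);
        cmd.add("-accept=socket,host=0.0.0.0,port=" + port
                + ";urp;LibreOffice.ServiceManager");
        cmd.add("-headless");
        cmd.add("-nodefault");
        cmd.add("-nofirststartwizard");
        cmd.add("-nolockcheck");
        cmd.add("-nologo");
        cmd.add("-norestore");
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        // Assumed default path; pass your own soffice location as the first argument.
        String soffice = args.length > 0 ? args[0]
                : "C:\\Program Files\\LibreOffice 3.6\\program\\soffice.exe";
        List<String> cmd = buildCommand(soffice, 8100);
        System.out.println(String.join(" ", cmd));

        // Only try to launch if the executable is actually there.
        if (new File(soffice).isFile()) {
            new ProcessBuilder(cmd).inheritIO().start();
        }
    }
}
```

As far as I know, newer LibreOffice releases document these switches with a double dash (--headless, --accept=...), so adjust the list if your version rejects the single-dash forms.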

Solution 4

I have successfully used Apache FOP to convert a 'WordML' document to PDF. WordML is the Office 2003 way of saving a Word document as XML. XSLT stylesheets can be found on the web to transform this XML to XSL-FO, which in turn can be rendered by FOP into PDF (among other outputs).

It's not so different from the solution plutext offered, except that it doesn't read a .doc document, whereas docx4j apparently does. If your requirements are flexible enough to have WordML style documents as input, this might be worth looking into.
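To give an idea of the XSLT step, here is a rough sketch using only the JDK's built-in javax.xml.transform API. The stylesheet is a toy I made up for illustration (one fo:block per paragraph, no formatting, images or tables), not one of the published WordML stylesheets, and the class and method names are hypothetical. Rendering the resulting XSL-FO with FOP is a separate step and is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class: WordML (Word 2003 XML) to XSL-FO via XSLT.
class WordMlToFo {

    // Toy stylesheet (an assumption): emits one fo:block per w:p, text only.
    static final String TOY_XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'"
      + " xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'"
      + " exclude-result-prefixes='w'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<fo:root>"
      + "<fo:layout-master-set>"
      + "<fo:simple-page-master master-name='A4' page-height='29.7cm'"
      + " page-width='21cm' margin='2cm'>"
      + "<fo:region-body/>"
      + "</fo:simple-page-master>"
      + "</fo:layout-master-set>"
      + "<fo:page-sequence master-reference='A4'>"
      + "<fo:flow flow-name='xsl-region-body'>"
      + "<xsl:apply-templates select='//w:body/w:p'/>"
      + "</fo:flow>"
      + "</fo:page-sequence>"
      + "</fo:root>"
      + "</xsl:template>"
      + "<xsl:template match='w:p'>"
      + "<fo:block>"
      + "<xsl:for-each select='w:r/w:t'><xsl:value-of select='.'/></xsl:for-each>"
      + "</fo:block>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Tiny hand-written WordML document: one paragraph made of two runs.
    static final String SAMPLE_WORDML =
        "<w:wordDocument"
      + " xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'>"
      + "<w:body><w:p>"
      + "<w:r><w:t>Hello</w:t></w:r>"
      + "<w:r><w:t> world</w:t></w:r>"
      + "</w:p></w:body>"
      + "</w:wordDocument>";

    // Applies the given stylesheet to the WordML input and returns the XSL-FO text.
    static String toXslFo(String wordMl, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(wordMl)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXslFo(SAMPLE_WORDML, TOY_XSLT));
    }
}
```

With a real stylesheet you would read both the stylesheet and the document from files (StreamSource also accepts a File) and hand the output to FOP instead of printing it.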

Good luck with your project! Wim

Author by

Ismet

Updated on July 19, 2020

Comments

  • Ismet
    Ismet almost 4 years

    I am trying to generate a PDF document from a *.doc document. So far, thanks to Stack Overflow, I have managed to generate it, but with some problems.

    My sample code below generates the PDF without formatting or images, just the text. The document includes blank spaces and images, which are not included in the PDF.

    Here is the code:

            in = new FileInputStream(sourceFile.getAbsolutePath());
            out = new FileOutputStream(outputFile);
    
            WordExtractor wd = new WordExtractor(in);
    
            String text = wd.getText();
    
            Document pdf= new Document(PageSize.A4);
    
            PdfWriter.getInstance(pdf, out);
    
            pdf.open();
            pdf.add(new Paragraph(text));
    
  • Ismet
    Ismet almost 13 years
    I could not really get into the Tika project for parsing the Word files. Do you know of any other project for parsing the Word file, or an example project / description of how to parse it yourself? I only need the formatting and pictures besides the regular text in the Word file.
  • Gagravarr
    Gagravarr almost 13 years
    Tika should be very easy to get started with! Just grab the Tika CLI program and pass the Word file to it, and you'll get back XHTML. Get happy with that, then start calling the Java yourself.