Generating docx file from HTML file using OpenXML

Solution 1

You cannot simply insert HTML content into "document.xml". That part accepts only WordprocessingML content, so you'll have to convert the HTML into WordprocessingML first, see this.
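To make "converting to WordprocessingML" concrete, here is a minimal sketch for the simplest case in the question: plain text separated by <BR> tags. It uses only System.Xml.Linq. The class and method names (SimpleWordML, BuildDocumentXml) are hypothetical, and the sketch handles nothing except <BR>; real HTML needs a proper converter or altChunk.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class SimpleWordML
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: builds a document.xml string with one <w:p> per
    // <BR>-separated line. Any other markup is kept as literal, escaped text.
    public static string BuildDocumentXml(string text)
    {
        var paragraphs = Regex
            .Split(text, @"<br\s*/?>", RegexOptions.IgnoreCase)
            .Select(line => line.Trim())
            .Select(line => new XElement(W + "p",
                new XElement(W + "r",
                    new XElement(W + "t", line)))); // XElement escapes <, > and &

        var document = new XElement(W + "document",
            new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName),
            new XElement(W + "body", paragraphs));

        var declaration = new XDeclaration("1.0", "UTF-8", "yes");
        return declaration.ToString() + Environment.NewLine + document.ToString();
    }
}
```

In the question's CreateDocument method, the returned string could stand in for the hand-built docXml template. Because the text goes through XElement, characters such as & and < no longer break the XML.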

Another option is the altChunk element. With it you can place an HTML file inside your DOCX file and then reference that HTML content at a specific place inside your document, see this.

Lastly, as an alternative, the GemBox.Document library can do exactly what you want:

public static void CreateDocument(string documentFileName, string text)
{
    DocumentModel document = new DocumentModel();
    document.Content.LoadText(text, LoadOptions.HtmlDefault);
    document.Save(documentFileName);
}

Or you could convert the HTML content directly into a DOCX file:

public static void Convert(string documentFileName, string htmlText)
{
    HtmlLoadOptions options = LoadOptions.HtmlDefault;
    using (var htmlStream = new MemoryStream(options.Encoding.GetBytes(htmlText)))
        DocumentModel.Load(htmlStream, options)
                     .Save(documentFileName);
}

Solution 2

I realize I'm 7 years late to the game here. Still, for future people searching on how to convert from HTML to Word Doc, this blog posting on a Microsoft MSDN site gives most of the ingredients necessary to do this using OpenXML. I found the post itself to be confusing, but the source code that he included clarified it all for me.

The only piece that was missing was how to build a Docx file from scratch, instead of how to merge into an existing one as his example shows. I found that tidbit from here.

Unfortunately, the project I used this in is written in VB.NET, so I'm going to share the VB.NET code first, followed by an automated C# conversion of it that may or may not be accurate.

VB.NET code:

Imports DocumentFormat.OpenXml
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing
Imports System.IO

Dim ms As IO.MemoryStream
Dim mainPart As MainDocumentPart
Dim b As Body
Dim d As Document
Dim chunk As AlternativeFormatImportPart
Dim altChunk As AltChunk

Const altChunkID As String = "AltChunkId1"

ms = New MemoryStream()

Using myDoc = WordprocessingDocument.Create(ms,WordprocessingDocumentType.Document)
    mainPart = myDoc.MainDocumentPart

    If mainPart Is Nothing Then
        mainPart = myDoc.AddMainDocumentPart()

        b = New Body()
        d = New Document(b)
        d.Save(mainPart)
    End If

    chunk = mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Xhtml, altChunkID)

    Using chunkStream As Stream = chunk.GetStream(FileMode.Create, FileAccess.Write)
        Using stringStream As StreamWriter = New StreamWriter(chunkStream)
            stringStream.Write("YOUR HTML HERE")
        End Using
    End Using

    altChunk = New AltChunk()
    altChunk.Id = altChunkID
    mainPart.Document.Body.InsertAt(Of AltChunk)(altChunk, 0)
    mainPart.Document.Save()
End Using

C# code:

using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System.IO;

MemoryStream ms;
MainDocumentPart mainPart;
Body b;
Document d;
AlternativeFormatImportPart chunk;
AltChunk altChunk;

const string altChunkID = "AltChunkId1";

ms = new MemoryStream();

using (var myDoc = WordprocessingDocument.Create(ms, WordprocessingDocumentType.Document))
{
    mainPart = myDoc.MainDocumentPart;

    // A freshly created package has no main part yet, so add one with an empty body.
    if (mainPart == null)
    {
        mainPart = myDoc.AddMainDocumentPart();
        b = new Body();
        d = new Document(b);
        d.Save(mainPart);
    }

    // Store the HTML as an alternative-format part inside the package.
    chunk = mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Xhtml, altChunkID);

    using (Stream chunkStream = chunk.GetStream(FileMode.Create, FileAccess.Write))
    {
        using (StreamWriter stringStream = new StreamWriter(chunkStream))
        {
            stringStream.Write("YOUR HTML HERE");
        }
    }

    // Reference the HTML part from the document body by its relationship ID.
    altChunk = new AltChunk();
    altChunk.Id = altChunkID;
    mainPart.Document.Body.InsertAt(altChunk, 0);
    mainPart.Document.Save();
}

Note that I'm using the ms memory stream in another routine, which is where it's disposed of after use.

I hope this helps someone else!
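A comment on this answer points out that the HTML written into the chunk needs to be wrapped in <html></html> to be rendered as HTML. As a small sketch of that idea, the helper below wraps a bare fragment and leaves a full document alone. The names HtmlChunk and EnsureHtmlWrapper are hypothetical, and the check is a plain substring test rather than real HTML parsing.

```csharp
using System;

public static class HtmlChunk
{
    // Hypothetical helper: returns the input unchanged if it already contains
    // an <html> tag, otherwise wraps it in <html><body>...</body></html>.
    public static string EnsureHtmlWrapper(string fragment)
    {
        bool hasHtmlTag =
            fragment.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0;

        return hasHtmlTag
            ? fragment
            : "<html><body>" + fragment + "</body></html>";
    }
}
```

With this in place, the write call in the code above could become stringStream.Write(HtmlChunk.EnsureHtmlWrapper("YOUR HTML HERE")).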

Solution 3

I could successfully convert HTML content to a docx file using OpenXML in .NET Core using this code:

// HtmlConverter is in the HtmlToOpenXml namespace (the html2openxml library).
string html = "<strong>Hello</strong> World";
using (MemoryStream generatedDocument = new MemoryStream())
{
    using (WordprocessingDocument package =
               WordprocessingDocument.Create(generatedDocument,
                   WordprocessingDocumentType.Document))
    {
        MainDocumentPart mainPart = package.MainDocumentPart;
        if (mainPart == null)
        {
            mainPart = package.AddMainDocumentPart();
            new Document(new Body()).Save(mainPart);
        }

        HtmlConverter converter = new HtmlConverter(mainPart);
        converter.ParseHtml(html);
        mainPart.Document.Save();
    }

    // The package is closed at this point; use generatedDocument here,
    // inside the outer using block, to save or return the file.
}

To save to disk:

System.IO.File.WriteAllBytes("filename.docx", generatedDocument.ToArray());

To return the file for download in ASP.NET Core MVC, use:

return File(generatedDocument.ToArray(), 
          "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
          "filename.docx");
Author by

newbie

Updated on June 05, 2022

Comments

  • newbie
    newbie almost 2 years

    I'm using this method for generating docx file:

    public static void CreateDocument(string documentFileName, string text)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Create(documentFileName, WordprocessingDocumentType.Document))
        {
            MainDocumentPart mainPart = wordDoc.AddMainDocumentPart();
    
            string docXml =
                        @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?>
                     <w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
                     <w:body><w:p><w:r><w:t>#REPLACE#</w:t></w:r></w:p></w:body>
                     </w:document>";
    
            docXml = docXml.Replace("#REPLACE#", text);
    
            using (Stream stream = mainPart.GetStream())
            {
                byte[] buf = (new UTF8Encoding()).GetBytes(docXml);
                stream.Write(buf, 0, buf.Length);
            }
        }
    }
    

    It works like a charm:

    CreateDocument("test.docx", "Hello");
    

    But what if I want to put HTML content instead of Hello? For example:

    CreateDocument("test.docx", @"<html><head></head>
                                  <body>
                                        <h1>Hello</h1>
                                  </body>
                            </html>");
    

    Or something like this:

    CreateDocument("test.docx", @"Hello<BR>
                                        This is a simple text<BR>
                                        Third paragraph<BR>
                                        Sign
                            ");
    

    Both cases create an invalid structure for document.xml. Any ideas? How can I generate a docx file from HTML content?

  • JasonPlutext
    JasonPlutext almost 8 years
    My post docx4java.org/blog/2014/09/… ends with a couple of other options
  • tomRedox
    tomRedox over 5 years
    One thing to note here is that the HTML you're inserting needs to be wrapped in the <html></html> tag for it to be rendered as HTML, i.e. stringStream.Write(@"<html><h2>Hi There</h2></html>")
  • Anand Murali
    Anand Murali about 2 years
    What is the namespace of HtmlConverter class?