Saving XML in UTF-8 with MSXML


Solution 1

Your XML file contains no non-ASCII text, so its bytes are identical whether it is encoded as UTF-8 or as ASCII. In my tests, after adding non-ASCII text to test.xml, MSXML always saved the file in UTF-8 and also wrote the BOM if the original file had one.

http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/Byte_order_mark
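
To check what MSXML actually wrote, you can look at the first three bytes of the saved file. The following is only a minimal sketch: the function name HasUtf8Bom is made up for illustration, and it assumes ADODB.Stream (part of ADO) is available on the machine. It reports only whether the UTF-8 BOM (EF BB BF) is present; a UTF-8 file without a BOM is still valid UTF-8.

```vbscript
' Hypothetical helper: returns True if the file starts with the UTF-8 BOM (EF BB BF).
Function HasUtf8Bom(path)
    Dim stm, bytes
    HasUtf8Bom = False

    Set stm = CreateObject("ADODB.Stream")
    stm.Type = 1                ' 1 = adTypeBinary: read raw bytes, no text decoding
    stm.Open
    stm.LoadFromFile path

    If stm.Size >= 3 Then
        bytes = stm.Read(3)     ' byte array holding the first three bytes
        HasUtf8Bom = (AscB(MidB(bytes, 1, 1)) = &HEF) And _
                     (AscB(MidB(bytes, 2, 1)) = &HBB) And _
                     (AscB(MidB(bytes, 3, 1)) = &HBF)
    End If

    stm.Close
End Function

WScript.Echo "UTF-8 BOM present: " & HasUtf8Bom("C:\test.xml")
```

Running it before and after xmlDoc.Save shows whether the BOM survived the round trip.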

Solution 2

You can use two other MSXML classes, the SAX reader (SAXXMLReader) and the XML writer (MXXMLWriter), to write the XML to an output stream in an encoding you choose explicitly.

Here is my helper method that writes to a generic IStream:

class procedure TXMLHelper.WriteDocumentToStream(const Document60: IXMLDOMDocument2; const stream: IStream; Encoding: string = 'UTF-8');
var
    writer: IMXWriter;
    reader: IVBSAXXMLReader;
begin
{
    From http://support.microsoft.com/kb/275883
    INFO: XML Encoding and DOM Interface Methods

    MSXML has native support for the following encodings:
        UTF-8
        UTF-16
        UCS-2
        UCS-4
        ISO-10646-UCS-2
        UNICODE-1-1-UTF-8
        UNICODE-2-0-UTF-16
        UNICODE-2-0-UTF-8

    It also recognizes (internally using the WideCharToMultibyte API function for mappings) the following encodings:
        US-ASCII
        ISO-8859-1
        ISO-8859-2
        ISO-8859-3
        ISO-8859-4
        ISO-8859-5
        ISO-8859-6
        ISO-8859-7
        ISO-8859-8
        ISO-8859-9
        WINDOWS-1250
        WINDOWS-1251
        WINDOWS-1252
        WINDOWS-1253
        WINDOWS-1254
        WINDOWS-1255
        WINDOWS-1256
        WINDOWS-1257
        WINDOWS-1258
}

    if Document60 = nil then
        raise Exception.Create('TXMLHelper.WriteDocumentToStream: Document60 cannot be nil');
    if stream = nil then
        raise Exception.Create('TXMLHelper.WriteDocumentToStream: stream cannot be nil');

    // Set properties on the XML writer - including BOM, XML declaration and encoding
    writer := CoMXXMLWriter60.Create;
    writer.byteOrderMark := True; //Determines whether to write the Byte Order Mark (BOM). The byteOrderMark property has no effect for BSTR or DOM output. (Default True)
    writer.omitXMLDeclaration := False; //Forces the IMXWriter to skip the XML declaration. Useful for creating document fragments. (Default False)
    writer.encoding := Encoding; //Sets and gets encoding for the output. (Default "UTF-16")
    writer.indent := True; //Sets whether to indent output. (Default False)
    writer.standalone := True;

    // Set the XML writer to the SAX content handler.
    reader := CoSAXXMLReader60.Create;
    reader.contentHandler := writer as IVBSAXContentHandler;
    reader.dtdHandler := writer as IVBSAXDTDHandler;
    reader.errorHandler := writer as IVBSAXErrorHandler;
    reader.putProperty('http://xml.org/sax/properties/lexical-handler', writer);
    reader.putProperty('http://xml.org/sax/properties/declaration-handler', writer);


    writer.output := stream; //The resulting document will be written into the provided IStream

    // Now pass the DOM through the SAX handler, and it will call the writer
    reader.parse(Document60);

    writer.flush;
end;

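In order to save to a file, I call the stream version with a TFileStream, wrapped in a TStreamAdapter so that it can be passed as an IStream:

class procedure TXMLHelper.WriteDocumentToFile(const Document60: IXMLDOMDocument2; const filename: string; Encoding: string='UTF-8');
var
    fs: TFileStream;
    adapter: IStream;
begin
    fs := TFileStream.Create(filename, fmCreate or fmShareDenyWrite);
    try
        // TStreamAdapter (unit Classes) exposes a TStream as an IStream; by default it does not own fs
        adapter := TStreamAdapter.Create(fs);
        TXMLHelper.WriteDocumentToStream(Document60, adapter, Encoding);
        adapter := nil; // release the adapter before freeing the stream it wraps
    finally
        fs.Free;
    end;
end;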

These functions are written in Delphi, but you can convert them to whatever language you like.

Solution 3

When you call load, MSXML does not copy the encoding from the XML declaration (the <?xml ...?> processing instruction) into the created document. The document therefore carries no encoding, and MSXML seems to pick whatever it likes when saving. In my environment that is UTF-16, which I don't want.

The solution is to add the processing instruction yourself and specify the encoding there. If you know that the document has no processing instruction yet, the code is trivial:

Set pi = xmlDoc.createProcessingInstruction("xml", _
         "version=""1.0"" encoding=""windows-1250""")
If xmlDoc.childNodes.Length > 0 Then
  Call xmlDoc.insertBefore(pi, xmlDoc.childNodes.Item(0))
End If

If the document may already contain an xml processing instruction, it must be removed first (so the code below must run before the code above). I don't know how to do that with selectSingleNode, so I just iterate over all top-level nodes:

For ich = xmlDoc.childNodes.Length - 1 To 0 Step -1
  Set ch = xmlDoc.childNodes.Item(ich)
  If ch.nodeTypeString = "processinginstruction" And ch.nodeName = "xml" Then
    xmlDoc.removeChild ch
  End If
Next

Sorry if the code doesn't run as-is: I adapted it from a working version that was written in a custom language, not VBScript.
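
Putting the two steps together in the right order (remove first, then insert) might look like the sketch below. The Sub name SetXmlEncoding is hypothetical, and the appendChild branch for an empty document is an addition that is not in the snippets above. The file path is the one from the question.

```vbscript
' Hypothetical helper: replaces any existing XML declaration with one naming encodingName.
Sub SetXmlEncoding(xmlDoc, encodingName)
    Dim i, node, pi

    ' Remove existing top-level <?xml ...?> nodes; iterate backwards so indexes stay valid
    For i = xmlDoc.childNodes.length - 1 To 0 Step -1
        Set node = xmlDoc.childNodes.item(i)
        If node.nodeTypeString = "processinginstruction" And node.nodeName = "xml" Then
            xmlDoc.removeChild node
        End If
    Next

    ' Create the new declaration with the requested encoding
    Set pi = xmlDoc.createProcessingInstruction("xml", _
        "version=""1.0"" encoding=""" & encodingName & """")

    ' The declaration must be the first node of the document
    If xmlDoc.childNodes.length > 0 Then
        xmlDoc.insertBefore pi, xmlDoc.childNodes.item(0)
    Else
        xmlDoc.appendChild pi   ' assumption: also handle a document with no nodes
    End If
End Sub

Dim xmlDoc
Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0")
xmlDoc.async = False
If xmlDoc.load("C:\test.xml") Then
    SetXmlEncoding xmlDoc, "UTF-8"
    xmlDoc.save "C:\test.xml"
Else
    WScript.Echo "Load failed: " & xmlDoc.parseError.reason
End If
```

Keeping both steps in one routine avoids ending up with two declarations, which would make the saved document invalid.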

Author by

J_C

Updated on June 22, 2022

Comments

  • J_C
    J_C almost 2 years

    I'm trying to load a simple Xml file (encoded in UTF-8):

    <?xml version="1.0" encoding="UTF-8"?>
    <Test/>
    

    And save it with MSXML in vbscript:

    Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0")
    
    xmlDoc.Load("C:\test.xml")
    
    xmlDoc.Save "C:\test.xml" 
    

    The problem is that MSXML saves the file in ANSI instead of UTF-8 (despite the original file being encoded in UTF-8).

    The MSDN docs for MSXML say that save() will write the file in whatever encoding the XML is defined in:

    Character encoding is based on the encoding attribute in the XML declaration, such as <?xml version="1.0" encoding="windows-1252"?>. When no encoding attribute is specified, the default setting is UTF-8.

    But this is clearly not working at least on my machine.

    How can MSXML save in UTF-8?

  • J_C
    J_C about 14 years
    And so I guess there's no way MSXML can save in UTF-8 if there's no unicode bytes in the file?
  • Kyle Alons
    Kyle Alons about 14 years
    By definition, there is no difference between an ASCII and UTF-8 file if it contains only ASCII characters (except for the BOM if included)...
  • Robertas
    Robertas about 10 years
    I am going to make use of James Bond hobby and try to resurrect this thread. In the C++ type library I have the following definition of the putProperty method: `HRESULT ISAXXMLReader::putProperty ( unsigned short * pwchName, const _variant_t & varValue )`. This requires an unsigned short pointer as a parameter. Do you by any chance know whether there are any enums or #defines for the supported properties, or how I could specify the lexical-handler and declaration-handler properties?
  • mistertodd
    mistertodd about 10 years
    @wenaxus I don't even know what lexical-handler or declaration-handler are! :) You should probably ask that as a full new question.
  • Robertas
    Robertas about 10 years
    Fair enough, I just saw you were using them in your answer: `reader.putProperty('http://xml.org/sax/properties/lexical-handler', writer);` and thought I'd give it a try.
  • Pow-Ian
    Pow-Ian almost 8 years
    I searched all over the place for a way to remove a processing instruction from an XML file using VBScript; this was the only example I found (once I saw it I felt dumb, of course). Selecting a node or nodes does not seem to work, because a processing instruction is not inside the document element, which is where node selection and XPath start searching from.