Producing valid XML with Java and UTF-8 encoding

68,302

Solution 1

Use a FileOutputStream rather than a FileWriter.

The latter encodes with the platform's default charset, which is almost certainly not UTF-8 (depending on your platform, it's probably Windows-1252 or ISO-8859-1).
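As a minimal sketch of that change, the serialization code from the question could hand the Transformer an OutputStream instead of a Writer, so that the ENCODING output property is what actually determines the bytes. The class name `Utf8XmlWriter` and the helpers `buildDocument` and `write` are made up for illustration; only the JAXP calls come from the question.

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class name; the JAXP calls mirror the code in the question.
class Utf8XmlWriter {

    // Builds the <test version="..."/> document used in the question.
    static Document buildDocument(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("test");
        root.setAttribute("version", text);
        doc.appendChild(root);
        return doc;
    }

    // Serializes to a byte stream, so the Transformer (not a Writer) does the encoding.
    static void write(Document doc, OutputStream out) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.transform(new DOMSource(doc), new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        // u-umlaut and o-umlaut, written as escapes to keep this source file ASCII-only.
        Document doc = buildDocument("\u00FC\u00F6");
        try (OutputStream out = new FileOutputStream("test.xml")) {
            write(doc, out);
        }
    }
}
```

Because `write` accepts any OutputStream, the same helper should also work with a ByteArrayOutputStream when no file is involved.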

Edit (now that I have some time):

An XML document without a prologue is permitted to be encoded as UTF-8 or UTF-16. With a prologue, it is allowed to specify its encoding (the prologue can contain only US-ASCII characters, so the prologue is always readable).

A Reader deals with characters; it will decode the byte stream of the underlying InputStream. As a result, when you pass a Reader to the parser, you are telling it that you've already handled the encoding, so the parser will ignore the prologue. When you pass an InputStream (which reads bytes), it does not make this assumption, and will look to the prologue to define the encoding -- or default to UTF-8/UTF-16 if it's not there.
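To make the Reader versus InputStream distinction concrete, here is a small sketch that parses the same bytes both ways. The class `PrologueEncodingDemo` and its method names are invented for this example, and the behaviour described is what I'd expect from the parser bundled with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original question or answer.
class PrologueEncodingDemo {

    // A document whose prologue declares ISO-8859-1 (umlauts written as escapes).
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><test version=\"\u00FC\u00F6\"/>";

    // Byte input: the parser reads the prologue and decodes the bytes itself.
    static String versionFromStream(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return builder.parse(in).getDocumentElement().getAttribute("version");
        }
    }

    // Character input: the caller decodes the bytes with the charset it chooses.
    static String versionFromReader(byte[] bytes, Charset charset) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            return builder.parse(new InputSource(reader)).getDocumentElement().getAttribute("version");
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] latin1 = XML.getBytes(StandardCharsets.ISO_8859_1);
        // Decoded by the parser, using the encoding named in the prologue.
        System.out.println(versionFromStream(latin1));
        // Decoded by the caller; the parser only ever sees characters.
        System.out.println(versionFromReader(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

If the bytes were really UTF-8 while the prologue still claimed ISO-8859-1, I'd expect `versionFromStream` to trust the prologue and mis-decode the umlauts, whereas `versionFromReader` with UTF-8 would still return them intact.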

I've never tried reading a file that is encoded in UTF-16. I suspect that the parser will look for a Byte Order Mark (BOM) as the first 2 bytes of the file.
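For anyone who wants to try the UTF-16 case, a quick experiment along these lines should do. It relies on Java's "UTF-16" charset writing a big-endian BOM when encoding; the class `Utf16BomDemo` and the method `versionOf` are names I've made up, and I'm assuming the JDK's bundled parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class for experimenting with a UTF-16 encoded document.
class Utf16BomDemo {

    // Encodes a prologue-less document with Java's "UTF-16" charset (BOM included).
    static byte[] utf16Document() {
        return "<test version=\"\u00FC\u00F6\"/>".getBytes(StandardCharsets.UTF_16);
    }

    // Parses raw bytes, leaving encoding detection entirely to the parser.
    static String versionOf(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(bytes))
                      .getDocumentElement().getAttribute("version");
    }

    public static void main(String[] args) throws Exception {
        byte[] utf16 = utf16Document();
        // Show the first two bytes, which should be the byte order mark.
        System.out.printf("%02X %02X%n", utf16[0] & 0xFF, utf16[1] & 0xFF);
        System.out.println(versionOf(utf16));
    }
}
```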

Solution 2

Well, for sure the single bytes 0xFC and 0xF6 are not valid UTF-8 sequences. These should have been finessed into the two-byte sequences 0xC3BC and 0xC3B6.

Most likely the problem is with the original source of the characters being defined as UTF-8 when they are not.
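The byte values mentioned above are easy to check. This small sketch (the class `UmlautBytes` and the helper `hex` are just illustrative names) encodes the two umlauts once as ISO-8859-1 and once as UTF-8 and prints the bytes in hex:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only uses String.getBytes from the standard library.
class UmlautBytes {

    // Formats a byte array as upper-case hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00FC\u00F6"; // u-umlaut, o-umlaut
        System.out.println(hex(text.getBytes(StandardCharsets.ISO_8859_1))); // prints FCF6
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));      // prints C3BCC3B6
    }
}
```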

Author by

Mike Tunnicliffe

Updated on July 05, 2022

Comments

  • Mike Tunnicliffe
    Mike Tunnicliffe almost 2 years

    I am using JAXP to generate and parse an XML document from which some fields are loaded from a database.

    Code to serialize the XML:

    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.newDocument();
    Element root = doc.createElement("test");
    root.setAttribute("version", text);
    doc.appendChild(root);
    
    DOMSource domSource = new DOMSource(doc);
    TransformerFactory tFactory = TransformerFactory.newInstance();
    
    FileWriter out = new FileWriter("test.xml");
    Transformer transformer = tFactory.newTransformer();
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.transform(domSource, new StreamResult(out)); 
    

    Code to parse the XML:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse("test.xml");
    

    And I encounter the following exception:

    [Fatal Error] test.xml:1:4: Invalid byte 1 of 1-byte UTF-8 sequence.
    Exception in thread "main" org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
        at com.test.Test.xml(Test.java:27)
        at com.test.Test.main(Test.java:55)
    

    The String text includes u-umlaut and o-umlaut (character codes 0xFC and 0xF6). These are the characters that are causing the error. When I escape the String myself to use &#xFC; and &#xF6; then the problem goes away. Other entities are automatically encoded when I write out the XML.

    How do I get my output to be written / read properly without substituting these characters myself?

    (I've read the following questions already:

    How to encode characters from Oracle to XML?

    Repairing wrong encoding in XML files)

  • Mike Tunnicliffe
    Mike Tunnicliffe over 15 years
    Nice and easy, I did think of changing to this but discarded the idea since I didn't see a way to specify the encoding in the constructor. It worked just fine, thanks.
  • Mike Tunnicliffe
    Mike Tunnicliffe over 15 years
    Changing the FileWriter to a FileOutputStream did indeed lead to these characters being encoded with two byte sequences: 0xC3BC and 0xC3B6.
  • Ivan_Bereziuk
    Ivan_Bereziuk over 15 years
    Excellent answer -- I will always look for hidden Gotchas in FileWriter from now on!
  • josepmra
    josepmra about 8 years
    I don't want to use files. My information comes from a database. I construct the DOM and pass it to the Xades library.