XmlReader breaks on UTF-8 BOM

Solution 1

The XML string must not (!) contain the BOM. The BOM is only allowed in byte data (e.g. streams) that is encoded as UTF-8. A string is not an encoded representation; it is already a sequence of Unicode characters, so a leading U+FEFF is just an unexpected character in front of the root element.

It therefore seems that you are building the string incorrectly, in code you unfortunately didn't provide.
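To make the bytes-versus-string distinction concrete, here is a minimal sketch. The class and helper names (BomDemo, StartsWithBom, CanParse, BytesWithBom) are made up for illustration and are not part of the question's code. It assumes the bytes were produced by a writer that emitted the UTF-8 preamble, and shows that decoding them with Encoding.UTF8.GetString keeps the BOM as a U+FEFF character at the start of the string.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class BomDemo
{
    // Hypothetical helper: true if the string begins with the U+FEFF character.
    public static bool StartsWithBom(string s)
    {
        return s.Length > 0 && s[0] == '\uFEFF';
    }

    // Hypothetical helper: true if XmlReader can read the whole string without an XmlException.
    public static bool CanParse(string xml)
    {
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml)))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // Builds UTF-8 bytes that start with the preamble (EF BB BF), as a stream-based writer may do.
    public static byte[] BytesWithBom(string xml)
    {
        var preamble = Encoding.UTF8.GetPreamble();
        var body = Encoding.UTF8.GetBytes(xml);
        var all = new byte[preamble.Length + body.Length];
        preamble.CopyTo(all, 0);
        body.CopyTo(all, preamble.Length);
        return all;
    }
}
```

With these helpers, Encoding.UTF8.GetString(BomDemo.BytesWithBom("<root/>")) yields a string for which StartsWithBom is true and CanParse is false, while the same bytes wrapped in a MemoryStream and handed to XmlReader.Create are read without complaint, because the reader handles the BOM when it decodes the bytes itself.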

Edit:

Thanks for posting the serialization code.

You should not write the data to a MemoryStream; write it to a StringWriter instead, which you can then convert to a string with ToString. Because this never passes through a byte representation, it is not only faster but also avoids BOM problems like this one.

Something like this:

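// Requires: using System.IO; using System.Xml; using System.Xml.Serialization;
private static string SerializeResponse(Response response)
{
    // A StringWriter collects characters rather than bytes, so no BOM is ever written.
    using (var output = new StringWriter())
    {
        // Disposing the XmlWriter flushes any buffered output into the StringWriter.
        using (var writer = XmlWriter.Create(output))
        {
            new XmlSerializer(typeof(Response)).Serialize(writer, response);
        }
        return output.ToString();
    }
}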
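One side effect of the StringWriter approach is that the XML declaration will say encoding="utf-16", because XmlWriter takes the declared encoding from the TextWriter it writes to. That is harmless when the string is parsed through a StringReader, but if the declaration should read utf-8, a commonly used workaround is a small StringWriter subclass. The sketch below uses made-up names (Utf8StringWriter, DeclarationDemo) and is an illustration, not something from the original answer.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical helper: a StringWriter that reports UTF-8 instead of the default UTF-16.
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

public static class DeclarationDemo
{
    // Serializes value into the supplied writer and returns the resulting text.
    // The caller chooses the writer, and thereby the encoding named in the declaration.
    public static string Serialize<T>(T value, StringWriter output)
    {
        using (var writer = XmlWriter.Create(output))
        {
            new XmlSerializer(typeof(T)).Serialize(writer, value);
        }
        return output.ToString();
    }
}
```

Calling DeclarationDemo.Serialize(42, new StringWriter()) produces a declaration naming utf-16, while passing a new Utf8StringWriter() produces one naming utf-8. In both cases the string starts directly with the declaration and carries no BOM.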

Solution 2

"In my request handler I'm serializing a response object and sending it back as a string. The serialization process adds a UTF-8 BOM to the front of the string, which causes the same code to break when parsing the response."

So you want to prevent the BOM from being added as part of your serialization process. Unfortunately, you haven't shown your serialization logic.

What you should do is create a UTF8Encoding instance via the UTF8Encoding(bool) constructor, passing false to disable generation of the BOM, and pass this Encoding instance to whichever methods generate your intermediate string.
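A minimal sketch of that idea, assuming the serialization goes through an XmlWriter over a MemoryStream: the BOM-less encoding is handed to the writer via XmlWriterSettings.Encoding. The names NoBomSerializer and SampleResponse are hypothetical stand-ins, since the real Response type is not shown. Note that the constructor flag controls what is emitted when writing; passing such an instance to GetString when decoding does not strip a BOM that is already in the bytes.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical stand-in for the Response type from the question.
public class SampleResponse
{
    public string Message { get; set; }
}

public static class NoBomSerializer
{
    // Serializes value to UTF-8 bytes without a preamble, then decodes them to a string.
    public static string Serialize<T>(T value)
    {
        var settings = new XmlWriterSettings
        {
            // false = do not emit the UTF-8 identifier (BOM) when writing to the stream
            Encoding = new UTF8Encoding(false)
        };

        using (var output = new MemoryStream())
        {
            // Disposing the writer flushes everything into the MemoryStream.
            using (var writer = XmlWriter.Create(output, settings))
            {
                new XmlSerializer(typeof(T)).Serialize(writer, value);
            }
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```

For example, NoBomSerializer.Serialize(new SampleResponse { Message = "ok" }) returns a string that begins with the XML declaration rather than U+FEFF, and the declaration still names utf-8 because the bytes really were written as UTF-8.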

Author by

Matt Mills

Updated on June 08, 2022

Comments

  • Matt Mills

    I have the following XML Parsing code in my application:

        public static XElement Parse(string xml, string xsdFilename)
        {
            var readerSettings = new XmlReaderSettings
            {
                ValidationType = ValidationType.Schema,
                Schemas = new XmlSchemaSet()
            };
            readerSettings.Schemas.Add(null, xsdFilename);
            readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ProcessInlineSchema;
            readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ProcessSchemaLocation;
            readerSettings.ValidationFlags |= XmlSchemaValidationFlags.ReportValidationWarnings;
            readerSettings.ValidationEventHandler +=
                (o, e) => { throw new Exception("The provided XML does not validate against the request's schema."); };
    
            var readerContext = new XmlParserContext(null, null, null, XmlSpace.Default, Encoding.UTF8);
    
            return XElement.Load(XmlReader.Create(new StringReader(xml), readerSettings, readerContext));
        }
    

    I am using it to parse strings sent to my WCF service into XML documents, for custom deserialization.

    It works fine when I read in files and send them over the wire (the request); I've verified that the BOM is not sent across. In my request handler I'm serializing a response object and sending it back as a string. The serialization process adds a UTF-8 BOM to the front of the string, which causes the same code to break when parsing the response.

    System.Xml.XmlException : Data at the root level is invalid. Line 1, position 1.
    

    In the research I've done over the last hour or so, it appears that XmlReader should honor the BOM. If I manually remove the BOM from the front of the string, the response xml parses fine.

    Am I missing something obvious, or at least something insidious?

    EDIT: Here is the serialization code I'm using to return the response:

    private static string SerializeResponse(Response response)
    {
        var output = new MemoryStream();
        var writer = XmlWriter.Create(output);
        new XmlSerializer(typeof(Response)).Serialize(writer, response);
        var bytes = output.ToArray();
        var responseXml = Encoding.UTF8.GetString(bytes);
        return responseXml;
    }
    

    If it's just a matter of the xml incorrectly containing the BOM, then I'll switch to

    var responseXml = new UTF8Encoding(false).GetString(bytes);
    

    but it was not clear at all from my research that the BOM was illegal in the actual XML string; see e.g. the question "c# Detect xml encoding from Byte Array?"