How to best detect the encoding of an XML file?


OK, I should have thought of this earlier. Both XmlTextReader (which gives us the Encoding) and XmlReader.Create (which allows us to specify the encoding) accept a Stream. So how about first opening a FileStream and then using it with both XmlTextReader and XmlReader, like this:

using (var txtreader = new FileStream(filepath, FileMode.Open, FileAccess.Read))
{
    using (var xmlreader = new XmlTextReader(txtreader))
    {
        // Read in the encoding info
        xmlreader.MoveToContent();
        var encoding = xmlreader.Encoding;

        // Rewind to the beginning
        txtreader.Seek(0, SeekOrigin.Begin);

        var settings = new XmlReaderSettings { NameTable = new NameTable() };
        var xmlns = new XmlNamespaceManager(settings.NameTable);
        var context = new XmlParserContext(null, xmlns, "", XmlSpace.Default,
                 encoding);

        using (var reader = XmlReader.Create(txtreader, settings, context))
        {
            return XElement.Load(reader);
        }
    }
}

This works like a charm. Reading XML files in an encoding-independent way should have been more elegant, but at least I'm getting away with opening the file only once.

Author by

Peter Lillevold

I'm a software developer from Norway. I like clean code, industrial metal and good food.

Updated on June 21, 2022

Comments

  • Peter Lillevold almost 2 years

    To load XML files with arbitrary encoding I have the following code:

    Encoding encoding;
    using (var reader = new XmlTextReader(filepath))
    {
        reader.MoveToContent();
        encoding = reader.Encoding;
    }
    
    var settings = new XmlReaderSettings { NameTable = new NameTable() };
    var xmlns = new XmlNamespaceManager(settings.NameTable);
    var context = new XmlParserContext(null, xmlns, "", XmlSpace.Default, 
        encoding);
    using (var reader = XmlReader.Create(filepath, settings, context))
    {
        return XElement.Load(reader);
    }
    

    This works, but it seems a bit inefficient to open the file twice. Is there a better way to detect the encoding such that I can do:

    1. Open file
    2. Detect encoding
    3. Read XML into an XElement
    4. Close file
  • petr k. about 11 years
    Would just calling the XmlReader.Create(Stream) overload work the same way in terms of detecting the encoding?
  • Peter Lillevold about 11 years
    @petrk. - I'm using XmlTextReader explicitly since that's the class providing the Encoding property. Not sure what else you had in mind?
  • petr k. about 11 years
    Right, let me explain. It seems that XElement.Load(XmlReader.Create(new FileStream(filepath, FileMode.Open))) should do the same thing (disposing resources omitted for brevity). The documentation for XmlReader.Create(Stream) says: "The XmlReader scans the first bytes of the stream looking for a byte order mark or other sign of encoding. When encoding is determined, the encoding is used to continue reading the stream, and processing continues parsing the input as a stream of (Unicode) characters." I was wondering if your explicit
  • petr k. about 11 years
    encoding detection is any different from what XmlReader.Create(Stream) overload does.
  • Peter Lillevold about 11 years
    @petrk. interesting... I'm sure I had a situation back then where XmlReader alone didn't work and I had to specify the encoding explicitly via the parser context to make it work. I should have recorded more of my scenario here because now I cannot remember all the details :)
  • petr k. about 11 years
    I am in the exact same situation, also having something similar to your sample in my codebase. I remember trying a lot of things before getting to that solution, but now it seems I could have just used the most straightforward way instead. Not sure if there's a risk of breaking anything, since I have a lot of code depending on this.
  • Peter Lillevold about 11 years
    @petrk. - the only way to be sure is to build some test cases with files of various encodings.
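
The kind of test case suggested in the last comment could be sketched roughly as follows. This is only a minimal illustration, not code from the thread: the class name EncodingDetectionCheck, the helpers WriteSample and LoadWithAutoDetect, and the sample document are all hypothetical, and it works on in-memory byte arrays instead of real files. It writes the same small document in a few encodings and loads each one through XmlReader.Create(Stream) without passing any encoding, so the reader has to rely on the byte order mark or the XML declaration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class EncodingDetectionCheck
{
    // Sample text with non-ASCII characters ("blåbær"), written with escapes
    // so the source file's own encoding does not matter.
    public const string SampleText = "bl\u00E5b\u00E6r";

    // Hypothetical helper: serializes a small document in the given encoding.
    // XmlWriter emits an XML declaration naming that encoding.
    public static byte[] WriteSample(Encoding encoding)
    {
        var doc = new XElement("root", new XElement("name", SampleText));
        using (var ms = new MemoryStream())
        {
            var settings = new XmlWriterSettings { Encoding = encoding };
            using (var writer = XmlWriter.Create(ms, settings))
            {
                doc.Save(writer);
            }
            return ms.ToArray();
        }
    }

    // The "straightforward" load from the comments, with disposal added:
    // no encoding is supplied, so XmlReader has to detect it from the bytes.
    public static XElement LoadWithAutoDetect(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = XmlReader.Create(stream))
        {
            return XElement.Load(reader);
        }
    }

    public static void Main()
    {
        // Assumption: these three encodings are just examples; a real test
        // suite would use whatever encodings the input files actually have.
        var encodings = new[]
        {
            Encoding.UTF8,
            Encoding.Unicode,
            Encoding.GetEncoding("iso-8859-1")
        };

        foreach (var encoding in encodings)
        {
            var loaded = LoadWithAutoDetect(WriteSample(encoding));
            var roundTripped = loaded.Element("name").Value == SampleText;
            Console.WriteLine(encoding.WebName + ": round-trip ok = " + roundTripped);
        }
    }
}
```

Files that lack both a byte order mark and an encoding declaration, or whose declaration does not match the actual bytes, are the cases most worth adding to such a test set, since those are the ones where explicit encoding handling might still make a difference.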