How to change character encoding of XmlReader

Solution 1

To force .NET to read the file in as ISO-8859-9, just use one of the many XmlReader.Create overloads, e.g.

using(XmlReader r = XmlReader.Create(new StreamReader(fileName, Encoding.GetEncoding("ISO-8859-9")))) {
    while(r.Read()) {
        Console.WriteLine(r.Value);
    }
}

However, that may not work: IIRC, the W3C XML standard says that once the XML declaration has been read, a compliant parser should immediately switch to the encoding it specifies, regardless of the encoding it was using before. In your case, if the XML file has no XML declaration, the parser will assume UTF-8 and it will still fail. I may be talking nonsense here, so try it and see. :-)
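To make "try it and see" concrete, here is a minimal sketch that feeds ISO-8859-9 bytes with no XML declaration through a StreamReader into an XmlReader. The class name, the ReadText helper, and the sample document are all hypothetical. It assumes a modern .NET runtime (Core 3.0 or later), where the code-page encodings have to be registered first; on .NET Framework that line should not be needed. As far as I know, the reader only ever sees already-decoded characters when it is given a TextReader, so the StreamReader's encoding is the one that counts.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class StreamReaderDemo
{
    // Hypothetical helper: decodes the bytes with the given encoding
    // and concatenates the text nodes the XmlReader reports.
    public static string ReadText(byte[] data, Encoding encoding)
    {
        var text = new StringBuilder();
        using (var ms = new MemoryStream(data))
        using (var sr = new StreamReader(ms, encoding))
        using (XmlReader r = XmlReader.Create(sr))
        {
            while (r.Read())
            {
                if (r.NodeType == XmlNodeType.Text)
                    text.Append(r.Value);
            }
        }
        return text.ToString();
    }

    public static void Main()
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where ISO-8859-9 is not
        // available until the code-pages provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding latin5 = Encoding.GetEncoding("ISO-8859-9");

        // Sample document without an XML declaration; \u0130 is the
        // dotted capital I, a single byte in ISO-8859-9.
        byte[] data = latin5.GetBytes("<city>\u0130stanbul</city>");

        Console.WriteLine(ReadText(data, latin5));
    }
}
```

Each reader gets its own using statement here, so the StreamReader is disposed regardless of whether the XmlReader closes its input.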

Solution 2

The XmlTextReader class (which is what the static Create method is actually returning, since XmlReader is the abstract base class) is designed to automatically detect encoding from the XML file itself - there's no way to set it manually.

Simply ensure that you include the following XML declaration in the file you are reading:

<?xml version="1.0" encoding="ISO-8859-9"?>

Solution 3

If you can't ensure that the input file has the right header, you could look at one of the other 11 overloads to the XmlReader.Create method.

Some of these take an XmlReaderSettings variable or XmlParserContext variable, or both. I haven't investigated these, but there is a possibility that setting the appropriate values might help here.

There is the XmlReaderSettings.CheckCharacters property - the help for this states:

Instructs the reader to check characters and throw an exception if any characters are outside the range of legal XML characters. Character checking includes checking for illegal characters in the document, as well as checking the validity of XML names (for example, an XML name may not start with a numeral).

So setting this to false might help. However, the help also states:

If the XmlReader is processing text data, it always checks that the XML names and text content are valid, regardless of the property setting. Setting CheckCharacters to false turns off character checking for character entity references.

So further investigation is warranted.
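As one possible direction for that investigation, the sketch below passes an XmlParserContext carrying an explicit Encoding to the XmlReader.Create(Stream, XmlReaderSettings, XmlParserContext) overload. Both the five-argument XmlParserContext constructor and that overload exist, but I am assuming (not asserting) that the reader uses the context's encoding for a stream that has no XML declaration; the class name, the ReadText helper, and the sample data are hypothetical. It also assumes a modern .NET runtime, where ISO-8859-9 must be registered before use.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class ParserContextDemo
{
    // Hypothetical helper: reads raw bytes through an XmlReader whose
    // parser context suggests the encoding to use.
    public static string ReadText(byte[] data, Encoding encoding)
    {
        // Only the encoding is supplied; name table, namespace manager
        // and xml:lang are left unset.
        var context = new XmlParserContext(null, null, null, XmlSpace.None, encoding);
        var settings = new XmlReaderSettings();

        var text = new StringBuilder();
        using (var ms = new MemoryStream(data))
        using (XmlReader r = XmlReader.Create(ms, settings, context))
        {
            while (r.Read())
            {
                if (r.NodeType == XmlNodeType.Text)
                    text.Append(r.Value);
            }
        }
        return text.ToString();
    }

    public static void Main()
    {
        // Assumption: .NET Core 3.0+ / .NET 5+; not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding latin5 = Encoding.GetEncoding("ISO-8859-9");

        // Sample document without an XML declaration.
        byte[] data = latin5.GetBytes("<city>\u0130stanbul</city>");

        Console.WriteLine(ReadText(data, latin5));
    }
}
```

If the input does carry an XML declaration naming a different encoding, the behaviour may differ, so this is worth testing against real files before relying on it.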

Author by

themiurge

Updated on July 09, 2022

Comments

  • themiurge
    themiurge almost 2 years

    I have a simple XmlReader:

    XmlReader r = XmlReader.Create(fileName);
    
    while (r.Read())
    {
        Console.WriteLine(r.Value);
    }
    

    The problem is that the XML file has ISO-8859-9 characters in it, which makes XmlReader throw an "Invalid character in the given encoding." exception. I can solve this problem by adding the line <?xml version="1.0" encoding="ISO-8859-9" ?> at the beginning of the file, but I'd like to solve it another way in case I can't modify the source file. How can I change the encoding of XmlReader?

  • Andreas Ågren
    Andreas Ågren about 13 years
    Be careful: the StreamReader is not closed at the end of the using statement when using a syntax like this. The safe way would be: using(StreamReader sr = new StreamReader(fileName, Encoding.GetEncoding("ISO-8859-9"))) using(XmlReader r = XmlReader.Create(sr)) { ... }
  • Christian Hayter
    Christian Hayter about 13 years
    @Andreas: Are you sure? I've just drilled down into the code with Reflector, and it does appear to close the underlying Stream and TextReader when the reader is closed. Have a look at System.Xml.XmlTextReaderImpl+ParsingState.Close(Boolean).
  • Andreas Ågren
    Andreas Ågren about 13 years
    Yes, I had this issue in .NET 2.0, perhaps it works in newer versions?