Best way to encode text data for XML

127,271

Solution 1

System.Xml handles the encoding for you, so you don't need a method like this.
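
To make that concrete, here is a minimal sketch of letting XmlWriter do the escaping. The class name, method name, and the "note"/"title" element and attribute names are made up for illustration; only the System.Xml calls are real. It avoids C# 3 syntax so it should also compile against .NET 2.0.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlWriterEncodingDemo
{
    // Hypothetical helper: writes <note title="...">...</note> and relies on
    // XmlWriter to escape the attribute value and the element text.
    public static string BuildNote(string title, string body)
    {
        StringBuilder sb = new StringBuilder();
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true; // keep the output to the element only

        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("note");
            writer.WriteAttributeString("title", title); // escaped for attribute context
            writer.WriteString(body);                    // escaped for text context
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

Calling XmlWriterEncodingDemo.BuildNote("Tom & Jerry", "1 < 2") should give back markup in which the ampersand appears as &amp; and the less-than sign as &lt;. No manual Replace() calls are involved.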

Solution 2

Depending on how much you know about the input, you may have to take into account that not all Unicode characters are valid XML characters.

Both Server.HtmlEncode and System.Security.SecurityElement.Escape seem to ignore illegal XML characters, while System.Xml.XmlWriter.WriteString throws an ArgumentException when it encounters illegal characters (unless you disable that check, in which case it ignores them). An overview of library functions is available here.

Edit 2011/8/14: seeing that at least a few people have consulted this answer in the last couple years, I decided to completely rewrite the original code, which had numerous issues, including horribly mishandling UTF-16.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

/// <summary>
/// Encodes data so that it can be safely embedded as text in XML documents.
/// </summary>
public class XmlTextEncoder : TextReader {
    public static string Encode(string s) {
        using (var stream = new StringReader(s))
        using (var encoder = new XmlTextEncoder(stream)) {
            return encoder.ReadToEnd();
        }
    }

    /// <param name="source">The data to be encoded in UTF-16 format.</param>
    /// <param name="filterIllegalChars">It is illegal to encode certain
    /// characters in XML. If true, silently omit these characters from the
    /// output; if false, throw an error when encountered.</param>
    public XmlTextEncoder(TextReader source, bool filterIllegalChars=true) {
        _source = source;
        _filterIllegalChars = filterIllegalChars;
    }

    readonly Queue<char> _buf = new Queue<char>();
    readonly bool _filterIllegalChars;
    readonly TextReader _source;

    public override int Peek() {
        PopulateBuffer();
        if (_buf.Count == 0) return -1;
        return _buf.Peek();
    }

    public override int Read() {
        PopulateBuffer();
        if (_buf.Count == 0) return -1;
        return _buf.Dequeue();
    }

    void PopulateBuffer() {
        const int endSentinel = -1;
        while (_buf.Count == 0 && _source.Peek() != endSentinel) {
            // Strings in .NET are assumed to be UTF-16 encoded [1].
            var c = (char) _source.Read();
            if (Entities.ContainsKey(c)) {
                // Encode all entities defined in the XML spec [2].
                foreach (var i in Entities[c]) _buf.Enqueue(i);
            } else if (!(0x0 <= c && c <= 0x8) &&
                       !new[] { 0xB, 0xC }.Contains(c) &&
                       !(0xE <= c && c <= 0x1F) &&
                       !(0x7F <= c && c <= 0x84) &&
                       !(0x86 <= c && c <= 0x9F) &&
                       !(0xD800 <= c && c <= 0xDFFF) &&
                       !new[] { 0xFFFE, 0xFFFF }.Contains(c)) {
                // Allow if the Unicode codepoint is legal in XML [3].
                _buf.Enqueue(c);
            } else if (char.IsHighSurrogate(c) &&
                       _source.Peek() != endSentinel &&
                       char.IsLowSurrogate((char) _source.Peek())) {
                // Allow well-formed surrogate pairs [1].
                _buf.Enqueue(c);
                _buf.Enqueue((char) _source.Read());
            } else if (!_filterIllegalChars) {
                // Note that we cannot encode illegal characters as entity
                // references due to the "Legal Character" constraint of
                // XML [4]. Nor are they allowed in CDATA sections [5].
                throw new ArgumentException(
                    String.Format("Illegal character: '{0:X}'", (int) c));
            }
        }
    }

    static readonly Dictionary<char,string> Entities =
        new Dictionary<char,string> {
            { '"', "&quot;" }, { '&', "&amp;"}, { '\'', "&apos;" },
            { '<', "&lt;" }, { '>', "&gt;" },
        };

    // References:
    // [1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
    // [2] http://www.w3.org/TR/xml11/#sec-predefined-ent
    // [3] http://www.w3.org/TR/xml11/#charsets
    // [4] http://www.w3.org/TR/xml11/#sec-references
    // [5] http://www.w3.org/TR/xml11/#sec-cdata-sect
}

Unit tests and full code can be found here.
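
If you only need to drop illegal characters and are on .NET 4.0 or later, a shorter route may be XmlConvert.IsXmlChar. The sketch below is an assumption-laden alternative, not part of the class above: the class and method names are made up, and as far as I know IsXmlChar follows the XML 1.0 character rules rather than the stricter XML 1.1 ranges used above. It does not escape <, > or &.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlCharFilterDemo
{
    // Hypothetical helper: returns the input with characters that are not
    // legal in XML removed. Well-formed surrogate pairs are kept as-is.
    public static string StripIllegalXmlChars(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) &&
                i + 1 < text.Length &&
                char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed pair encodes a code point above U+FFFF,
                // all of which are legal XML characters.
                sb.Append(c).Append(text[i + 1]);
                i++; // skip the low surrogate we just consumed
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
            // Anything else (control characters, lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```

The result can then be handed to XmlWriter or another escaping routine, which takes care of the entity references.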

Solution 3

SecurityElement.Escape

documented here
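
A minimal usage sketch, assuming a wrapper class and method name of my own choosing; the only library call is SecurityElement.Escape itself, which replaces <, >, &, " and ' with the corresponding predefined entities.

```csharp
using System;
using System.Security;

public static class SecurityElementEscapeDemo
{
    // Hypothetical wrapper around the built-in method.
    // Escape returns null when given null.
    public static string EscapeForXml(string text)
    {
        return SecurityElement.Escape(text);
    }
}
```

Keep in mind the caveat from Solution 2: this does nothing about characters that are illegal in XML, so control characters need separate handling.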

Solution 4

In the past I have used HttpUtility.HtmlEncode to encode text for XML. It performs the same task, really. I haven't run into any issues with it yet, but that's not to say I won't in the future. As the name implies, it was made for HTML, not XML.

You've probably already read it, but here is an article on XML encoding and decoding.

EDIT: Of course, if you use an XmlWriter or one of the new XElement classes, this encoding is done for you. In fact, you could just take the text, place it in a new XElement instance, then return the string (.ToString()) version of the element. I've heard that SecurityElement.Escape will perform the same task as your utility method as well, but I haven't read much about it or used it.

EDIT2: Disregard my comment about XElement, since you're still on 2.0
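
For readers who are on .NET 3.5 or later, the XElement route mentioned above might look like the sketch below. The class and method names are illustrative only; XElement and its ToString() come from System.Xml.Linq.

```csharp
using System;
using System.Xml.Linq;

public static class XLinqEncodingDemo
{
    // Hypothetical helper: wraps the text in an element and returns the
    // serialized markup, with the text content escaped by XElement.
    public static string ToElementXml(string elementName, string text)
    {
        XElement element = new XElement(elementName, text);
        return element.ToString();
    }
}
```

For example, XLinqEncodingDemo.ToElementXml("v", "a < b & c") returns the element with the text content escaped as "a &lt; b &amp; c". As other comments on this page note, quotes are left alone in element text because XML does not require escaping them there.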

Solution 5

Microsoft's AntiXss library (see the AntiXssEncoder class in System.Web.dll) has methods for this:

AntiXss.XmlEncode(string s)
AntiXss.XmlAttributeEncode(string s)

It has HTML encoders as well:

AntiXss.HtmlEncode(string s)
AntiXss.HtmlAttributeEncode(string s)
Joel Coehoorn
Author by

Joel Coehoorn

2009-2013 Microsoft ASP.Net MVP It's pronounced: koo-horn. The avatar is both because I play counter strike and a nod to lambda expressions in C#. Twitter: @jcoehoorn

Updated on July 05, 2022

Comments

  • Joel Coehoorn
    Joel Coehoorn almost 2 years

    I was looking for a generic method in .Net to encode a string for use in an Xml element or attribute, and was surprised when I didn't immediately find one. So, before I go too much further, could I just be missing the built-in function?

    Assuming for a moment that it really doesn't exist, I'm putting together my own generic EncodeForXml(string data) method, and I'm thinking about the best way to do this.

    The data I'm using that prompted this whole thing could contain bad characters like &, <, ", etc. It could also occasionally contain the properly escaped entities &amp;, &lt;, and &quot;, which means just using a CDATA section may not be the best idea. That seems kind of clunky anyway; I'd much rather end up with a nice string value that can be used directly in the XML.

    I've used a regular expression in the past to just catch bad ampersands, and I'm thinking of using it to catch them in this case as well as the first step, and then doing a simple replace for other characters.

    So, could this be optimized further without making it too complex, and is there anything I'm missing? :

    ' Requires: Imports System.Text.RegularExpressions
    Function EncodeForXml(ByVal data As String) As String
        ' Matches an ampersand that does not already start an entity reference.
        Static badAmpersand As New Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)")

        data = badAmpersand.Replace(data, "&amp;")

        Return data.Replace("<", "&lt;").Replace("""", "&quot;").Replace(">", "&gt;")
    End Function
    

    Sorry to all you C#-only folks. I don't really care which language I use, but I wanted to make the Regex static, and you can't do that in C# without declaring it outside the method, so this will be VB.Net.

    Finally, we're still on .Net 2.0 where I work, but if someone could take the final product and turn it into an extension method for the string class, that'd be pretty cool too.

    Update: The first few responses indicate that .Net does indeed have built-in ways of doing this. But now that I've started, I kind of want to finish my EncodeForXml() method just for the fun of it, so I'm still looking for ideas for improvement. Notably: a more complete list of characters that should be encoded as entities (perhaps stored in a list/map), and something that gets better performance than doing a .Replace() on immutable strings in serial.

  • Joel Coehoorn
    Joel Coehoorn over 15 years
    This is in a library that will be used for both asp.net apps and batch processing (desktop).
  • Joel Coehoorn
    Joel Coehoorn over 15 years
    I'll have to check that- the problems I've had in the past are from reading bad docs generated by others, and I haven't done much writing yet. This would certainly explain the lack of a built-in function.
  • MusiGenesis
    MusiGenesis over 15 years
    Yeah, if the other docs didn't encode correctly, System.XML won't read them correctly.
  • Joel Coehoorn
    Joel Coehoorn over 15 years
    This seems like what I'm looking for, but there are some comments at the bottom indicating the implementation is less than stellar.
  • Justin Ohms
    Justin Ohms over 15 years
    You can actually access Server.HTMLEncode() in a desktop app - all you have to do is add a reference to System.Web
  • MusiGenesis
    MusiGenesis over 15 years
    It would encode the ampersand. Whatever string you put in is exactly what you'll get back out.
  • Joel Coehoorn
    Joel Coehoorn over 15 years
    So then I still need a way to handle incoming data that may be partially encoded.
  • Sekhat
    Sekhat over 14 years
    Or go shout at the guys who aren't encoding their xml correctly.
  • Pag Sun
    Pag Sun over 13 years
    Good answer, have seen the similar solution from this article: seattlesoftware.wordpress.com/2008/09/11/…
  • Michael Kropat
    Michael Kropat over 13 years
    That article explains the problem really well.
  • Dmitry Dzygin
    Dmitry Dzygin about 13 years
    Neither Server.HtmlEncode() nor HttpUtility.HtmlAttributeEncode() replace characters like '\0'
  • codeulike
    codeulike almost 13 years
    For the bit (0x100000 <= c && c <= 0x10FFFF) my compiler warns me: "Comparison to integral constant is useless; the constant is outside the range of type 'char'"
  • Michael Kropat
    Michael Kropat almost 13 years
    Thanks codeulike — pointing out the warning was the kick I needed to finally rewrite the original, buggy code. =) Please try the new code if you get a chance.
  • Cohen
    Cohen over 12 years
    +1 for updating your code :) and revisiting the question (helped me out)
  • Dan7
    Dan7 over 12 years
    @MichaelKropat: Hi, thanks for the class. By any chance do you also have an XmlTextDecoder up?
  • Michael Kropat
    Michael Kropat over 12 years
    The built-in XmlReader should handle that for you. Take a look at this: stackoverflow.com/questions/5304311/…
  • ddotsenko
    ddotsenko over 12 years
    Or, use its relative on the XmlNode object - the .InnerText getter and setter decode and encode.
  • Richard Anthony Freeman-Hein
    Richard Anthony Freeman-Hein over 12 years
    @MichaelKropat The latest version in GitHub doesn't work for me ... for two different tests. I will try the unit tests later and get back to you.
  • Richard Anthony Freeman-Hein
    Richard Anthony Freeman-Hein over 12 years
    @MichaelKropat Sorry, it does work, but for my case, I have to change it slightly to not encode the valid XML entities, but just remove unsupported unicode characters. Dealing with surrogate pairs was my problem, so thanks for the code.
  • Armstrongest
    Armstrongest over 11 years
    &amp; isn't valid XML. I would assume it would use the XML entity: &#38;
  • KreepN
    KreepN over 11 years
    It seems that the easiest solution is the best sometimes. Saved me a large chunk of time, mucho appreciated.
  • Michael
    Michael over 11 years
    @Sekhat That's an unreasonable solution. In the real world, large data vendors often cannot be bothered to fix these types of issues, as doing so would break their clients' data.
  • Admin
    Admin about 10 years
    @Mick: Adhering to common standards is not "unreasonable." If a vendor wants to develop their own alternative messaging format, that's fine, but we should not encourage sloppiness.
  • Michael
    Michael about 10 years
    @TrevorSullivan That approach works reasonably well in academia, but not so much elsewhere. If you only knew how half-baked some of the financial world's implementations of common specs are (ranging from CRC implementations to things as trivial as XML - I'm speaking from my first hand experience only), you might decide to keep your money in a mattress at home.
  • MusiGenesis
    MusiGenesis about 10 years
    @Mick: if you knew how mattresses were made today, you might decide to take your money back to the bank.
  • drzaus
    drzaus almost 9 years
    link dead
  • Stuart Dobson
    Stuart Dobson over 8 years
    Just noting for anyone thinking this is a good idea, System.Web is a big overhead and not really meant for class libraries/windows apps
  • Kev
    Kev over 8 years
    @stuartdotnet - hence the caveat "If this is an ASP.NET app".
  • Don Cheadle
    Don Cheadle about 8 years
    This was accepted? It's not an answer. Sometimes we have to work with code that is using XML strings
  • Don Cheadle
    Don Cheadle about 8 years
    later, I noticed the answer with 60+ votes - so you're right. A pointless comment - other than maybe pointing someone else towards the better answer below.
  • MusiGenesis
    MusiGenesis about 8 years
    @mmcrae: heh, "later, I noticed the answer with 60+ votes" - you're just learning how scroll bars work? Your complaint (that my correct answer is somehow obscuring other answers) is about something fundamental with StackOverflow and nothing at all to do with me. In any event, you seriously think adding the eleventh comment on an 8-year-old answer is somehow "pointing someone else towards the better answer below"?
  • MusiGenesis
    MusiGenesis almost 8 years
    @mmcrae: wow, that worked, thanks! You should switch to JSON, anyway. :)
  • Marcia Pereira Reis
    Marcia Pereira Reis almost 8 years
    XmlConvert.IsXmlChar successfully identified invalid XML chars, although does not escape "<>", etc.
  • schizoid04
    schizoid04 about 6 years
    An example of how to actually use your answer would have been helpful.
  • jamheadart
    jamheadart over 5 years
    An example of how to actually use your answer would have been helpful.
  • MusiGenesis
    MusiGenesis over 5 years
    @jamheadart There really isn't any example to present here, though. The point of my answer was that System.Xml handles encoding for you under the hood and completely automatically - there is nothing to be "used" for encoding.
  • mklement0
    mklement0 almost 2 years
    @Armstrongest, &amp; is valid XML - see en.wikipedia.org/wiki/…. Ronnie: System.Xml.Linq.XText correctly does not escape " and ', because XML doesn't require it. However, like SecurityElement.Escape it also doesn't handle translating illegal chars. into character references. By contrast, System.Xml.XmlDocument does.
  • mklement0
    mklement0 almost 2 years
    Note that neither System.Xml.Linq.XText instances nor the System.Security.SecurityElement.Escape() nor the (made for HTML) System.Web.HttpUtility.HtmlEncode() methods handle translating illegal chars. into character references (e.g., &#x1B; for ESC). By contrast, System.Xml.XmlDocument instances do.