Why doesn't &lt; convert to < while &gt; and other escape characters convert appropriately?


Solution 1

Have a look at the syntax rules described in the XML specification:

The right angle bracket (>) may be represented using the string &gt;, and must, for compatibility, be escaped using either &gt; or a character reference when it appears in the string ]]> in content, when that string is not marking the end of a CDATA section.

So, this is legal:

<xml> > </xml>

...but wouldn't be in this case:

<xml> ]]> </xml>
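
The rule quoted from the spec can be checked with a conforming parser. The sketch below uses Python's standard-library ElementTree (backed by expat), which is an assumption of this example and not something used in the question; `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# A lone > in element content needs no escaping.
bare_gt = is_well_formed("<xml> > </xml>")

# The sequence ]]> in content is rejected unless the > is escaped.
bare_cdata_end = is_well_formed("<xml> ]]> </xml>")
escaped_cdata_end = is_well_formed("<xml> ]]&gt; </xml>")
```

Only the middle document is rejected; the other two parse without complaint.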

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as &apos;, and the double-quote character (") as &quot;.

So, this is legal:

<xml attr=" ' " />

...but wouldn't be in this case:

<xml attr=' ' ' />
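
The attribute rule can be checked the same way. This is a small sketch using Python's standard-library ElementTree (an assumption of this example); `attr_value` is a hypothetical helper that returns the parsed value, or None when the parser rejects the document.

```python
import xml.etree.ElementTree as ET


def attr_value(document: str):
    """Return the parsed value of `attr`, or None if the document is rejected."""
    try:
        return ET.fromstring(document).get("attr")
    except ET.ParseError:
        return None


# An apostrophe is fine inside a double-quoted attribute value.
apostrophe_in_double = attr_value('<xml attr=" \' " />')

# Inside a single-quoted value it has to be written as &apos; ...
escaped_in_single = attr_value("<xml attr=' &apos; ' />")

# ... because a bare apostrophe would end the value early.
bare_in_single = attr_value("<xml attr=' ' ' />")
```

The first two documents yield the same attribute value; the third is not well-formed.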

Some escaping libraries are overly cautious, or they deliberately generalize: escaping every special character everywhere is easier, because the encoder then doesn't need to know the context in which the data will appear.

Solution 2

XSLT files are also XML files, so when you enter an escaped value like &lt; or &gt; it is understood as the < or > character. All characters are then escaped for the output XML as needed.

However, in the context of the output XML document it is not necessary to escape > or ", only <, so only the < is escaped.

You need to look at the XML spec for details, in particular the CharData production (which is the context you are in for the example you are giving). In XML, < and & must always be escaped if they're intended to be character data, and > must be escaped if it comes right after ]].
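
The decode-then-re-escape behaviour can be imitated outside XSLT. The following is a rough sketch in Python, not what any XSLT processor actually runs: the input fragment is modelled on the stylesheet text from the question, and `escape_char_data` is a hypothetical helper that applies only the CharData rules described above.

```python
import xml.etree.ElementTree as ET


def escape_char_data(text: str) -> str:
    """Escape only what character data requires: &, <, and the > in ]]>."""
    text = text.replace("&", "&amp;").replace("<", "&lt;")
    return text.replace("]]>", "]]&gt;")


# Text as it might be typed into a stylesheet, with every character escaped.
source = "<t>&lt;TransType tc=&quot;1201&quot;&gt;</t>"

# Parsing turns the entity references back into plain characters.
decoded = ET.fromstring(source).text

# Writing the text out again escapes only what the context needs.
reserialized = escape_char_data(decoded)
```

Here `decoded` is `<TransType tc="1201">` and `reserialized` is `&lt;TransType tc="1201">`, the same pattern as the output seen in SoapUI. A hand-written helper is used because ElementTree's own serializer also escapes >, which would hide the effect.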

" never needs to be escaped in a CharData context. Where you might need to escape " is if you are inside an attribute value (AttValue production) which is using " as the delimiter, e.g. myattribute="this is a quote: &quot;".

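As an illustration of context-aware escaping for attribute values, Python's standard library has `xml.sax.saxutils.quoteattr`, which picks a delimiter and escapes a quote character only when it has to. Python is an assumption of this sketch; the sample strings are made up.

```python
from xml.sax.saxutils import quoteattr

# Only an apostrophe: double quotes are used as the delimiter, nothing is escaped.
only_apostrophe = quoteattr("it's")

# Only double quotes: single quotes are used as the delimiter, nothing is escaped.
only_double = quoteattr('say "hi"')

# Both kinds: double quotes delimit the value, so the inner ones become &quot;.
both = quoteattr('it\'s "hi"')
```

The returned strings already include the surrounding quotes, so they can be placed directly after `name=` when building a tag.
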
Note that if you are generating XML using a tool that is ignorant of the XML parsing context (e.g., if you are constructing XML using strings), then the safest and easiest thing to do is to escape <, > and " (along with &) all the time. This is why you often see &gt; and &quot; escaped unnecessarily.
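
A context-ignorant encoder of this kind might look like the sketch below. It uses `xml.sax.saxutils.escape` from Python's standard library (Python is an assumption of this example); `escape_everywhere` is a hypothetical name, and escaping the apostrophe as well is an extra choice made here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_everywhere(text: str) -> str:
    """Escape &, <, > and both quote characters, regardless of context."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


sample = 'a < b > "c" & \'d\''
escaped = escape_everywhere(sample)

# The same escaped string is safe as element content and as an attribute value.
doc = ET.fromstring(f'<x a="{escaped}">{escaped}</x>')
```

The output is more verbose than strictly necessary, but it parses back to the original string in both positions.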

Author by

Charu Khurana

Java enthusiast, currently Mule consultant

Updated on June 04, 2022

Comments

  • Charu Khurana
    Charu Khurana almost 2 years

    I send a soap request through SoapUI and use XSLT to generate an XML response back. My response needs to look like this:

    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
    <soap:Body>
      <TXLifeResponse>         
         <TransType tc="1201">General Requirement Order Request</TransType>
      </TXLifeResponse>
     </soap:Body>
    </soap:Envelope> 
    

    I am able to get this response from the below XSLT

    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
         xmlns:acord="http://ACORD.org/Standards/Life/2"  
         xmlns="http://ACORD.org/Standards/Life/2">
      <xsl:output method="xml" encoding="utf-8" indent="yes" />
      <xsl:variable name="TransType_tc" select="//acord:TXLife/acord:TXLifeRequest//acord:TransType/@tc" />
      <xsl:variable name="TransType" select="//acord:TXLife/acord:TXLifeRequest//acord:TransType" />
    
      <xsl:template match="/">       
             <TXLifeResponse>       
              <xsl:element name="TransType">
                <xsl:attribute name="tc"><xsl:value-of select="$TransType_tc"/></xsl:attribute>
                <xsl:value-of select="$TransType"/>
              </xsl:element>
            </TXLifeResponse>
      </xsl:template>
    </xsl:stylesheet>
    

    but I noticed some unexpected behavior. Before using the <xsl:element> and <xsl:attribute> tags, I did something like this in XSLT:

    &lt;TransType tc=&quot;<xsl:value-of select="$TransType_tc"/>&quot;&gt;<xsl:value-of select="$TransType"/> &lt;/TransType&gt;
    

    and the output received in SoapUI was:

    &lt;TransType tc="1201">General Requirement Order Request &lt;/TransType>
    

    Can anyone help me understand why &lt; was not converted but &gt; and &quot; were?

    Thanks

  • Charu Khurana
    Charu Khurana over 11 years
    Can you please mention internet references for this statement: "However, in the context of the output XML document it is not necessary to escape > or ", only <, so only the < is escaped." It will help me understand it better.
  • JLRishe
    JLRishe over 11 years
    I was not able to find any internet references, but I can confirm that > does not require escaping anywhere in XML, and " and ' only require escaping to avoid having them mistaken for closing an attribute value, as in <node greeting="&quot;Hello&quot;" />.
  • Francis Avila
    Francis Avila over 11 years
    @JLRishe, I have expanded the answer with a more thorough explanation. Refer to the full XML spec if you want even more detail. > does indeed need to be escaped sometimes in XML.
  • JLRishe
    JLRishe over 11 years
    Ok, yes, I had read someone mention the ]]> case while I was doing my search, but it seemed like an edge case that wasn't worth mentioning.
  • Francis Avila
    Francis Avila over 11 years
    It's hardly an edge case if it produces an invalid XML document.