Why does en-dash (–) trigger illegal XML character error (C#/SSMS)?

13,997

Solution 1

Can you modify the XML encoding declaration? If so;

declare @xml XML = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?><records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>';

select @xml

(No column name)
<records><r RecordName="Option - Foo" /><r RecordName="Option – Bar" /></records>

Speculative Edit

Both of these fail with illegal xml character:

set @xml = '<?xml version="1.0" encoding="utf-8"?><x> – </x>'
set @xml = '<?xml version="1.0" encoding="utf-16"?><x> – </x>'

because they pass a non-unicode varchar to the XML parser; the string contains Unicode so must be treated as such, i.e. as an nvarchar (utf-16) (otherwise the 3 bytes comprising the are misinterpreted as multiple characters and one or more is not in the acceptable range for XML)

This does pass a nvarchar string to the parser, but fails with unable to switch the encoding:

set @xml = N'<?xml version="1.0" encoding="utf-8"?><x> – </x>'

This is because an nvarchar (utf-16) string is passed to the XML parser but the XML document states its utf-8 and the is not equivalent in the two encodings

This works as everything is utf-16

set @xml = N'<?xml version="1.0" encoding="utf-16"?><x> – </x>'

Solution 2

Please permit me to answer my own question, for the purpose of me understanding it fully myself. I won't accept this as the answer; it is the combination of the other answers that lead me here. If this answer helps you in the future, please upvote the other posts also.

The basic underlying rule is that XML with Unicode characters should be passed to, and parsed as, Unicode by SQL Server. Therefore C# should generate XML as UTF-16; the SSMS and .Net default.

Cause of original problem

This variable declares XML with UTF-8 encoding, but the entity en-dash cannot be used without being encoded in UTF-8. This is wrong:

DECLARE @badxml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

XML parsing: line 3, character 29, illegal xml character

Another approach that doesn't work is to switch UTF-8 to UTF-16 in the XML. The string here is not unicode, so the implicit conversion fails:

DECLARE @xml xml = '<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

XML parsing: line 1, character 56, unable to switch the encoding

Solutions

Alternatives that work are:

1) Leave as UTF-8 but encode with hexadecimal on the entity (reference):

DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option &#x2013; Bar" />
</records>';

2) As above but with decimal encoding on the entity (reference):

DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option &#8211; Bar" />
</records>';

3) Include the original entity, but remove UTF-8 encoding in declaration (SSMS then applies UTF-16; its default):

DECLARE @xml xml = '<?xml version="1.0" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

4) Retain the UTF-16 declaration, but cast the XML to Unicode (note the preceding N before casting as XML):

DECLARE @xml xml = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

Solution 3

SQL Sever internally uses UTF-16. Either let the encoding away or cast to unicode

The reason you are looking for: With UTF-8 specified, this character is not known.

--without your directive, SQL Server picks its default
declare @xml XML = 
'<records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>';
select @xml;

--or UNICODE, but you must use UTF-16
declare @xml2 XML = 
CAST('<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>' AS NVARCHAR(MAX));

select @xml2

UPDATE

UTF-8 means, that there are chunks of 8 bits used to carry information. The base characters are just one chunk, easy going...

Other characters can be encoded as well. There are "c2" and "c3" codes (look here). c3-codes need three chunks to be encoded. But the internally used UTF16 expects 2 byte encoded characters.

Hope this is clear now...

UPDATE 2

This code will show you, that the Hyphen has the ASCII code 45 and your en-dash 150:

DECLARE @x VARCHAR(100)=
'<r RecordName="Option - Foo" /><r RecordName="Option – Bar" />';

WITH RunningNumbers AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
    FROM sys.objects
)
SELECT SUBSTRING(@x,Nmbr,1), ASCII(SUBSTRING(@x,Nmbr,1)) AS ASCII_Code
FROM RunningNumbers
WHERE ASCII(SUBSTRING(@x,Nmbr,1)) IS NOT NULL;

Have a look here All characters with 7 bits are "plain" and should encode without problems. The "extended ASCII" is depending on code tables and could vary. 150 might be en-dash or something else. UTF8 uses some tricky encodings to allow strange characters to be "legal". Obviously (this was new to me too) the internally used UTF16 cannot cope with c3-characters.

Solution 4

The MSDN guidelines says:

SQLXML 4.0 relies upon the limited support for DTDs provided in SQL Server. SQL Server allows for an internal DTD in xml data type data, which can be used to supply default values and to replace entity references with their expanded contents. SQLXML passes the XML data "as is" (including the internal DTD) to the server. You can convert DTDs to XML Schema (XSD) documents using third-party tools, and load the data with inline XSD schemas into the database.

Share:
13,997
EvilDr
Author by

EvilDr

Apparently, this user prefers to scream, "Groovy" at medieval airheads.

Updated on July 24, 2022

Comments

  • EvilDr
    EvilDr almost 2 years

    This is not a question on how to overcome the "XML parsing: ... illegal xml character" error, but about why it is happening? I know that there are fixes(1, 2, 3), but need to know where the problem arises from before choosing the best solution (what causes the error under the hood?).

    We are calling a Java-based webservice using C#. From the strongly-typed data returned, we are creating an XML file that will be passed to SQL Server. The webservice data is encoding using UTF-8, so in C# we create the file, and specify UTF-8 where appropriate:

    var encodingType = Encoding.UTF8;
    // logic removed...
    var xdoc = new XDocument();
    xdoc.Declaration = new XDeclaration("1.0", encodingType.WebName, "yes");
    // logic removed...
    System.IO.File.WriteAllText(xmlFullPath, xdoc.Declaration.ToString() + xdoc.Document.ToString(), encodingType);
    

    This creates an XML file on disk that has contains the following (abbreviated) data:

    <?xml version="1.0" encoding="utf-8" standalone="yes"?>
    <records>
      <r RecordName="Option - Foo" />
      <r RecordName="Option – Bar" />
    </records>
    

    Notice that in the second record, - is different to . I believe the second instance is en-dash.

    If I open that XML file in Firefox/IE/VS2015. it opens without error. The W3C XML validator also works fine. But, SSMS 2012 does not like it:

    declare @xml XML = '<?xml version="1.0" encoding="utf-8" standalone="yes"?><records>
      <r RecordName="Option - Foo" />
      <r RecordName="Option – Bar" />
    </records>';
    

    XML parsing: line 3, character 25, illegal xml character

    So why does en-dash cause the error? From my research, it would appear that

    ...only a few entities that need escaping: <,>,\,' and & in both HTML and XML. Source

    ...of which en-dash is not one. An encoded version (replacing with &#8211;) works fine.

    UPDATE

    Based on the input, people state that en-dash isn't recognised as UTF-8, but yet it is listed here http://www.fileformat.info/info/unicode/char/2013/index.htm So, as a perfectly legal character, why won't SSMS read it when passed as XML (using UTF-8 OR UTF-16)?

    • Martin Honnen
      Martin Honnen about 8 years
      Do you get the same SQL error when omitting the XML declaration and solely feeding declare @xml XML = '<records> <r RecordName="Option - Foo" /> <r RecordName="Option – Bar" /> </records>';?
    • EvilDr
      EvilDr about 8 years
      It works when the declaration is omitted. BUT - isn't that bad practice potentially leading to other issues? Its not the fix so much that I'm after, but the cause.
    • smoore4
      smoore4 about 8 years
      Or try encoding="utf-16" .NET is probably not the problem. It is MSSQL.
    • smoore4
      smoore4 about 8 years
      This is the background on why.....stackoverflow.com/questions/3760788/…
  • EvilDr
    EvilDr about 8 years
    Not sure I understand. What action is needed on my behalf here please? Do I need to define a DTD?
  • EvilDr
    EvilDr about 8 years
    No, it throws an error about being unable to switch the encoding type, which is expected.
  • EvilDr
    EvilDr about 8 years
    Yes it does. I know this already and its a handy workaround, the question was about why its happening?
  • Rahul Tripathi
    Rahul Tripathi about 8 years
    @EvilDr:- yes you can give it a try else the workaround will be to make it unicode using N as Alex pointed.
  • EvilDr
    EvilDr about 8 years
    Is there a standard DTD for this type of thing? .Net certainly doesn't generate it for me
  • Rahul Tripathi
    Rahul Tripathi about 8 years
    @EvilDr:- I don't think that there exist any or may be I am not aware of it.
  • EvilDr
    EvilDr about 8 years
    Yes I read this as a workaround, but the question is why the behavior occurs. Please see update in question for more.
  • EvilDr
    EvilDr about 8 years
    Please see question update. If its a valid UTF8 character, why does it trip up SSMS?
  • EvilDr
    EvilDr about 8 years
    With UTF8, it should be known as its listed as a UTF8 character here: fileformat.info/info/unicode/char/2013/index.htm. Its all very strange to me... :-/
  • Shnugo
    Shnugo about 8 years
    @EvilDr Where do you see that it is a legal UTF8 character? As far as I see your link lists escape sequences for different encodings. Doesn't this 0xE2 0x80 0x93 (e28093) show clearly, that it is not a legal UTF8 character?
  • EvilDr
    EvilDr about 8 years
    I just read the title, "Unicode Character 'EN DASH'". I'm probably wrong; I'll be dreaming about the damn character for the next month
  • EvilDr
    EvilDr about 8 years
    I don't get how the source (UTF8) system then can provide me with that illegal character, or not throw an error of some sort when that entity is created?
  • Alex K.
    Alex K. about 8 years
    utf-8 is not a subset of unicode, the dash is 3 bytes in utf-8 which is perfectly fine as utf-8 is a variable length encoding. Its 2 bytes in utf-16 however and that fact is probably why there is an issue with the encoding of the string literal and the xml parser.
  • Alex K.
    Alex K. about 8 years
    Updated the answer
  • EvilDr
    EvilDr about 8 years
    Lovely answer thanks. I feel sadly I must accept Alex's as it was posted slightly before. Thank you for all your help - a real valuable learning experience.
  • Shnugo
    Shnugo about 8 years
    @EvilDr No problem, happy coding!