illegal xml character on SQL Insert

28,728

Solution 1

I'm going to strip the header...

I'm having the same issue with a funny little apostrophe thing. I think the issue is that by the time the string is getting converted to XML, it's not UTF-8 anymore, but sql server is trying to use the header to decode it. If it's VARCHAR, it's in the client's encoding. If it's NVARCHAR, it's UTF-16. Here are some variations I tested:

SQL (varchar, UTF-8):

SELECT CONVERT(XML,'<?xml version="1.0" encoding="UTF-8"?><t>We’re sorry</t>')

Error:

XML parsing: line 1, character 44, illegal xml character

SQL (nvarchar, UTF-8):

SELECT CONVERT(XML,N'<?xml version="1.0" encoding="UTF-8"?><t>We’re sorry</t>')

Error: XML parsing: line 1, character 38, unable to switch the encoding

SQL (varchar, UTF-16)

SELECT CONVERT(XML,'<?xml version="1.0" encoding="UTF-16"?><t>We’re sorry</t>')

Error:

XML parsing: line 1, character 39, unable to switch the encoding

SQL (nvarchar, UTF-16)

SELECT CONVERT(XML,N'<?xml version="1.0" encoding="UTF-16"?><t>We’re sorry</t>')

Worked!

Solution 2

Have a look at this link from w3, it tells me that:

In HTML, there is a list of some built-in character names like &eacute; for é but XML does not have this. In XML, there are only five built-in character entities: &lt;, &gt;, &amp;, &quot; and &apos; for <, >, &, " and ' respectively. You can define your own entities in a Document Type Definition, or you can use any Unicode character (see next item).

In HTML, there are also numeric character references, such as &#38; for &. You can refer to any Unicode character, but the number is decimal, whereas in the Unicode tables the number is usually in hexadecimal. XML also allows hexadecimal references: &#x26; for example.

This leads me to believe that, &#xE9; might work for an é character.

Also the information at this link from Microsoft states that:

SQLXML 4.0 relies upon the limited support for DTDs provided in SQL Server. SQL Server allows for an internal DTD in xml data type data, which can be used to supply default values and to replace entity references with their expanded contents. SQLXML passes the XML data "as is" (including the internal DTD) to the server. You can convert DTDs to XML Schema (XSD) documents using third-party tools, and load the data with inline XSD schemas into the database.

But all this does not help you if you don't have control over the incoming XML stream. I doubt that it is possible to save an é (or any special character for that matter, except for the built in character entities mentioned above) inside an XML document into an SQL Server XML field, without either adding a DTD or replacing the character with its hexadecimal reference counterpart. In both cases you would need to be able to modify the XML before it goes into the database.

Just a quick example for anyone wanting to go down the "adding a DTD" route.

Here's how to add an internal DTD to an xml file which declares an entity for an é character:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [<!ENTITY eacute "&#233;">]>
<root>
  <RegionName>Qu&eacute;bec</RegionName>
</root>

If you go here and search on the page "Ctrl+F" for "eacute", you end up in a list with examples for other characters which you could just copy and paste into your own internal DTD.

Edit

You could off course add all entities as they are specified at the link above: <!ENTITY eacute "&#233;"><!ENTITY .. // Next entity>, or just copy them all from this file. I do understand how adding an internal DTD to every single XML file you add to the database isn't such a good idea. I would be interested to know if adding it for 1 file fixes your issue though.

Share:
28,728
James A Mohler
Author by

James A Mohler

ColdFusion programmer, photographer, cat owner (Sorry, I had to move. Cat is OK, but I am no longer with him. New place does not allow cats), public transportation enthusiast, and all around nice guy. Creator of the best ColdFusion Landing page ever! See: https://github.com/jmohler1970/Landing/wiki Inspiration This is how I feel when I program: https://www.youtube.com/watch?v=HUuH3itKZ8E (Thank you, Olivia!) Some Vue JS Inspiration Firestore + Vue Authentication Best ColdFusion Resources An Architech's View by Sean Corfield FW/1 The Blog of Ben Nadel on obsessively thorough web application development Forta.com by Ben Forta Riaforge Adobe ColdFusion Blog run by Adobe Adobe ColdFusion run by Adobe Lucee Railo Railo's Blog hostek.com Bootstrap Blog Pit of Success OWASP Best Firebase https://angularfirebase.com/lessons/firestore-nosql-data-modeling-by-example/ My Work My latest work on Github brings Bootstrap and ColdFusion together https://github.com/jmohler1970/ May 2016, I updated WordExtractor: https://github.com/jmohler1970/WordExtractor Something I want to look at: https://restfulapi.net https://www.bugsnag.com

Updated on July 18, 2022

Comments

  • James A Mohler
    James A Mohler almost 2 years

    I am trying to insert the following string into an sql xml field

    <?xml version="1.0" encoding="UTF-8"?>
    <Response> 
        <Ip>x.x.x.x</Ip>
        <CountryCode>CA</CountryCode> 
        <CountryName>Canada</CountryName>
        <RegionCode>QC</RegionCode> 
        <RegionName>Québec</RegionName> 
        <City>Dorval</City> 
        <ZipCode>h9p1j3</ZipCode> 
        <Latitude>45.45000076293945</Latitude> 
        <Longitude>-73.75</Longitude> 
        <MetroCode></MetroCode> 
        <AreaCode></AreaCode> 
    </Response>
    

    The insert code looks like:

    INSERT 
        INTO Traffic(... , xmlGeoLocation, ...)
        VALUES (
            ...
            <!--- 
            <cfqueryparam CFSQLType="cf_sql_varchar" value="#xmlGeoLocation#">, 
            --->
            '#xmlGeoLocation#',
    
            ...
            )
    

    Two bad things happen:

    1. Québec gets turned into Québec

    2. I get an error saying [Macromedia][SQLServer JDBC Driver][SQLServer]XML parsing: line 8, character 16, illegal xml character

    UPDATE:

    The incoming test stream is mostly single byte characters.

    The é is a two byte character. In particular C3A9

    Also I don't have control over the incoming xml stream

  • James A Mohler
    James A Mohler about 11 years
    I don't control the incoming xml stream
  • James A Mohler
    James A Mohler about 11 years
    I am concerned about what happens when the next special character comes along
  • Mathijs Flietstra
    Mathijs Flietstra about 11 years
    I've edited the answer to address your comment, it's not really a solution, but I do wonder if any of this lets you insert the XML with the special character.
  • James A Mohler
    James A Mohler about 11 years
    Your solution solves the question that was asked, but really I need a general solution
  • Adam Cameron
    Adam Cameron about 11 years
    I don't believe this is the solution really. There are no special entities for things like é in XML because they don't need to be escaped. The problem here is down to the data being encoding a certain wait (most likely UTF-8), and either CF or the DB not treating it as UTF-8. Don't worry about the content of the XML, worry about how the raw data's encoding is being handled. I expect the fault lies with ColdFusion or the CFML being used to process the XML because CF and CF developers (sorry James ;-) are notoriously bad at dealing with this sort of thing.
  • Leigh
    Leigh about 11 years
    @AdamCameron - I do not think the raw data's encoding is the complete source of the problem. SQL Server is throwing the error, even when the string is properly encoded as UTF-8. However, once you remove the encoding="UTF-8" declaration from the xml it works just fine. Given that SQL Server does not store the xml header anyway, I would say removing it seems like a reasonable solution.
  • Dominic Fitzpatrick
    Dominic Fitzpatrick over 6 years
    Due to enforced workflow we used the principles of your answer and replaced the header after receiving it as a varchar(max) like this @webtemp = REPLACE(@webtemp, '<?xml version="1.0" encoding="utf-8"?>', '<?xml version="1.0"?>'). It might need tailoring of course. This worked.
  • Eva Donaldson
    Eva Donaldson over 3 years
    Countless hours searching and the answer was here. I did as @DominicFitzpatrick recommended and removed encoding="utf-8" and it magically worked.