xml parse error on illegal character

The parser is correct; whatever produced the serialisation is wrong. As with most of the C0 control characters, it is invalid (actually, worse than that: not well-formed) to put a U+001A SUBSTITUTE into an XML 1.0 file(*), even if encoded as a character reference such as &#x1A;.
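To see the rejection for yourself, here is a minimal sketch using Python's standard-library parser (the question is about ASP.NET, so Python is only a stand-in, and the helper name is_well_formed and the sample Person documents are made up for illustration). It feeds the parser U+001A both as a character reference and as a raw character:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents, not taken from the asker's service.
good = "<Person><Name>Ann</Name></Person>"
bad_ref = "<Person><Name>Ann&#x1A;</Name></Person>"  # U+001A as a character reference
bad_raw = "<Person><Name>Ann\x1a</Name></Person>"    # U+001A as a raw character

print(is_well_formed(good), is_well_formed(bad_ref), is_well_formed(bad_raw))
# prints: True False False
```

Both spellings fail, which is the point: escaping the character does not make it legal in XML 1.0.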

No XML parser will read this, nor should it. Whilst you could put in some horrific hack to try to filter out &#x1A; sequences before passing the text to the parser, such crude hacks wouldn't work for the general case. The serialiser should be fixed to stop producing them.
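As a rough illustration of why the filtering hack is fragile, the sketch below (Python, with a hypothetical naive_strip helper and invented sample documents) applies a plain string replace to two inputs. In the first the sequence really is a forbidden character reference; in the second the same characters sit inside a CDATA section, where they are ordinary text:

```python
import xml.etree.ElementTree as ET

def naive_strip(xml_text: str) -> str:
    """Crude hack: delete every literal '&#x1A;' sequence from the raw text."""
    return xml_text.replace("&#x1A;", "")

# Case 1: the sequence is a (forbidden) character reference, so removing it helps.
broken = "<Name>Ann&#x1A;</Name>"
print(ET.fromstring(naive_strip(broken)).text)

# Case 2: inside CDATA the same characters are ordinary text, not markup.
doc = "<Note><![CDATA[Escape it as &#x1A; in the feed]]></Note>"
before = ET.fromstring(doc).text
after = ET.fromstring(naive_strip(doc)).text
print(before != after)  # True: the hack silently changed legitimate content
```

The replace is also case-sensitive and misses equivalent spellings such as &#x1a; or &#26;, so even as a stopgap it would need more care than this sketch shows.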

Actually I have no idea how the character (often used to mark end-of-file in ancient horrible operating systems) would get into the dataset used by an ASP.NET app, but it wouldn't seem to play any valid role in a name, address or e-mail. Perhaps really you need to be looking at cleaning your data.
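If you do control the serialising side, one way to clean the data is to drop anything outside the XML 1.0 Char production from each field value before it is serialised, rather than patching the XML text afterwards. The following is only a sketch in Python under that assumption; clean_for_xml10, person_to_xml and the sample values are hypothetical names, not part of the asker's ASP.NET code:

```python
import re
import xml.etree.ElementTree as ET

# Matches everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML10 = re.compile(
    "[^\t\n\r\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

def clean_for_xml10(value: str) -> str:
    """Remove characters that can never appear in an XML 1.0 document."""
    return _ILLEGAL_XML10.sub("", value)

def person_to_xml(name: str, email: str) -> str:
    """Serialise a (hypothetical) Person, cleaning each field value first."""
    person = ET.Element("Person")
    ET.SubElement(person, "Name").text = clean_for_xml10(name)
    ET.SubElement(person, "Email").text = clean_for_xml10(email)
    return ET.tostring(person, encoding="unicode")

xml_text = person_to_xml("Ann\x1a Smith", "ann@example.com")
print(xml_text)
# prints: <Person><Name>Ann Smith</Name><Email>ann@example.com</Email></Person>

ET.fromstring(xml_text)  # parses without error
```

Because the cleaning works on the data values and not on the serialised markup, it cannot damage comments, processing instructions or CDATA sections. Whether silently dropping the character is acceptable, as opposed to rejecting the record, is a decision for whoever owns the data.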

(*: It would be legal if encoded as a character reference in an XML 1.1 document. If you absolutely must round-trip control characters through XML, you will have to use XML 1.1. Though that may lead to compatibility issues with older XML parsers, and you still can't use the U+0000 NULL character, so you're never going to be completely binary-safe.)

Author by

bushman

Updated on July 26, 2022

Comments

  • bushman
    bushman almost 2 years

    SO, I am asking as a last resort, as I am completely out of ideas.

    I have a Windows ASP.NET ASMX web services app that returns a serialized Person object with a name, address, email, etc.

    But some attributes in the XML are encoded very weirdly, for instance &#x1a; (I don't know where the encoding takes place; I assume in the serialization process).

    Googling those characters, I see that it is "Windows-1252" encoding.

    The problem occurs during parsing of the XML: I get a parse error of "invalid unicode character" at the position of the 1252 encoding.

    How can I successfully parse it? What solutions do you suggest?

  • bushman
    bushman almost 14 years
    Thank you for your detailed answer. I am presuming the data was entered as a copy-paste from a Word file or something of that sort.
  • Amit Patil
    Amit Patil almost 14 years
    Yeah, that would be common for the C1 control codes in the range 0x80-0x9F (typically coming from code page 1252 smart quotes mis-interpreted as ISO-8859-1), but the 0x1A control code isn't used for anything by Word, or any other common modern Windows app I can think of.
  • bushman
    bushman almost 14 years
    So bob, I have no control over how the data comes to me. Is that horrific hack of removing it from the string the only way, or is there another way to handle it, for example checking before serialization whether the string is legal UTF-8?
  • Amit Patil
    Amit Patil almost 14 years
    It's not an encoding issue: character U+001A is equally invalid in UTF-8, ISO-8859-1 or plain old 7-bit ASCII. You can remove the string &#x1A; with a simple string replace, but any attempt to handle XML with string/regex hacking risks breaking cases where the sequence is not markup, such as in a <!-- comment -->, <?pi?> or <![CDATA[section]]>. But you can't handle this input as XML, because with this control character in it, it simply isn't XML. If it is supposed to be XML, you need to find the party responsible for generating it and complain vociferously until they fix it.
  • bkwdesign
    bkwdesign over 7 years
    We are seeing this &#x1A; character escape come into our web-based Order Entry system (ASP.NET MVC 4 / WCF), and it came from one of our specialists copy/pasting it from Outlook. In Outlook it looks like a smart quote where someone was trying to indicate inches. It saves to our database successfully, but when we try to select it out of SQL using CAST(theDataField as XML) we get an error. (FWIW, our DB field is not of type XML even though that's what we store in it, so the cast that usually succeeds fails here.)