Why are "control" characters illegal in XML 1.0?

xml unicode history

38,527

Solution 1

My understanding is that this range is barred on two grounds: a markup language should have no need to carry transmission and flow-control characters, and including them would create problems for editors and parsers during binary conversion.

I'm struggling to find anything ex cathedra on this from Tim Bray et al though.

edit: some discussion of control chars and a vague admission it wasn't exactly over-engineered:

At 09:27 AM 17/06/00 -0500, Mark Volkmann wrote:

I've never seen a discussion of the reason why most ASCII control characters, such as a form feed, are not allowed in XML documents. Can anyone tell me the reason behind that decision or point me to a spec. that explains that?

I'm not sure we'd do it the same way if we were doing it again. I don't see that they do any real harm. Clearly, if you're optimizing for a highly interoperable content markup language (and XML is) it's legitimate to be suspicious of things like vertical-tab and backspace and so on... but then how can it be consistent to leave in \n and DEL and so on? -Tim

Solution 2

It seems like it could have been required that they be encoded in escapes, e.g. as &#x0007; and &#x001B;

You can do exactly that in XML 1.1, for all but \0.
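As a rough illustration of the XML 1.0 side of this, the sketch below feeds a few tiny documents to Python's xml.etree.ElementTree, which is backed by Expat, an XML 1.0 parser. The helper name parses is made up for this example, and other XML 1.0 parsers may word their errors differently.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if the XML 1.0 parser accepts doc, False if it rejects it."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Tab is one of the few C0 controls XML 1.0 permits: accepted.
print(parses("<a>&#x9;</a>"))
# BEL written as a numeric character reference: rejected in XML 1.0.
print(parses("<a>&#x7;</a>"))
# BEL included as a raw character: rejected as well.
print(parses("<a>\x07</a>"))
```

An XML 1.1 processor would accept the second document (given an XML 1.1 declaration) but still reject the third, because the restricted characters may only appear as references.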

Solution 3

It's probably time to resummarize, also with a view at XML 1.1.

What control character code points are there in Unicode?

U+0000 to U+001F, inherited from ASCII.
U+007F, inherited from ASCII.
U+0080 to U+009F, inherited from Latin-1.
Various special-purpose ranges, standardized explicitly for Unicode and mostly useful in non-markup contexts. They are discussed here block by block, including reasons why and how to use them (or not use them) in XML, and what to do if you run into them anyway.

How does XML look at those control characters?

This is a different classification.

Tab and newline (regardless of the platform dependency of what's a newline) are good. Everybody uses them. Everybody knows what they are supposed to stand for. Allowed in almost all known forms, often even for pretty printing of the markup itself.
U+0000 is evil. Null character? String terminator? Binary noise? Antithesis to both interoperability and markup. Forbidden in all forms.
Anything else? Scarcely used, problematic interoperability, but there are ways to tolerate them even without knowing much about what they are supposed to "control".

Let's now switch our attention to this last category only, control codes proper. That is, the following summary does NOT apply to tabs and newlines: U+0009, U+000A, U+000D, U+0085, U+2028.

XML 1.0 allows all the above ranges of control characters, except U+0000 to U+001F, both as text (directly included characters) and as numeric character references. Allowing U+007F to U+009F was apparently an oversight. XML 1.1 resolved the inconsistency, but in the opposite direction: rather than forbidding U+007F to U+009F, it permits both inherited ranges (apart from U+0000), though only as character references. The standard even gives a detailed rationale:

Finally, there is considerable demand to define a standard representation of arbitrary Unicode characters in XML documents. Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents. In order to improve the robustness of character encoding detection, the additional control characters #x7F through #x9F, which were freely allowed in XML 1.0 documents, now must also appear only as character references. (Whitespace characters are of course exempt.) The minor sacrifice of backward compatibility is considered not significant. Due to potential problems with APIs, #x0 is still forbidden both directly and as a character reference.
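The rules described above can be sketched as a small classifier. The ranges are transcribed from the Char production of XML 1.0 and the Char and RestrictedChar productions of XML 1.1; the function names and the three status labels are invented for this illustration.

```python
def is_xml10_char(cp: int) -> bool:
    """Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """Char production of XML 1.1 (everything except U+0000, surrogates, U+FFFE, U+FFFF)."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_restricted(cp: int) -> bool:
    """RestrictedChar production of XML 1.1: legal only as a character reference."""
    return (
        0x1 <= cp <= 0x8
        or cp in (0xB, 0xC)
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


def status(cp: int):
    """Return (XML 1.0 status, XML 1.1 status) for a code point."""
    v10 = "allowed" if is_xml10_char(cp) else "forbidden"
    if not is_xml11_char(cp):
        v11 = "forbidden"
    elif is_xml11_restricted(cp):
        v11 = "reference only"
    else:
        v11 = "allowed"
    return v10, v11


for cp in (0x0, 0x7, 0x9, 0x7F, 0x85):
    print(f"U+{cp:04X}", status(cp))
```

Note how U+0007 moves from forbidden to reference-only, while U+007F moves from freely allowed to reference-only, which is the "other way round" correction.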

Why do Unicode and XML allow free use of markup-like control characters, apart from the few "inherited" ranges? People should be using markup for those.

Unicode is also used in non-markup contexts, and it is a still evolving character set. It would be too difficult to implement a conforming XML processor if the set of non-control characters was a moving target.

OK, what's wrong with the inherited ranges then, compared to the Unicode-specific control characters?

Lack of standardization. The Unicode consortium didn't really get to choose which numbers are assigned to those "characters", or what is their typical visual presentation or meaning. Full backward compatibility with ASCII (on encoded UTF-8 level) and with Latin-1 (on code point assignment level) forced raw inclusion of these code points regardless of the various specialized and overloaded meanings often attached to them in various text processing contexts.

Wait, are you saying that XML isn't meant to be fully backward compatible with ASCII, unlike UTF-8?

Yeah. That's correct. You need a document element. You can't even put in a raw < or &. So why would you ever need to put in raw control characters?

Solution 4

XML was designed specially around Unicode (specifically UTF-8 and UTF-16) and ISO/IEC 10646, both of which (I'm not quite positive about ISO 10646) contain the transmission/flow control characters which were left over from ASCII and the days of character-based terminals. While those characters still have uses, they don't belong in a format like XML.

As for these new encodings that use those codes for something else, well, it seems that the XML spec may need to adapt.

Solution 5

Why are you double-escaping them? This seems like a good place for &bell; and &escape;. (Undefined, handled by callback from the parser to your code)
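For context, "double-escaping" in an XML 1.0 document usually means an application-level convention layered on top of XML's own escaping. The following is only a sketch of one such convention, not something taken from the question: the backslash notation and the names escape_controls and unescape_controls are assumptions made for this example, and it covers only the forbidden C0 controls (not surrogates, U+FFFE or U+FFFF).

```python
import re
import xml.etree.ElementTree as ET

# C0 controls that XML 1.0 forbids, plus the backslash used as the escape marker.
_FORBIDDEN = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\\]")
# An escaped backslash, or \u followed by exactly four hex digits.
_ESCAPED = re.compile(r"\\(\\|u[0-9A-Fa-f]{4})")


def escape_controls(text: str) -> str:
    """Rewrite forbidden controls as \\uXXXX and double every backslash."""
    def repl(match):
        ch = match.group(0)
        return "\\\\" if ch == "\\" else "\\u%04X" % ord(ch)
    return _FORBIDDEN.sub(repl, text)


def unescape_controls(text: str) -> str:
    """Inverse of escape_controls."""
    def repl(match):
        body = match.group(1)
        return "\\" if body == "\\" else chr(int(body[1:], 16))
    return _ESCAPED.sub(repl, text)


original = "ring\x07 esc\x1b path C:\\tmp"
elem = ET.Element("msg")
elem.text = escape_controls(original)   # first layer: application escaping
data = ET.tostring(elem)                # second layer: XML's own escaping
restored = unescape_controls(ET.fromstring(data).text)
print(restored == original)
```

The cost is that every consumer of the document has to know the convention, which is the interoperability problem the question complains about.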

Author by

Trochee

Computational linguist. Linux (Ubuntu, specifically) is tool-of-choice, but slowly getting the hang of OS-X. An old Perl user, learning Python. Distributed methods, statistical engineering, natural language processing, speech processing.

Updated on March 02, 2020

Comments

Trochee about 4 years

There are a variety of characters that are not legally encodeable in XML 1.0, e.g. U+0007 ('bell') and U+001B ('escape'). Most of the interesting ones are non-whitespace 'control' characters.

It's clear from (e.g.) this question and others that it's the XML spec that's the issue -- but can anyone illuminate me as to why the XML spec forbids these characters?

It seems like it could have been required that they be encoded in escapes, e.g. as &#x0007; and &#x001B; respectively, but perhaps there's a practical reason that the characters were forbidden rather than required to be escaped?

Answerers have suggested that there is some motivation towards avoiding transmission control characters, but Unicode includes many other control-like characters (consider U+200C "zero width non joiner"). I recognize there may be no good reason for this behavior, but I would still like to understand it better.

It's particularly frustrating because when those character values appear in other data formats, I end up "double-escaping" new XML documents that need to encode this.
Trochee over 15 years

thank you for that thought -- I have updated the question to reflect my understanding of control vs other characters. I would welcome 'ex cathedra' links though!
Trochee over 15 years

please see my update of the question above -- if they were designed around Unicode and ISO-10646, why not support the entirety of the standard?
annakata over 15 years

your understanding is not at fault, but try and adjust your thinking to how those characters could make sense in a markup language and you'll see they can't - there is no spoon, as it were (still looking for links btw)
Trochee over 15 years

but as a markup language for data -- and XML is that -- those characters are no different from some other control characters, so it seems like a design error/inconsistency, as the links you provide suggest. Thank you for those links.
Chad Wellington about 14 years

Specifically, 0x1-0x1F and 0x7F-0x9F must be encoded as escapes in XML 1.1. The former were forbidden and the latter were optionally not-escaped in 1.0.
B T over 11 years

I believe what he is saying is that, despite the fact that the standard was built around unicode, they considered the control characters a bad thing. I tend to agree - control "characters" aren't characters at all - simply machine-specific binary codes. They really don't even have any place in Unicode - or ASCII for that matter.
foxxtrot over 11 years

B T, that is basically what I was trying to say. annakata's answer, which is both higher voted AND accepted, ends up making my point much more clearly.
Robe Elckers about 10 years

Thanks for this! This works as I wanted it, I can now use &#x3; to encode an ETX character.
把友情留在无盐 almost 9 years

Interesting. Are existing utils/libs out there jamming blobs into xml this way?
Spike0xff over 8 years

also see Tim Bray's answer, elsewhere on this page. (but is it ex cathedra...)
Guildenstern about 5 years

@Trochee “but as a markup language for data”—XML seems in practice to be a metalanguage for marking up textual data, not data in general. Excluding control characters like “Data Link Escape” seems sensible in that light. However, I agree with you (in your OP) that excluding characters like U+200C "zero width non joiner" makes less sense.