Setting encoding in XML files

Solution 1

If all else fails, read the spec :-).

4.3.3 Character Encoding in Entities

Each external parsed entity in an XML document may use a different encoding for its characters.

[...]

In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" SHOULD be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) SHOULD be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" SHOULD be used for the various encoded forms of JIS X-0208-1997.

It is RECOMMENDED that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings SHOULD use names starting with an "x-" prefix.

Source: http://www.w3.org/TR/REC-xml/

So UTF-8 is written as encoding="UTF-8".

For other character sets not listed above, use the names given in the IANA character set list.

Case of the letters in the character set name is not significant: "However, no distinction is made between use of upper and lower case letters." (IANA character set list). So you could also write encoding="uTf-8" if you feel like it ;-).
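If you do end up mapping the declared name to a decoder yourself, a minimal sketch of case-insensitive lookup might look like the following. It uses Python's `codecs` registry as a stand-in for the IANA list, which is an assumption: Python's registry covers many IANA names and aliases but is not the registry itself. The helper name `resolve_encoding` is made up for this illustration.

```python
import codecs


def resolve_encoding(name):
    """Return the canonical name of a codec Python knows, or None.

    Hypothetical helper: codecs.lookup() ignores letter case and
    treats hyphens and underscores alike, which is similar in spirit
    to the case-insensitive matching described for charset names.
    """
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Name is not known to Python's codec registry.
        return None


if __name__ == "__main__":
    # Try a few spellings; unknown names come back as None.
    for candidate in ("UTF-8", "uTf-8", "windows-1251", "x-no-such-encoding"):
        print(candidate, "->", resolve_encoding(candidate))
```

Spellings that come back as `None` are simply not known to Python's registry; that alone does not tell you whether they are registered with IANA.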

BTW: Are you really, really certain you want to write your own XML parser? This sounds suspiciously like reinventing the wheel.

Solution 2

<?xml version="1.0" encoding="utf-8"?>

should be fine for utf-8.
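For the "set the decoder from the encoding attribute" part, here is a rough sketch of reading the declared name from the raw bytes before decoding. It assumes the document starts directly with the XML declaration and is in an ASCII-compatible encoding (so no BOM sniffing and no UTF-16), and it falls back to UTF-8 when no encoding is declared. The function names are invented for this example.

```python
import codecs
import re

# Matches '<?xml version="..."' optionally followed by 'encoding="..."'.
# The name pattern follows the spec's EncName: a letter, then letters,
# digits, '.', '_' or '-'.
_DECL = re.compile(
    rb"""^<\?xml\s+version\s*=\s*["'][^"']*["']"""
    rb"""(?:\s+encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["'])?"""
)


def declared_encoding(data, default="utf-8"):
    """Return the encoding name from the XML declaration, or the default."""
    match = _DECL.match(data)
    if match is None or match.group(1) is None:
        return default
    return match.group(1).decode("ascii")


def decode_xml(data):
    """Decode raw XML bytes using the declared (or default) encoding."""
    name = declared_encoding(data)
    # codecs.lookup raises LookupError for names Python does not know.
    return data.decode(codecs.lookup(name).name)


if __name__ == "__main__":
    raw = '<?xml version="1.0" encoding="windows-1251"?><a>Привет</a>'.encode("cp1251")
    print(declared_encoding(raw))
    print(decode_xml(raw))
```

A real parser would also have to handle byte order marks and the UTF-16 family, which this sketch deliberately leaves out.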

Author: Albus Dumbledore


Updated on June 06, 2022

Comments

  • Albus Dumbledore
    about 2 years

    What are the valid XML encoding strings? For instance, how do you specify UTF-8:

    • encoding="utf8"
    • encoding="utf-8"
    • etc

    Or Windows 1251:

    • encoding="windows-1251"
    • encoding="windows1251"
    • encoding="cp-1251"
    • etc.

    I am making a character decoder as well as an XML parser, so I need to be able to set the encoding of my StreamReader based on the value of the encoding attribute.

    Any ideas where I could find a list of the official encoding strings?

    The best I could find is this, but it seems to be IE specific.

    Thanks!