Why does .NET use the UTF-16 encoding for string, but UTF-8 as the default for saving files?


Solution 1

If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.

Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32!
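
To make that concrete, here is a minimal C# sketch (mine, not part of the original answer; the sample characters are arbitrary) comparing how many bytes each encoding form needs per code point, and showing that a .NET char is a fixed two-byte UTF-16 code unit:

    using System;
    using System.Text;

    class CodeUnitSizes
    {
        static void Main()
        {
            // A .NET char is one UTF-16 code unit: always 2 bytes.
            Console.WriteLine(sizeof(char)); // 2

            // Bytes needed per code point in each encoding form.
            string[] samples = { "A", "\u4E2D", "\U0001F600" }; // ASCII, CJK (BMP), emoji (outside the BMP)
            foreach (string s in samples)
            {
                Console.WriteLine("U+{0:X4}: UTF-8={1}  UTF-16={2}  UTF-32={3}",
                    char.ConvertToUtf32(s, 0),
                    Encoding.UTF8.GetByteCount(s),
                    Encoding.Unicode.GetByteCount(s),   // Encoding.Unicode is UTF-16
                    Encoding.UTF32.GetByteCount(s));
            }
            // U+0041: UTF-8=1  UTF-16=2  UTF-32=4
            // U+4E2D: UTF-8=3  UTF-16=2  UTF-32=4
            // U+1F600: UTF-8=4  UTF-16=4  UTF-32=4
        }
    }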

Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.

The two disadvantages of UTF-16 are:

  • The number of code units per Unicode character is variable, because not all characters are in the BMP. Until emoji became popular, this didn't affect many apps in day-to-day use. These days, certainly for messaging apps and the like, developers using UTF-16 really need to know about surrogate pairs.
  • For plain ASCII (which a lot of text is, at least in the West), it takes twice the space of the equivalent UTF-8 encoded text. (Both points are illustrated in the sketch below.)
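
Both points can be seen in a short C# sketch (mine, not the answerer's; a console app is assumed):

    using System;
    using System.Text;

    class Utf16Disadvantages
    {
        static void Main()
        {
            // 1. A character outside the BMP needs a surrogate pair: two chars.
            string emoji = "\U0001F600";      // U+1F600 GRINNING FACE
            Console.WriteLine(emoji.Length);  // 2 (UTF-16 code units, not "characters")

            // 2. Plain ASCII costs twice as much in UTF-16 as in UTF-8.
            string ascii = "Hello, world";
            Console.WriteLine(Encoding.UTF8.GetByteCount(ascii));    // 12
            Console.WriteLine(Encoding.Unicode.GetByteCount(ascii)); // 24
        }
    }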

(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)

Given the problems of surrogate pairs, I suspect if a language/platform were being designed from scratch with no interop requirements (but basing its text handling in Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)
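
As a hedged illustration of why "getting to the nth character" is harder than it sounds, this C# sketch (not part of the answer; the sample strings are arbitrary) shows code-unit counts, user-perceived character counts, and normalization disagreeing:

    using System;
    using System.Globalization;
    using System.Text;

    class NthCharacter
    {
        static void Main()
        {
            // "café" written two ways: precomposed é vs. 'e' + combining acute accent.
            string composed   = "caf\u00E9";
            string decomposed = "cafe\u0301";

            Console.WriteLine(composed.Length);    // 4 UTF-16 code units
            Console.WriteLine(decomposed.Length);  // 5 UTF-16 code units

            // Counting user-perceived characters (text elements) gives 4 for both.
            Console.WriteLine(new StringInfo(composed).LengthInTextElements);   // 4
            Console.WriteLine(new StringInfo(decomposed).LengthInTextElements); // 4

            // After normalizing to form C, the two strings compare equal.
            Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC) == composed); // True
        }
    }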

Solution 2

As with many "why was this chosen" questions, this was determined by history. Windows became a Unicode operating system at its core in 1993. Back then, Unicode still had a code space of only 65,536 code points, the range now called the Basic Multilingual Plane, and the obvious fixed 16-bit encoding of it was UCS-2. It wasn't until 1996 that Unicode acquired the supplementary planes, extending the code space to over a million code points, together with surrogate pairs to fit them into a 16-bit encoding, thus establishing the UTF-16 standard.

.NET strings are UTF-16 because that's an excellent fit with the operating system's encoding: no conversion is required.
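
As a rough, Windows-only illustration of that interop fit (my sketch, not part of the answer; MessageBoxW is just a convenient "W" API), a .NET string can be handed to the native UTF-16 API without any character-set conversion:

    using System;
    using System.Runtime.InteropServices;

    class NativeInterop
    {
        // The Win32 "W" APIs take UTF-16 text; with CharSet.Unicode the marshaller
        // passes the string's UTF-16 contents through without re-encoding.
        [DllImport("user32.dll", CharSet = CharSet.Unicode)]
        static extern int MessageBoxW(IntPtr hWnd, string text, string caption, uint type);

        static void Main()
        {
            MessageBoxW(IntPtr.Zero, "Hello from .NET", "UTF-16 end to end", 0);
        }
    }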

The history of UTF-8 is murkier. It definitely postdates Windows NT's commitment to 16-bit characters: the encoding was designed in late 1992, and the current specification, RFC 3629, dates from November 2003. It took a while to gain a foothold; the Internet was instrumental.

Solution 3

UTF-8 is the default for text storage and transfer because it is a relatively compact form for most languages (some languages, for example East Asian scripts, are more compact in UTF-16 than in UTF-8). Each specific language also has an even more efficient legacy encoding of its own, but none of those is universal.
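
A small C# sketch (mine, not the answerer's; the sample sentences and file name are arbitrary) illustrating the size trade-off, plus the UTF-8-without-BOM default that the question quotes for StreamWriter:

    using System;
    using System.IO;
    using System.Text;

    class StorageSizes
    {
        static void Main()
        {
            string english = "The quick brown fox jumps over the lazy dog";
            string chinese = "敏捷的棕色狐狸跳过懒狗";

            foreach (string s in new[] { english, chinese })
            {
                Console.WriteLine("UTF-8: {0,3} bytes   UTF-16: {1,3} bytes",
                    Encoding.UTF8.GetByteCount(s),
                    Encoding.Unicode.GetByteCount(s));
            }
            // The English sentence is smaller in UTF-8; the Chinese one is smaller in UTF-16.

            // A StreamWriter created without an explicit encoding writes UTF-8 with no BOM,
            // so the file size matches the UTF-8 byte count above.
            using (var writer = new StreamWriter("sample.txt"))
            {
                writer.Write(english);
            }
            Console.WriteLine(new FileInfo("sample.txt").Length); // 43
        }
    }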

UTF-16 is used for in-memory strings because it is faster to process per character and maps directly onto the Unicode character class and other lookup tables. All of the string functions in Windows use UTF-16 and have for years.
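
A minimal sketch of that per-code-unit lookup idea (mine, not from the answer): for BMP text every char is a complete code point, so its character class comes from a single table lookup on the 16-bit value:

    using System;

    class CharacterClasses
    {
        static void Main()
        {
            // Each of these chars is a whole BMP code point, so classification
            // works directly on the 16-bit code unit.
            foreach (char c in "A1 文")
            {
                Console.WriteLine("{0}: {1}", c, char.GetUnicodeCategory(c));
            }
            // A: UppercaseLetter
            // 1: DecimalDigitNumber
            //  : SpaceSeparator
            // 文: OtherLetter
        }
    }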


Comments

  • Royi Namir
    Royi Namir almost 2 years

    From here

    Essentially, string uses the UTF-16 character encoding form

    But when saving via StreamWriter:

    This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM),

    I've seen this sample (broken link removed):

    [Image: a table comparing the UTF-8 and UTF-16 byte sizes of several sample strings; the original image is no longer available.]

    And it looks like UTF-8 is smaller for some strings while UTF-16 is smaller for others.

    • So why does .NET use UTF-16 as the default encoding for string and UTF-8 for saving files?

    Thank you.

    P.S. I've already read the famous article.

  • gbjbaanb
    gbjbaanb about 11 years
    the point of UTF-8 is that, if you need 6 bytes per character to truly represent all possibilities, then anything less than UTF-32 is a problem that needs special cases and extra code. So UTF-16 and UTF-8 are both imperfect. However, as UTF-8 is half the size, you might as well use that. You gain nothing by using UTF-16 over it (except increased file/string sizes). Of course, some people will use UTF-16 and ignorantly assume it handles all characters.
  • Royi Namir
    Royi Namir about 11 years
    Can you please elaborate on "UTF-8 in the string representation, converting as we go" ? Both of them (utf8,16) has variable width...
  • Royi Namir
    Royi Namir about 11 years
    I've read it 14 times. Still I don't understand this line: "the size per code unit being constant". AFAIK the size can be 2, 3, 4 bytes (in UTF-16), so what is constant here?
  • gbjbaanb
    gbjbaanb about 11 years
    I think he means UCS-16, which is what Windows calls "Unicode" - i.e. a fixed 2-byte-per-character encoding. Back in the day, we thought this was enough to store all character encodings. We were wrong, hence UTF-8 being an internet "standard" now.
  • Jon Skeet
    Jon Skeet about 11 years
    @gbjbaanb: No, .NET uses UTF-16. So when anything outside the BMP is required, surrogate pairs are used. Each character is a UTF-16 code unit. (As far as I'm aware there's no such thing as UCS-16 either - I think you mean UCS-2.)
  • Jon Skeet
    Jon Skeet about 11 years
    @RoyiNamir: No, the size of a UTF-16 code unit is always 2 bytes. A Unicode character takes either one code unit (for the Basic Multilingual Plane) or two code units (for characters U+10000 and above).
  • Royi Namir
    Royi Namir about 11 years
    Phewww... Thanks Jon. I thought you forgot me... Now it is clear. Again, thanks a lot.
  • Royi Namir
    Royi Namir over 10 years
    But Jon, looking at the table in my question, let's say the "hello world" was saved in an XML file which was saved with UTF-8 encoding. Later, I open the file in VS and I do(!) see the XML with the "hello world" -- so, the Visual Studio editor knew(!) how to open the file. It knew how to decode the bytes on the hard drive. So --- why do I still need to declare the charset encoding tag at the top of the XML?
  • Jon Skeet
    Jon Skeet over 10 years
    @RoyiNamir: Specifically for XML, you don't need to have an encoding tag if you're using UTF-8 or UTF-16. The specification explains how the encoding can be inferred from the first few characters. For any other encoding, you must include the encoding. (Note that your question doesn't mention XML anywhere...)
  • Royi Namir
    Royi Namir over 10 years
    Jon, I'm sorry, but why are other encodings different? I mean, if my program reads a remote file as bytes, and the encoding can be inferred from the first bytes, then why mention it again in the tag? It seems weird to me that I can open a safe via a clue (the first bytes), and when I open it, I see another clue inside (the charset tag)... I must be missing something here. :-(
  • Jon Skeet
    Jon Skeet over 10 years
    @RoyiNamir: It can only be inferred between UTF-8 and UTF-16, and those are the only ones the XML specification dictates are okay to leave out. And that's just for XML - for other text files, editors are left to guess heuristically, and can get it wrong.
  • Royi Namir
    Royi Namir over 10 years
    @JonSkeet: But if they can get it wrong (and let's say they are getting it wrong) -- how would they know to "read properly" the encoding data at the <...charset=... section? (Thank you for all the answers. I didn't find ANY article which answers my latest questions here.)
  • Jon Skeet
    Jon Skeet over 10 years
    @RoyiNamir: To be honest, this should all be as a new question, given that it's XML-specific. But I believe there's an assumption that the other encodings will be compatible with ASCII, but I suspect that XML parsers which support non-ASCII-compatible encodings can try those as well. See w3.org/TR/xml/#charencoding for more information.
  • Guy
    Guy over 8 years
    @JonSkeet: What am I missing? As far as I know, UTF-16 can be 2 or 4 bytes. All internet resources show the same thing: Wiki or Unicode.org
  • Jon Skeet
    Jon Skeet over 8 years
    @gMorphus: A UTF-16 code unit is always 2 bytes. A Unicode code point is represented by one or two UTF-16 code units. I'm not sure which part of either the answer or the comments you're disagreeing with.
  • Fernando Pelliccioni
    Fernando Pelliccioni almost 8 years
    @JonSkeet "It could be up to 6 bytes". In the Thompson-Pike UTF-8 proposal (Ken Thompson and Rob Pike) the posible range of characters was [0, 7FFFFFFF], requiring up to 6-bytes (or octets: 8-bit bytes). In 2003, the range of characters was restricted to [0, 10FFFF] (the UTF-16 accessible range). See: tools.ietf.org/html/rfc3629 So, all characters are encoded using sequences of 1 to 4 octets. Not 6.
  • Fernando Pelliccioni
    Fernando Pelliccioni almost 8 years
    @JonSkeet, RoyiNamir said: "(utf8,16) has variable width". I understand he means that UTF-16 is not a variable-width encoding, and he is right. You answered: "the size of a UTF-16 code unit is always 2 bytes". And..., the size of a UTF-8 code unit is always 1 byte.
  • Jon Skeet
    Jon Skeet almost 8 years
    @FernandoPelliccioni: How do you define "variable-width encoding" precisely? Having just reread definitions, I agree I was confused about the precise meaning of "code unit" but both UTF-8 and UTF-16 are variable width in terms of "they can take a variable number of bytes to represent a single Unicode code point". For UTF-8 it's 1-4 bytes, for UTF-16 it's 2 or 4. Will check over the rest of my answer for precision now.
  • Jon Skeet
    Jon Skeet almost 8 years
    @FernandoPelliccioni: I've fixed the "up to 6 bytes part" btw.
  • Jon Skeet
    Jon Skeet almost 8 years
    @FernandoPelliccioni: Thanks for the prod to revisit this, btw - and always nice to get more precise about terms
  • Fernando Pelliccioni
    Fernando Pelliccioni almost 8 years
    Thank you @JonSkeet. You're always so kind to help others with your knowledge. Here are some references I consider a good read. (Beyond the standards) utf8everywhere.org programmers.stackexchange.com/questions/102205/…
  • Fernando Pelliccioni
    Fernando Pelliccioni almost 8 years
    @JonSkeet Here another reference of the meaning of "variable-width encoding" by the Unicode guys. unicode.org/versions/Unicode8.0.0/UnicodeStandard-8.0.pdf (see pages [36-39])
  • Jon Skeet
    Jon Skeet almost 8 years
    @FernandoPelliccioni: Right, that concurs that UTF-16 is variable-width. "The distinction between characters represented with one versus two 16-bit code units means that formally UTF-16 is a variable-width encoding form."
  • Nyerguds
    Nyerguds about 3 years
    Does .NET actually use UTF-16 internally though? From what I've seen, Char is just a 16-bit struct, and String is just an array of Char. There's no variable width; it's a plain dump of Unicode code points that can't go above 0xFFFF.
  • Jon Skeet
    Jon Skeet about 3 years
    @Nyerguds: char is a 16-bit struct, yes. But it uses surrogate pairs, so the number of Unicode code points in a string is not just the number of chars, unless you count each half as its own code point.
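
To illustrate the distinction drawn in the last few comments, here is a small C# sketch (not from the thread; the sample string is arbitrary) that counts UTF-16 code units versus Unicode code points:

    using System;

    class CodeUnitsVsCodePoints
    {
        static void Main()
        {
            string s = "a\U0001F600b";   // 'a', U+1F600 (a surrogate pair), 'b'

            Console.WriteLine(s.Length); // 4 chars, i.e. 4 UTF-16 code units

            // Count code points by walking the code units and treating each
            // surrogate pair as a single code point.
            int codePoints = 0;
            for (int i = 0; i < s.Length; i++)
            {
                codePoints++;
                if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
                {
                    Console.WriteLine("U+{0:X}", char.ConvertToUtf32(s[i], s[i + 1])); // U+1F600
                    i++; // skip the low surrogate; the pair encodes one code point
                }
            }
            Console.WriteLine(codePoints); // 3
        }
    }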