Size of char type in C#

Solution 1

A char is Unicode in C#, therefore the number of possible characters exceeds 255. So you'll need two bytes.

Extended ASCII, for example, has a 256-character set and can therefore be stored in a single byte. That's also the whole purpose of the System.Text.Encoding namespace: different systems can have different charsets and character sizes. C# can therefore handle characters of one, four, etc. bytes when encoding, but UTF-16 Unicode is the default.
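
A minimal sketch (not part of the original answer) of what this looks like in practice: sizeof(char) reports 2 bytes, and the same text comes out at different byte counts under different System.Text.Encoding classes. The sample string and printed values are only illustrative.

    using System;
    using System.Text;

    class CharSizeDemo
    {
        static void Main()
        {
            // A C# char is a 16-bit UTF-16 code unit.
            Console.WriteLine(sizeof(char));                         // 2

            // The same text occupies a different number of bytes per encoding.
            string text = "Héllo";
            Console.WriteLine(Encoding.UTF8.GetByteCount(text));     // 6  (é needs 2 bytes)
            Console.WriteLine(Encoding.Unicode.GetByteCount(text));  // 10 (UTF-16: 2 bytes per code unit)
            Console.WriteLine(Encoding.UTF32.GetByteCount(text));    // 20 (4 bytes per code unit)
        }
    }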

Solution 2

I'm guessing that by “other programming languages” you mean C. C actually has two different char types: char and wchar_t. char is always one byte long; wchar_t not necessarily.

In C# (and .NET, for that matter), all character strings are encoded as Unicode in UTF-16. That's why a char in .NET represents a single UTF-16 code unit, which may be a whole code point or half of a surrogate pair (and thus not actually a character).
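
A hedged sketch of that last point (a simple console program, not from the original answer): a code point outside the Basic Multilingual Plane, here U+1F402, occupies two char values, i.e. a surrogate pair.

    using System;

    class SurrogateDemo
    {
        static void Main()
        {
            string ox = "🐂";                                 // U+1F402, outside the BMP
            Console.WriteLine(ox.Length);                     // 2: two UTF-16 code units
            Console.WriteLine(char.IsHighSurrogate(ox[0]));   // True
            Console.WriteLine(char.IsLowSurrogate(ox[1]));    // True

            // The two code units combine back into a single code point.
            int codePoint = char.ConvertToUtf32(ox[0], ox[1]);
            Console.WriteLine(codePoint.ToString("X"));       // 1F402
        }
    }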

Solution 3

Actually the size of char in C#, or more accurately in the CLR, is consistent with most other managed languages. Managed languages, like Java, tend to be newer and have features like Unicode support built in from the ground up. The natural extension of supporting Unicode strings is to have Unicode chars.

Older languages like C/C++ started with ASCII only and added Unicode support later.

Solution 4

Because strings in .NET are encoded as 2-byte Unicode characters (UTF-16 code units).
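
As a hedged illustration of this answer (the input string is arbitrary), Encoding.Unicode, i.e. UTF-16, emits 2 bytes per char for BMP text:

    using System;
    using System.Text;

    class Utf16Layout
    {
        static void Main()
        {
            byte[] bytes = Encoding.Unicode.GetBytes("AB");
            Console.WriteLine(bytes.Length);                  // 4
            Console.WriteLine(BitConverter.ToString(bytes));  // 41-00-42-00 (little-endian UTF-16)
        }
    }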

Solution 5

Because a character in a C# string defaults to the UTF-16 encoding of Unicode, in which each code unit is 2 bytes.

Author: Manish Basantani

You can find me, and more details about me and the work I do, at my blog: http://www.devdumps.com/about-me/ "Any fool can write code that a computer can understand. Good programmers write code that humans can understand." ~ Martin Fowler

Updated on October 09, 2020

Comments

  • Manish Basantani over 3 years
    Just wondering why we have a char type of 2 bytes in size in C# (.NET), unlike 1 byte in other programming languages?

  • Rich over 14 years
    (a) Strings are sequences of characters. (b) There are no 2-byte Unicode characters. You may be looking for the terms code unit and code point. And with the latter, there are no 16-bit ones; code points need up to 21 bits.
  • Dawid Ohia over 14 years
    So what is the relation between a C# character and a Unicode code point?
  • Rich almost 13 years
    A C# character is a UTF-16 code unit, which may describe one Unicode code point or be half of a surrogate pair.
  • Rich about 11 years
    With Unicode being a 21-bit code, it's a bit of a stretch to say that that's why you need two bytes.
  • Cemafor almost 11 years
    The characters are represented using UTF-16, which means each character uses at least 16 bits or 2 bytes (even ASCII characters, which only require 7 bits). If the Unicode value is large enough, a single character that would print to the screen will actually require two chars.
  • Viktor Vix Jančík about 9 years
    The first sentence in this answer ignores the existence of variable-width characters.
  • hippietrail about 9 years
    I would vote this up for the first, second, and final paragraphs; but I would vote it down for the third paragraph. It's still better than the other answers, including the top/accepted answer, though. P.S. you have a typo: "Now it you have a fixed character width".
  • Viktor Vix Jančík about 9 years
    @hippietrail I'm curious what it is about the 3rd paragraph that you believe is incorrect. Can you get a character at a specific location in a variable-width char string in better than O(n)?
  • hippietrail about 9 years
    Because it argues that the reason was to have fixed-length encoding in common scenarios. The legitimate scenarios for treating text as fixed length are few. The common ones are only toy, ignorant, and short-sighted ones that inevitably lead to bugs. Not only did the C# developers know this, but C# development was only initiated three years after Unicode moved beyond 16 bits, and Microsoft was a key member of the Unicode consortium all along. Now these were surely factors in the decision for UCS-2 for Java and Windows NT, but for C# the reasons can only have been legacy and momentum.
  • Viktor Vix Jančík about 9 years
    @hippietrail I'm not sure requiring an O(n) charAt() and other functions can be considered 'toy' or 'short-sighted'. At any rate, this is the reason I got from compiler writers themselves (not C#, though), but the same algorithmic limitations would apply.
  • hippietrail about 9 years
    I'm constantly running into bugs due to experienced programmers making this assumption. That's how I came to this question yesterday. Now I would love to hear from the C# devs themselves, and I would upvote whatever their reasons are. But guessing at their reasons is something we can both do, and as you see we can guess differently, which makes our answers subjective. My speculation is their reasoning was "Java does it. Windows APIs use it. Type names are kinda set. We should just stick with that." In the end it has not left us with "the simplest solution" but with "a simplistic solution".
  • Viktor Vix Jančík about 9 years
    I'm not really making an assumption; this is coming off an informed discussion. At any rate, if you have an O(1) charAt() that can work with variable-width chars and doesn't require further memory, then please share. Java 9 may actually allow alternate 1-byte chars if the JVM believes all chars are 1 byte. This feature is being discussed right now, but has also been discussed in previous iterations.
  • hippietrail about 9 years
    I've gone ahead and asked a question on Quora, since it probably wouldn't be allowed here and I can never figure out what is and is not allowed on programmers.SE. Let's see if it gets anything objective: quora.com/…
  • hippietrail about 9 years
    In any case the point is not whether I can do charAt in O(1) on variable-width characters; it's whether C#/.NET/CLR can know in advance whether a string passed to charAt will use variable-width characters or not. The options are: 1) the function will be broken for non-BMP text, 2) provide one charAt for BMP-only strings and another for non-BMP strings, 3) scan the string first to see if there's any non-BMP character, or 4) provide a single function that works for both and makes no assumption. I'm neither a C# nor a Java guy, so I'm not sure which they do, but I am a Unicode guy.
  • Viktor Vix Jančík about 9 years
    @hippietrail I'd be very interested in anything you find on the topic. From my research, variable-width chars were eliminated right off the bat in Java for performance. C# would have had to deal with the same limitations. There are also issues with supporting multiple char widths, serialization issues, etc.
  • hippietrail about 9 years
    I actually found an interview with somebody on the team who says that C# strings are not C strings but BSTR strings, with the string length in a prefix. I didn't know that, but perhaps you did. The reason was that it inherited this from Visual Basic! It was also related to COM. He makes the same argument as you that this is a good thing. He says C# ignores that strings are UTF-16 and treats them as UCS-2, but I'm almost positive C#/.NET provides a whole suite of string functions that do indeed know about UTF-16, as well as older ones that don't: blog.coverity.com/2014/04/09/why-utf-16
  • Viktor Vix Jančík about 9 years
    Well, Eric is as authoritative as they get, so great find. The BSTR part basically adds the length at the front of the array. You are correct that some functions treat C# strings as UTF-16, but as Eric alludes to, they will break in rare situations. Eric frames it as historical, but I am sure that if there were a better space-to-performance trade-off available today other than fixed-width 16-bit, he and the Java team would have taken it.
  • Ankush Jain over 5 years
    If the size is 2 bytes or 16 bits, then it can only hold characters whose decimal code point is less than 2^16 = 65536. What if I want to store some character with a code point greater than this value, e.g. emojis?
  • Parag Meshram over 3 years
    @AnkushJain ibm.com/support/knowledgecenter/SSEPEK_11.0.0/char/src/tpc/… might give you some clarity. UTF-16 means each character takes not exactly 2 bytes but at least 2 bytes; one character may take more than 2 bytes.
  • Paul Childs about 3 years
    In C, char is always 1 byte long; it is just that that byte might not necessarily be 8 bits. And it is not just C: Basic, Fortran, Pascal, Cobol, C++, Objective-C, OCaml, Clojure and, though of a lesser "programming" nature, a variety of SQLs. Kind of understandable for the OP to treat it as the norm.
  • Dwayne Robinson over 2 years
    "Managed languages, like Java, tend to be newer" - Java 1.0 was 1996. C# was 2001. "C# ... size of char is consistent with most other managed languages." Java's char is also UTF-16.
  • Tom over 2 years
    I was initially confused because I thought a char represents the actual display character, but in some cases this is not true, which is why "🐂".Length = 2 even though it has only one display character. So it is better to think of char as a UTF-16 code unit rather than as a character of the Unicode character set (more examples can be found here: docs.microsoft.com/en-us/dotnet/standard/base-types/…). More motivation here: xoofx.com/blog/2017/02/06/stark-tokens-specs-and-the-tokenizer/… See also the sketch below.
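
A minimal sketch (not from the original thread) of the distinction Tom describes, assuming the System.Globalization APIs: string.Length counts UTF-16 code units (char values), while StringInfo counts and enumerates text elements, which are closer to what is actually displayed.

    using System;
    using System.Globalization;
    using System.Text;

    class TextElementDemo
    {
        static void Main()
        {
            string ox = "🐂";
            Console.WriteLine(ox.Length);                                // 2 chars (UTF-16 code units)
            Console.WriteLine(Encoding.Unicode.GetByteCount(ox));        // 4 bytes in UTF-16
            Console.WriteLine(new StringInfo(ox).LengthInTextElements);  // 1 displayed character

            // Walk a mixed string one text element at a time.
            var e = StringInfo.GetTextElementEnumerator("a🐂b");
            while (e.MoveNext())
                Console.WriteLine(e.GetTextElement());                   // "a", "🐂", "b"
        }
    }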