How do i get the decimal value of a unicode character in C#?
Solution 1
It's basically the same as Java. If you've got it as a char, you can just convert to int implicitly:
char c = '\u0b85';
// Implicit conversion: char is basically a 16-bit unsigned integer
int x = c;
Console.WriteLine(x); // Prints 2949
If you've got it as part of a string, just get that single character first:
string text = GetText();
int x = text[2]; // Or whatever...
Note that characters not in the basic multilingual plane will be represented as two UTF-16 code units. There is support in .NET for finding the full Unicode code point, but it's not simple.
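As a rough sketch of the "full Unicode code point" case, the following walks a string one code point at a time using char.ConvertToUtf32(string, int) and char.IsHighSurrogate, both of which exist in .NET. The class name CodePointDemo and the method name CodePoints are made up for this illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, named for this sketch only.
static class CodePointDemo
{
    // Yields one int per Unicode code point, combining surrogate pairs.
    public static IEnumerable<int> CodePoints(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            // Reads the code point that starts at index i.
            // Throws ArgumentException on an unpaired surrogate.
            int codePoint = char.ConvertToUtf32(text, i);
            yield return codePoint;

            // A surrogate pair occupies two chars, so skip the low surrogate.
            if (char.IsHighSurrogate(text[i]))
            {
                i++;
            }
        }
    }

    static void Main()
    {
        // U+0B85 is in the BMP; U+13000 needs a surrogate pair.
        foreach (int cp in CodePoints("\u0b85\U00013000"))
        {
            Console.WriteLine("{0} (U+{0:X4})", cp);
        }
        // Prints:
        // 2949 (U+0B85)
        // 77824 (U+13000)
    }
}
```

For strings that contain only BMP characters this gives the same numbers as indexing the string and converting each char to int.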
Solution 2
((int)'அ').ToString()
If you have the character as a char, you can cast that to an int, which will represent the character's numeric value. You can then print that out in any way you like, just like with any other integer.
If you wanted hexadecimal output instead, you can use:
((int)'அ').ToString("X4")
X is for hexadecimal, 4 is for zero-padding to four characters.
Solution 3
How do i get the numeric value of a unicode character in C#?
A char is not necessarily the whole Unicode code point. In UTF-16 encoded languages such as C#, you may actually need 2 chars to represent a single "logical" character. And your string lengths might not be what you expect - the MSDN documentation for String.Length Property says:
"The Length property returns the number of Char objects in this instance, not the number of Unicode characters."
- So, if your Unicode character is encoded in just one char, it is already numeric (essentially an unsigned 16-bit integer). You may want to cast it to some of the integer types, but this won't change the actual bits that were originally present in the char.
- If your Unicode character is 2 chars, you'll need to multiply one by 2^16 and add it to the other, resulting in a uint numeric value:
char c1 = ...;
char c2 = ...;
uint c = ((uint)c1 << 16) | c2;
Note that this packs the two UTF-16 code units into a single 32-bit value. It is not the Unicode code point of the pair, which char.ConvertToUtf32 computes.
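To make the difference between the packed code units and the actual code point concrete, here is a minimal sketch of the standard UTF-16 surrogate arithmetic, checked against the built-in char.ConvertToUtf32(char, char). The class SurrogateMath and its Combine method are names invented for this example.

```csharp
using System;

// Hypothetical helper class, named for this sketch only.
static class SurrogateMath
{
    // Combines a high/low surrogate pair into a Unicode code point
    // using the UTF-16 decoding formula.
    public static int Combine(char high, char low)
    {
        if (!char.IsSurrogatePair(high, low))
        {
            throw new ArgumentException("Not a valid surrogate pair.");
        }
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    static void Main()
    {
        string s = "\U00013000"; // stored as the two chars D80C DC00

        int manual = Combine(s[0], s[1]);
        int builtIn = char.ConvertToUtf32(s[0], s[1]);
        Console.WriteLine(manual);  // 77824
        Console.WriteLine(builtIn); // 77824

        // Packing the two code units side by side gives a different number.
        uint packed = ((uint)s[0] << 16) | s[1];
        Console.WriteLine(packed == (uint)manual); // False
    }
}
```

In practice the built-in method is probably the better choice; the manual version is only there to show what it does.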
How do i get the decimal value of a unicode character in C#?
When you say "decimal", this usually means a character string containing only characters that a human being would interpret as decimal digits.
- If you can represent your Unicode character by only one char, you can convert it to decimal string simply by:
char c = 'அ';
string s = ((ushort)c).ToString();
- If you have 2 chars for your Unicode character, convert them to a uint as described above, then call uint.ToString.
--- EDIT ---
AFAIK diacritical marks are considered separate "characters" (and separate code points) despite being visually rendered together with the "base" character. Each of these code points taken alone is still at most 2 UTF-16 code units.
BTW I think the proper name for what you are talking about is not "character" but "combining character sequence". So yes, a single combining character sequence can have more than 1 code point and therefore more than 2 code units. If you want a decimal representation of such a combining character sequence, you can probably do it most easily through BigInteger:
string c = "\x0072\x0338\x0327\x0316\x0317\x0300\x0301\x0302\x0308\x0360";
string s = (new BigInteger(Encoding.Unicode.GetBytes(c))).ToString();
Depending on what order of significance of the code unit "digits" you wish, you may want to reverse c.
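If a single big number is not what you need, a possible alternative is to list each code unit separately and ask .NET how many user-perceived characters the string contains. The sketch below uses System.Globalization.StringInfo for that; the class TextElementDemo and the method TextElementCount are illustrative names only.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, named for this sketch only.
static class TextElementDemo
{
    // Number of text elements (base character plus its combining marks).
    public static int TextElementCount(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }

    static void Main()
    {
        // Latin small letter r followed by nine combining marks.
        string c = "\u0072\u0338\u0327\u0316\u0317\u0300\u0301\u0302\u0308\u0360";

        Console.WriteLine(c.Length);            // 10 UTF-16 code units
        Console.WriteLine(TextElementCount(c)); // text elements in the string

        // Each code unit still has its own decimal value.
        foreach (char unit in c)
        {
            Console.WriteLine("U+{0:X4} = {1}", (int)unit, (int)unit);
        }
    }
}
```

Since every code unit here is a BMP code point, the per-unit values are also the code point values.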
Solution 4
char c = 'அ';
short code = (short)c;
ushort code2 = (ushort)c;
Solution 5
This is an example of using Plane 1, the Supplementary Multilingual Plane (SMP):
string single_character = "\U00013000"; //first Egyptian ancient hieroglyph in hex
//it is encoded as 4 bytes (instead of 2)
//get the Unicode index using UTF32 (4 bytes fixed encoding)
Encoding enc = new UTF32Encoding(false, true, true);
byte[] b = enc.GetBytes(single_character);
Int32 code = BitConverter.ToInt32(b, 0); //in decimal
mistertodd
Updated on July 05, 2022
Comments
-
mistertodd almost 2 years
How do i get the numeric value of a unicode character in C#?
For example, if the Tamil character அ (U+0B85) is given, the output should be 2949 (i.e. 0x0B85)
See also
- C++: How to get decimal value of a unicode character in c++
- Java: How can I get a Unicode character's code?
Multi code-point characters
Some characters require multiple code points. In these examples, in UTF-16, each code unit is still in the Basic Multilingual Plane:
- (i.e. U+0072 U+0327 U+030C)
- (i.e. U+0072 U+0338 U+0327 U+0316 U+0317 U+0300 U+0301 U+0302 U+0308 U+0360)
The larger point being that one "character" can require more than 1 UTF-16 code unit, it can require more than 2 UTF-16 code units, it can require more than 3 UTF-16 code units.
The larger point being that one "character" can require dozens of unicode code points. In UTF-16 in C# that means more than 1 char. One character can require 17 char.
My question was about converting char into a UTF-16 encoding value. Even if an entire string of 17 char only represents one "character", i still want to know how to convert each UTF-16 unit into a numeric value.
e.g.
String s = "அ"; int i = Unicode(s[0]);
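The Unicode(...) call above is pseudocode. A minimal sketch of the same idea, assuming all that is wanted is the numeric value of every UTF-16 code unit, could look like this. The class CodeUnitDemo and the method CodeUnits are hypothetical names chosen for the example.

```csharp
using System;
using System.Linq;

// Hypothetical helper class, named for this sketch only.
static class CodeUnitDemo
{
    // Returns the numeric value of every UTF-16 code unit in the string.
    public static int[] CodeUnits(string text)
    {
        return text.Select(c => (int)c).ToArray();
    }

    static void Main()
    {
        // r + combining cedilla + combining caron: one visible character,
        // three UTF-16 code units.
        string s = "\u0072\u0327\u030C";
        foreach (int unit in CodeUnits(s))
        {
            Console.WriteLine("{0} (U+{0:X4})", unit);
        }
        // Prints:
        // 114 (U+0072)
        // 807 (U+0327)
        // 780 (U+030C)
    }
}
```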
-
mistertodd over 12 years
Even characters in the BMP, like Ä (A + ¨) are represented as two UTF-16 code units. But the point is taken: hard-cast a char to numeric
-
Jon Skeet over 12 years
@IanBoyd: Well various characters can be represented using composition, but can also be represented as single UTF-16 code units. By definition if it's in the BMP it has a code point less than 64K, so can be represented as a UTF-16 code unit :)
-
mistertodd over 12 years
A character is not limited to 2 code points. For example, the character A̖͇͉͉͉᷿̿᷾︡︠ͯ҄͟͟ is made up of 13 code points (The latin capital letter A along with a bunch of diacritic marks). (display support depends on browser). But what i want, and what i can get, still mesh with (int)MyString[i], each code point has a decimal value that corresponds to a U+xxxx.
-
Branko Dimitrijevic over 12 years
@IanBoyd I think you are confusing code "point" with code "unit". Code point represents a "logical" character (the current Unicode has 1,114,112 of them) and is not specific to any particular encoding. On the other hand, a code unit is specific to encoding. AFAIK, a code unit in UTF-16 can appear either alone or in a surrogate pair, certainly not in an array of 13 code units. Are you talking about some encoding other than UTF-16?
-
Serge Wautier over 12 years
@Jon: Do you have any pointer regarding surrogate pairs identification? TIA.
-
Jon Skeet over 12 years
@Serge-appTranslator: Have a look at char.ConvertToUtf32(string, int), char.IsLowSurrogate etc.
-
Serge Wautier over 12 years
Oops! Sorry: Google, in addition to being your employer, is my friend: Char.IsHighSurrogate(ch), Char.IsLowSurrogate(ch), Char.IsSurrogatePair()
-
mistertodd over 12 years
i was talking about a character made up of more than two code points (which in UTF-16 is more than two code units). e.g. small latin r with caron and cedilla (U+0072 U+0327 U+030C) is a single character. You can have even more complicated characters, made up of 13 UTF-16 code units. Updated question with picture of such a character.
-
Branko Dimitrijevic over 12 years
@IanBoyd Please see the --- EDIT --- in my answer.