How many characters can be mapped with Unicode?

Solution 1

The question asks for the count of all the possible valid combinations in Unicode, with an explanation.

1,111,998: 17 planes × 65,536 code points per plane - 2,048 surrogates - 66 noncharacters
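
That arithmetic is easy to verify; here is a quick sanity check in plain Python:

    planes = 17
    total = planes * 65536          # 1,114,112 code points in all
    surrogates = 2048               # U+D800..U+DFFF, reserved for UTF-16
    noncharacters = 66              # see Solution 3 for the breakdown
    print(total - surrogates - noncharacters)   # 1111998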

Note that UTF-8 and UTF-32 could theoretically encode much more than 17 planes, but the range is restricted based on the limitations of the UTF-16 encoding.

137,929 code points are actually assigned in Unicode 12.1.

The question also asks why continuation bytes have restrictions even though the starting byte already makes clear how long the sequence should be.

The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.

For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encoding of the digits 0 and 8. So if you have a string-searching function not designed for this encoding-specific quirk, then a search for the digit 8 will find a false positive within the letter ß.

In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
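
Both points are easy to check with Python's built-in gb18030 and utf-8 codecs:

    # The letter ß (U+00DF) in GB 18030 embeds the bytes of the digits '0' and '8'.
    eszett = "ß".encode("gb18030")
    print(eszett.hex())                  # 81308938
    print(b"8" in eszett)                # True: a naive byte search finds a false positive

    # In UTF-8, lead bytes and trail bytes never overlap, so the same search is safe.
    print(b"8" in "ß".encode("utf-8"))   # False: the bytes are c3 9f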

Solution 2

Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters. At present, only about 10% of this space has been allocated.

The precise details of how these code points are encoded differ with the encoding, but your question makes it sound like you are thinking of UTF-8. The reason for the restrictions on continuation bytes is presumably to make it easy to find the beginning of the next character: continuation bytes always have the form 10xxxxxx, and a starting byte never does.
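
A minimal sketch of that scan (next_char_start is a hypothetical helper, not a standard function):

    def next_char_start(data: bytes, pos: int) -> int:
        """Skip continuation bytes (10xxxxxx) until the next character boundary."""
        while pos < len(data) and (data[pos] & 0xC0) == 0x80:
            pos += 1
        return pos

    text = "aß€".encode("utf-8")      # bytes: 61 | c3 9f | e2 82 ac
    print(next_char_start(text, 2))  # 3: from inside 'ß' to the start of '€'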

Solution 3

Unicode supports 1,114,112 code points. There are 2,048 surrogate code points, giving 1,112,064 scalar values. Of these, 66 are noncharacters, leaving 1,111,998 possible encoded characters (unless I made a calculation error).
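
The same numbers fall out of enumerating the excluded code points directly:

    surrogates = range(0xD800, 0xE000)           # 2,048 code points
    noncharacters = set(range(0xFDD0, 0xFDF0))   # 32 noncharacters, U+FDD0..U+FDEF
    noncharacters |= {plane * 0x10000 + low      # plus U+nFFFE and U+nFFFF per plane
                      for plane in range(17) for low in (0xFFFE, 0xFFFF)}
    print(len(surrogates), len(noncharacters))              # 2048 66
    print(0x110000 - len(surrogates) - len(noncharacters))  # 1111998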

Solution 4

Unicode has a code space of hexadecimal 110000 code points, which is 1,114,112 in decimal.

Solution 5

To give a metaphorically accurate answer, all of them.

Continuation bytes in the UTF-8 encoding allow for resynchronization of the encoded octet stream in the face of "line noise". The decoder need merely scan forward for a byte that does not have a value between 0x80 and 0xBF to know that it has found the start of a new code point.
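
A small illustration of that recovery, assuming CPython's standard replacement-decoding behaviour:

    data = "a€b".encode("utf-8")     # bytes: 61 | e2 82 ac | 62
    corrupted = data[:2] + data[3:]  # drop one byte of '€' to simulate line noise
    # The decoder resynchronizes at the first byte outside 0x80..0xBF ('b' = 0x62).
    print(corrupted.decode("utf-8", errors="replace"))  # a�b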

In theory, the original UTF-8 encoding allowed for the expression of characters whose Unicode code point number is up to 31 bits in length. In practice, such an encoding has actually been implemented on services like Twitter, where a maximal-length tweet can encode up to 4,340 bits' worth of data (140 characters, valid and invalid, times 31 bits each).

Author: Ufuk Hacıoğulları

Updated on November 11, 2020

Comments

  • Ufuk Hacıoğulları
    Ufuk Hacıoğulları over 3 years

    I am asking for the count of all the possible valid combinations in Unicode, with explanation. I know a char can be encoded as 1, 2, 3 or 4 bytes. I also don't understand why continuation bytes have restrictions even though the starting byte of that char already makes clear how long it should be.

  • Ufuk Hacıoğulları
    Ufuk Hacıoğulları almost 13 years
    According to these "planes", even the last three bytes of a 4-byte char could express 64 of them. Am I wrong?
  • ninjalj
    ninjalj almost 13 years
    Yes, that is for synchronization, see cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
  • Ufuk Hacıoğulları
    Ufuk Hacıoğulları almost 13 years
    That's outdated, I think. It doesn't use 6 bytes anymore.
  • Andy Finkenstadt
    Andy Finkenstadt almost 13 years
    Outdated or not, Twitter implemented 31-bit internal character points, and correctly expresses them in UTF-8 when transferred across the wire.
  • tchrist
    tchrist almost 13 years
    Actually, in theory it is not limited to 31 bits; you can go bigger on a 64-bit machine. perl -le 'print ord "\x{1FFF_FFFF_FFFF}"' prints out 35184372088831 on a 64-bit machine, but gives integer overflow on a 32-bit machine. You can use bigger chars like that inside your perl program, but if you try to print them out as utf8, you get a mandatory warning unless you disable such: perl -le 'print "\x{1FFF_FFFF}"' Code point 0x1FFFFFFF is not Unicode, may not be portable at -e line 1. ######. There is a difference between "loose utf8" and "strict UTF-8": the former is not restricted.
  • tchrist
    tchrist almost 13 years
    @Andy: That makes sense: the original spec for UTF-8 worked for bigger numbers. The 21-bit limit was a sop to the folks who had locked themselves into 16-bit characters, and thus did UCS-2 beget the abomination known as UTF-16.
  • tchrist
    tchrist almost 13 years
    @Simon: There are 34 noncharacter code points, anything that when bitwise-ANDed with 0xFFFE == 0xFFFE, so two such code points per plane. Also, there are 32 noncharacter code points in the range 0x00_FDD0 .. 0x00_FDEF. Plus you should subtract from that the surrogates, which are not legal for open interchange due to the UTF-16 flaw, but must be supported inside your program.
  • tchrist
    tchrist almost 13 years
    @Ufuk: Unicode doesn't have characters. It has code points. Sometimes it requires multiple code points to make up one character. For example, the character "5̃" is two code points, whereas the character "ñ" may be one or two code points (or more!). There are 2²¹ possible code points, but some of those are reserved as non-characters or partial characters.
  • Philipp
    Philipp almost 13 years
    Unicode is a character encoding standard. First answer from unicode.org/faq/basic_q.html: “Unicode is the universal character encoding,” so saying that “Unicode is not an encoding” is wrong. (I once made that mistake myself.)
  • Philipp
    Philipp almost 13 years
    @tchrist: The Unicode standard defines multiple terms, among them “abstract character” and “encoded character.” So saying that Unicode doesn’t have characters is also not true.
  • Philipp
    Philipp almost 13 years
    Unicode doesn't support 1,114,112 characters because it has fewer than 1,114,112 scalar values. Only scalar values can be mapped to characters, not all code points.
  • Philipp
    Philipp almost 13 years
    @Ufuk: When you talk about Unicode, only the definition in the Unicode standard counts, and they say that each Unicode encoding scheme can encode exactly the Unicode scalar values—not more, not less.
  • Philipp
    Philipp almost 13 years
    The encodings used today don't allow for 31-bit scalar values. UTF-32 would allow for 32-bit values, UTF-8 for even more, but UTF-16 (used internally by Windows, OS X, Java, .NET, Python, and therefore the most popular encoding scheme) allows for just over one million (which should still be enough).
  • tchrist
    tchrist almost 13 years
    @Philip: You’re wrong about some of those. Python uses UCS-2 or, with a wide build, UCS-4; it doesn’t use UTF-16. OS X’s BSD core uses the normal Unix API, which allows it therefore to use UTF-8 for HFS+, not UTF-16. And I just demonstrated that Perl allows far more than the bits you said.
  • dan04
    dan04 almost 13 years
    "All of them" isn't quite accurate; there are characters in legacy encodings that aren't in Unicode. For example, the Apple logo in MacRoman, and a couple of the graphics characters in ATASCII. OTOH, there's a Private Use Area, so these characters can be mapped with Unicode; they're just not part of the standard.
  • tchrist
    tchrist almost 13 years
    @Dan04: Yes, and I mentioned them in my comment above.
  • Ufuk Hacıoğulları
    Ufuk Hacıoğulları almost 13 years
    Can you look at my answer? Why are there 1,112,114 code points?
  • Philipp
    Philipp almost 13 years
    This number comes from the number of planes that is addressable using the UTF-16 surrogate system. You have 1024 low surrogates and 1024 high surrogates, giving 1024² non-BMP code points. This plus the 65,536 BMP code points gives exactly 1,114,112 (see the sketch after these comments).
  • Philipp
    Philipp almost 13 years
    @tchrist: Python 3 does use UTF-16; for example, on my system I can say len(chr(0x10000)), giving 2 (code units). OS X's kernel uses UTF-8, correct—but the high-level APIs (Cocoa etc.) use UTF-16.
  • tchrist
    tchrist almost 13 years
    @Philip: I only use Python 2, whose Unicode support leaves a lot to be desired. I’m a systems guy, so I don’t do end-user chrome-plating: all the syscalls I use on OS X take UTF-8, which the kernel converts into NFC for you. My UTF-16 experiences in Java have been bad: try a regex bracketed charclass match with some literal non-BMP code points in there, like [𝒜-𝒵], and you’ll see why I find exposing UTF-16 to be a botch. It’s a mistake to make programmers think in encoding forms instead of in logical characters.
  • tchrist
    tchrist almost 13 years
    @Philip: It is inherently evil and wrong, plus just plain stupid, that the length of any single code point should ever be other than one character long. If Python does that, then it is screwed up. In Perl, there is no legal input to the chr function that length can ever return other than 1 for. perl -le 'printf "codepoint %06X is length %d\n", $_, length chr for 1, 0x100, 0x1000, 0x01_FFFF, 0x10_000_000' shows that those are all of them one character long. This is what’s called abstraction, and it is the normal level that characters should be dealt with. Anything else is broken.
  • dan04
    dan04 almost 13 years
    @tchrist: I agree. It would have been better if Unicode had been designed for 1 million+ characters from the start so we would never have had to deal with the ugly hack that is UTF-16.
  • Pacerier
    Pacerier over 12 years
    The "self-synchronizing" article you linked doesn't explain what's self-synchronizing at all
  • Shawn Kovac
    Shawn Kovac over 7 years
    @Philipp, but you give '1_112_114' in your answer, but you explain '1_114_112' in your comment. Perhaps you mixed up the 2 and 4.
  • Ray Toal
    Ray Toal about 6 years
    This answer has been sitting around with the calculation errors for years now, so I took the liberty to clean it up. Yes, the value 1112114 in the answer was a typo. The correct value is 1114112, which is the decimal value of 0x110000.
  • santiago arizti
    santiago arizti about 6 years
    just as an interesting note, UTF-8 only needs 4 bytes to map all Unicode characters, but UTF-8 could support up to 68 billion characters if it were ever required, taking up to 7 bytes per character (checked in the sketch below).
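
Two of the calculations in the comments above can be checked the same way; the 7-byte figure assumes the hypothetical extension pattern of one lead byte plus six 6-bit continuation bytes:

    # Philipp's surrogate arithmetic: 1024 × 1024 surrogate pairs address the 16
    # supplementary planes, plus the 65,536 BMP code points.
    print(1024 * 1024 + 65536)  # 1114112

    # A 7-byte UTF-8 unit would carry 6 × 6 = 36 payload bits.
    print(2 ** 36)              # 68719476736, roughly 68 billion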