How many characters can UTF-8 encode?

Solution 1

UTF-8 does not always use one byte; it uses 1 to 4 bytes per character.

The first 128 characters (US-ASCII) need one byte.

The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.

Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use, including most Chinese, Japanese and Korean (CJK) characters.

Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
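To see those byte counts in practice, here is a rough check using Python's built-in UTF-8 encoder; the sample characters are just arbitrary picks from each range:

    # Illustrative sketch: one arbitrary sample character from each byte-length range.
    samples = ["A", "Ω", "漢", "𝄞"]   # US-ASCII, Greek, CJK (BMP), outside the BMP
    for ch in samples:
        encoded = ch.encode("utf-8")
        hex_bytes = " ".join(f"{b:02x}" for b in encoded)
        print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) -> {hex_bytes}")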

source: Wikipedia

Solution 2

UTF-8 uses 1-4 bytes per character: one byte for ASCII characters (the first 128 Unicode values are the same as ASCII), which need only 7 bits. If the highest bit is set, this indicates the start of a multi-byte sequence: the number of consecutive high bits set indicates the number of bytes in the sequence, followed by a 0, and the remaining bits of that byte contribute to the value. In each of the other (continuation) bytes, the highest two bits are 1 and 0 and the remaining 6 bits are for the value.

So a four-byte sequence would begin with 11110... (where ... = three bits for the value), followed by three bytes with 6 bits each for the value, yielding a 21-bit value. 2^21 exceeds the number of Unicode characters, so all of Unicode can be expressed in UTF-8.
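If it helps, here is a minimal sketch of that bit-packing scheme in Python (utf8_encode is just a made-up name for illustration, and it does no validation: surrogates and values above 0x10FFFF are not rejected), checked against the built-in encoder:

    # Minimal sketch of the packing described above; no validation is performed.
    def utf8_encode(cp: int) -> bytes:
        if cp < 0x80:                       # 0xxxxxxx
            return bytes([cp])
        elif cp < 0x800:                    # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6),
                          0x80 | (cp & 0x3F)])
        elif cp < 0x10000:                  # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:                               # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])

    # Sanity check against Python's built-in encoder.
    for ch in "A", "Ω", "€", "😀":
        assert utf8_encode(ord(ch)) == ch.encode("utf-8")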

Solution 3

Unicode vs UTF-8

Unicode maps code points to characters; UTF-8 is a storage mechanism for Unicode. Each has its own spec, and each has its own limit: UTF-8's upper bound is different from Unicode's.

Unicode

Unicode is organized into "planes." Each plane carries 2^16 code points, and there are 17 planes in Unicode, for a total of 17 * 2^16 code points. The first plane, plane 0 or the BMP, is special in the weight of what it carries.

Rather than explain all the nuances, let me just quote the above article on planes.

The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment.

UTF-8

Now let's go back to the article linked above,

The encoding scheme used by UTF-8 was designed with a much larger limit of 2^31 code points (32,768 planes), and can encode 2^21 code points (32 planes) even if limited to 4 bytes. Since Unicode limits the code points to the 17 planes that can be encoded by UTF-16, code points above 0x10FFFF are invalid in UTF-8 and UTF-32.

So you can see that you can put stuff into UTF-8 that isn't valid Unicode. Why? Because UTF-8 accommodates code points that Unicode doesn't even support.

UTF-8, even with a four-byte limitation, supports 2^21 code points, which is far more than 17 * 2^16.
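To put rough numbers on that, here is a small sketch comparing the two limits and showing that Python (which follows the RFC 3629 / Unicode cap) refuses a code point just past U+10FFFF:

    # 4-byte UTF-8 could address 2^21 code points; Unicode defines 17 * 2^16.
    print(2 ** 21)        # 2097152
    print(17 * 2 ** 16)   # 1114112

    # Code points above U+10FFFF are invalid Unicode, so Python refuses them.
    try:
        chr(0x110000)     # one past the last Unicode code point
    except ValueError as err:
        print("rejected:", err)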

Solution 4

According to this table* UTF-8 should support:

2^31 = 2,147,483,648 characters

However, RFC 3629 restricted the possible values, so now we're capped at 4 bytes, which gives us

2^21 = 2,097,152 characters

Note that a good chunk of those characters are "reserved" for custom use, which is actually pretty handy for icon-fonts.
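For the curious, here is a small sketch adding up Unicode's three Private Use ranges (the "reserved for custom use" code points; the total matches the 137,468 private-use figure quoted in Solution 3) and checking the category of a code point in the BMP Private Use Area, which is where icon fonts typically put their glyphs:

    import unicodedata

    # Unicode's three Private Use ranges.
    private_use = [
        (0xE000, 0xF8FF),      # BMP Private Use Area (where icon fonts usually live)
        (0xF0000, 0xFFFFD),    # Plane 15, Supplementary Private Use Area-A
        (0x100000, 0x10FFFD),  # Plane 16, Supplementary Private Use Area-B
    ]
    print(sum(hi - lo + 1 for lo, hi in private_use))   # 137468

    print(unicodedata.category("\ue000"))               # 'Co' = private use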

* Wikipedia used to show a table with 6 bytes -- they've since updated the article.

2017-07-11: Corrected for double-counting the same code point encoded with multiple bytes

Solution 5

2,164,864 “characters” can potentially be coded by UTF-8.

This number is 2^7 + 2^11 + 2^16 + 2^21, which comes from the way the encoding works:

  • 1-byte chars have 7 bits for encoding 0xxxxxxx (0x00-0x7F)

  • 2-byte chars have 11 bits for encoding 110xxxxx 10xxxxxx (0xC0-0xDF for the first byte; 0x80-0xBF for the second)

  • 3-byte chars have 16 bits for encoding 1110xxxx 10xxxxxx 10xxxxxx (0xE0-0xEF for the first byte; 0x80-0xBF for continuation bytes)

  • 4-byte chars have 21 bits for encoding 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (0xF0-0xF7 for the first byte; 0x80-0xBF for continuation bytes)

As you can see this is significantly larger than current Unicode (1,112,064 characters).

UPDATE

My initial calculation is wrong because it doesn't consider additional rules. See comments to this answer for more details.
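To save readers a trip through the comments, here is a sketch of both counts: the naive bit-pattern total above, and the count when only shortest-form encodings are allowed and the UTF-16 surrogates are excluded (which lands on the 1,112,064 figure mentioned earlier). This assumes no restrictions beyond those two rules:

    # Naive total: every bit pattern for every sequence length. This double-counts,
    # because the shortest-form rule makes overlong encodings invalid.
    naive = 2**7 + 2**11 + 2**16 + 2**21
    print(naive)    # 2164864

    # Valid code-point range per sequence length under the shortest-form rule.
    ranges = [
        (0x0000, 0x007F),     # 1 byte
        (0x0080, 0x07FF),     # 2 bytes
        (0x0800, 0xFFFF),     # 3 bytes (still includes the 2,048 UTF-16 surrogates)
        (0x10000, 0x10FFFF),  # 4 bytes
    ]
    valid = sum(hi - lo + 1 for lo, hi in ranges) - 2048   # drop the surrogates
    print(valid)    # 1112064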

Comments

  • eMRe
    eMRe over 3 years

    If UTF-8 is 8 bits, does that not mean there can be a maximum of only 256 different characters?

    The first 128 code points are the same as in ASCII. But it says UTF-8 can support up to a million characters?

    How does this work?

  • CodeClown42
    CodeClown42 almost 8 years
    @NickL. No, I mean 3 bytes. In that example, if the first byte of a multibyte sequence begins with 1111, the first 1 indicates that it is the beginning of a multibyte sequence, then the number of consecutive 1's after that indicates the number of additional bytes in the sequence (so a first byte will begin with either 110, 1110, or 11110).
  • Boris Verkhovskiy
    Boris Verkhovskiy almost 7 years
    This is misleading. The longest code point you can have is 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, so only 21 bits can be used for encoding the actual character.
  • Gromski
    Gromski almost 7 years
    I said code points may take up to 32 bits to be encoded, I never claimed that (by induction) you can encode 2^32 characters in 32 bit UTF-8. But that is rather moot, since you can encode all existing Unicode characters in UTF-8, and you can encode even more if you stretch UTF-8 to 48 bits (which exists but is deprecated), so I'm not sure what the misleading point is.
  • Jimmy
    Jimmy almost 7 years
    This answer is double counting the number of encodings possible. Once you have counted all 2^7, you cannot count them again in 2^11, 2^16, etc. The correct number of encodings possible is 2^21 (though not all are currently being used).
  • mpen
    mpen almost 7 years
    @Jimmy You sure I'm double counting? 0xxxxxxx gives 7 usable bits, 110xxxxx 10xxxxxx gives 11 more -- there's no overlap. The first byte starts with 0 in the first case, and 1 in the second case.
  • Evan Carroll
    Evan Carroll almost 7 years
    @mpen so what code point does 00000001 store and what does 11000000 10000001 store?
  • mpen
    mpen almost 7 years
    @EvanCarroll Uhh....point taken. Didn't realize there were multiple ways to encode the same code point.
  • Tom Blodget
    Tom Blodget over 6 years
    Your math doesn't respect the UTF-8 rule that only the shortest code unit sequence is allowed to encode a codepoint. So, 00000001 is valid for U+0001 but 11110000 10000000 10000000 10000001 is not. Ref: Table 3-7. Well-Formed UTF-8 Byte Sequences. Besides, the question is directly answered by the table: you just add up the ranges. (They are disjoint to exclude surrogates for UTF-16).
  • Ruben Reyes
    Ruben Reyes over 6 years
    Tom - thanks for your comment! I was unaware of those restrictions. I saw table 3-7 and ran the numbers and it looks like there are 1,083,392 possible valid sequences.
  • kolobok
    kolobok over 6 years
    Found proof for your words in RFC 3629. tools.ietf.org/html/rfc3629#section-3 . However, I don't understand why I need to place "10" at the beginning of the second byte 110xxxxx 10xxxxxx. Why not just 110xxxxx xxxxxxxx?
  • kolobok
    kolobok over 6 years
    Found answer in softwareengineering.stackexchange.com/questions/262227/… . Just for safety reasons (in case a single byte in the middle of the stream is corrupted)
  • CodeClown42
    CodeClown42 over 6 years
    @kolobok Ah. Sans safety you could then encode a 21-bit value in 3 bytes (3 bits indicating the length, plus 21-bits). :D Probably that is not so meaningful though, at least WRT Western languages.
  • Tom Blodget
    Tom Blodget over 5 years
    21 bits is rounded up. Unicode supports 1,114,112 codepoints (U+0000 to U+10FFFF) like it says. (Sometimes described as 17 planes of 65536.)
  • Display name
    Display name over 5 years
    @TomBlodget, You are correct. The most relevant takeaway from this discussion is that UTF-8 can encode all the currently defined code points in the Unicode standard and will likely be able to do so for quite some time to come.
  • chiperortiz
    chiperortiz about 5 years
    Hi @zwippie, I'm new to this. There is something I don't get: the BMP uses 2 bytes, but you say it's 3? Am I wrong?
  • sanderd17
    sanderd17 about 5 years
    @chiperortiz, the BMP is indeed 16 bits, so it can be encoded as UTF-16 with a constant length per character (UTF-16 also supports going beyond 16 bits, but it's a difficult practice, and many implementations don't support it). However, with UTF-8 you also need to encode how long the sequence will be, so you lose some bits. That is why you need 3 bytes to encode the complete BMP. This may seem wasteful, but remember that UTF-16 always uses 2 bytes, whereas UTF-8 uses one byte per character for most Latin-based language characters, making it twice as compact.
  • c6754
    c6754 almost 5 years
    I'm guessing that NickL asked this but what happened to the rest of the bits in that first byte if the ... represents subsequent bytes instead of bits?
  • CodeClown42
    CodeClown42 almost 5 years
    @NickL "So a four byte sequence would begin with 11110... (... = three bytes for the value)" should have read "...= three bits" (thanks). This is why a 4 byte utf8 character has a 21-bit value (3 + 6 + 6 + 6).
  • daka
    daka over 4 years
    Thanks! Future readers use this
  • jbyrd
    jbyrd about 4 years
    The main thrust of the OP's question is related to why it is called UTF-8 -- this doesn't really answer that.
  • Manu Manjunath
    Manu Manjunath over 3 years
    This is an accurate answer. Other answers have just stopped at 2^21 and forgot the rest of the combinations possible.
  • Timo
    Timo over 3 years
    @sanderd17 The difference between UTF-8 and UTF-16 is that UTF-8 can use a minimum of 1 byte (sufficient for Latin), whereas UTF-16 needs at least 2 bytes?
  • Lev Lukomsky
    Lev Lukomsky over 2 years
    Yes, theoretically there are 2^7 + 2^11 + 2^16 + 2^21 symbols, but a lot of them are invalid under the UTF-8 rules, so in the end it is < 2^21. UTF-8 must encode 1,114,112 Unicode symbols, and 2^21 is enough. Though many forget that there are symbol combinations – diacritic modifiers, skin tone modifiers, zero-width joiners, flags, etc. UTF-8 symbols are 1-4 bytes, but a combination of these symbols could be 6/8/20/... bytes, which can express many more "visible" symbols.
  • duketwo
    duketwo about 2 years
    Why is everyone writing a long wall of text when the technical description can fit in a small paragraph like this one! Thumbs up!