What is the maximum number of bytes for a UTF-8 encoded character?


Solution 1

The maximum number of bytes per character is 4, according to RFC 3629, which limited the code point range to U+10FFFF:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets.

(The original specification allowed for up to six byte character codes for code points past U+10FFFF.)

Characters with a code point below 128 require only 1 byte, and the next 1,920 code points require only 2 bytes. Unless you are working with an esoteric language, multiplying the character count by 4 will be a significant overestimation.
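
For illustration, here is a minimal Java sketch comparing the exact UTF-8 byte count of a few strings with the 4-bytes-per-code-point upper bound (the class name Utf8Bound and the sample strings are just placeholders for demonstration):

    import java.nio.charset.StandardCharsets;

    public class Utf8Bound {
        public static void main(String[] args) {
            // "hello" is ASCII only (1 byte each); "héllo" contains one 2-byte
            // character (U+00E9); the last sample ends with U+1F600, which
            // takes 4 bytes in UTF-8.
            String[] samples = { "hello", "h\u00E9llo", "hi\uD83D\uDE00" };
            for (String s : samples) {
                int exact = s.getBytes(StandardCharsets.UTF_8).length;
                int upperBound = 4 * s.codePointCount(0, s.length());
                System.out.println(exact + " byte(s), upper bound " + upperBound);
            }
        }
    }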

Solution 2

Without further context, I would say that the maximum number of bytes for a character in UTF-8 is

answer: 6 bytes

The author of the accepted answer correctly pointed this out as the "original specification". That was valid through RFC 2279. As J. Cocoe pointed out in the comments below, this changed in 2003 with RFC 3629, which limits UTF-8 to code points of at most 21 bits, which the encoding scheme can cover with four bytes.

answer if covering all unicode: 4 bytes

But in Java <= v7, they talk about a 3-byte maximum for representing unicode with UTF-8. That's because the original unicode specification only defined the Basic Multilingual Plane (BMP); i.e., it is an older version of unicode, or a subset of modern unicode. So

answer if representing only original unicode, the BMP: 3 bytes
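
A compact way to see how these three answers relate is the RFC 3629 length rule itself. A sketch, assuming valid scalar values; utf8Length is a hypothetical helper name:

    public class Utf8Length {
        // Bytes needed by RFC 3629 UTF-8 for one code point
        // (assumes a valid scalar value, i.e. not a lone surrogate).
        static int utf8Length(int codePoint) {
            if (codePoint <= 0x7F)   return 1; // ASCII
            if (codePoint <= 0x7FF)  return 2; // e.g. most Latin-script extensions
            if (codePoint <= 0xFFFF) return 3; // rest of the BMP (the "3 bytes" answer)
            return 4;                          // U+10000..U+10FFFF (the "4 bytes" answer)
        }

        public static void main(String[] args) {
            System.out.println(utf8Length('A'));     // 1
            System.out.println(utf8Length(0x00E9));  // 2 (é)
            System.out.println(utf8Length(0x4E2D));  // 3 (中)
            System.out.println(utf8Length(0x1F600)); // 4 (😀)
        }
    }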

But the OP talks about going the other way: not from characters to UTF-8 bytes, but from UTF-8 bytes to a native "String" representation. Perhaps the author of the accepted answer got that from the context of the question, but it is not necessarily obvious, so it may confuse the casual reader of this question.

Going from UTF-8 to the native encoding, we have to look at how the "String" is implemented. Some languages, like Python >= 3, represent each character with integer code points, which allows for 4 bytes per character = 32 bits to cover the 21 we need for unicode, with some waste. Why not exactly 21 bits? Because things are faster when they are byte-aligned. Some languages, like Python <= 2 and Java, represent characters using a UTF-16 encoding, which means that they have to use surrogate pairs to represent characters outside the BMP. Either way that's still 4 bytes maximum.

answer if going UTF-8 -> native encoding: 4 bytes
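
To make the UTF-16 surrogate-pair point concrete, here is a small Java sketch (U+1F600 is chosen arbitrarily as a non-BMP example; the class name is a placeholder):

    import java.nio.charset.StandardCharsets;

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1F600 lies outside the BMP, so Java's UTF-16 String
            // stores it as a surrogate pair (two chars = 4 bytes).
            String s = new String(Character.toChars(0x1F600));
            System.out.println(s.length());                                // 2 chars
            System.out.println(s.codePointCount(0, s.length()));           // 1 code point
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4 bytes in UTF-8
        }
    }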

So, final conclusion: 4 is the most common right answer, so we got it right. But mileage could vary.


Comments

  • Edd
    Edd over 4 years

    What is the maximum number of bytes for a single UTF-8 encoded character?

    I'll be encrypting the bytes of a String encoded in UTF-8 and therefore need to be able to work out the maximum number of bytes for a UTF-8 encoded String.

    Could someone confirm the maximum number of bytes for a single UTF-8 encoded character please

  • Daniel Marschall
    Daniel Marschall almost 10 years
    What is an "esoteric language" for you? Any language that exists in the real world, or a text which switches between different languages of the world? Should a developer of a UTF-8-to-String function choose 2, 3 or 4 as the multiplier if he does an over-allocation and then downsizes the result after the actual conversion?
  • matiu
    matiu over 9 years
    @rinntech by 'esoteric language' he means a language that has a lot of high value unicode chars (something from near the bottom of this list: unicode-table.com/en/sections ). If you must over-allocate, choose 4. You could do a double pass, one to see how many bytes you'll need and allocate, then another to do the encoding; that may be better than allocating ~4 times the RAM needed.
  • Evgen Bodunov
    Evgen Bodunov over 8 years
    Always try to handle worst case: hacker9.com/single-message-can-crash-whatsapp.html
  • Tgr
    Tgr about 8 years
    CJKV characters mostly take 3 bytes (with some rare/archaic characters taking 4 bytes) and calling them esoteric is a bit of a stretch (China alone is almost 20% of the world's population...).
  • J. Cocoe
    J. Cocoe over 7 years
    "this is still the current and correct specification, per wikipedia" -- not any more. Shortly after you wrote this (April 2nd edit), Wikipedia's UTF-8 article was changed to clarify that the 6-octet version isn't part of the current (2003) UTF-8 spec.
  • J. Cocoe
    J. Cocoe over 7 years
    "But, in Java <= v7, they talk about a 3-byte maximum for representing unicode with UTF-8? That's because the original unicode specification only defined the basic multi-lingual plane" -- That is probably the original reason, but it's not the whole story. Java uses "modified UTF-8", and one of the modifications is that it "uses its own two-times-three-byte format" instead of "the four-byte format of standard UTF-8" (their words).
  • thomasrutter
    thomasrutter almost 7 years
    There are no codepoints allocated above the U+10FFFF (just over a million) limit and many UTF-8 implementations never implemented sequences longer than 4 bytes (and some only 3, e.g. MySQL), so I would consider it safe to hard limit to 4 bytes per codepoint even when considering compatibility with older implementations. You would just need to ensure you discard anything invalid on the way in. Note that matiu's recommendation of allocating after calculating the exact byte length is a good one where possible.
  • neuralmer
    neuralmer over 6 years
    "... [U]nicode can represent up to x10FFFF code points. So, including 0, that means we can do it with these bytes: F FF FF, i.e. two-and-a-half bytes, or 20 bits." I believe this is a bit incorrect. The number of code points from 0x0 through 0x10FFFF would be 0x110000, which could be represented in 1F FF FF, or 21 bits. The 0x110000 number corresponds to the 17 planes of 0x10000 code points each.
  • ytti
    ytti about 6 years
    I wonder if the answer is for the question 'how many bytes can a UTF-8 codepoint be'. I think for 'how many bytes can a UTF-8 character be' the answer is infinite, because multiple codepoints can combine into a single character?
  • Nyerguds
    Nyerguds about 5 years
    PSA: Wikipedia is not a real source. Look at the article's actual references.
  • Aaron Franke
    Aaron Franke over 4 years
    Why was it limited to 4 when it was previously 6? What stops us from continuing the standard and having a lead byte of 11111111 and having a 2^(6*7) bit space for characters?
  • David Spector
    David Spector over 4 years
    Note: the paragraph about not using a fixed-width array of characters is my own opinion. I'm willing to edit this answer in response to comments.
  • Dávid Horváth
    Dávid Horváth over 3 years
    Did you mean "exotic"? Or is this some kind of code golf?
  • user904963
    user904963 about 2 years
    Also note that Klingon is in unicode too, so it's not just all human language. As for your recommendation, it will all come down to what you're optimizing for and what benchmarks tell you. Sometimes it's faster to rip through a known number of bytes without conditional logic or branching. Branching can harm performance severely. If you preprocessed it, you'd still have to do the branching, but at least the heavier computation would be ripping through contiguous memory with zero branches. If you want to optimize for space, it's not a good idea though.
  • user904963
    user904963 about 2 years
    @ytti The question is about what UTF8 is required to use. It's standardized, and the answer is 4 bytes. Of course, you can come up with your own "Unicode" that no one could process and use any number of bytes that you want. It wouldn't be Unicode just like me asserting a byte with a value of 1 is 'z' isn't Unicode either.
  • user904963
    user904963 about 2 years
    @AaronFranke They could expand it if it were needed for some reason, but it's a standardized thing with guarantees that no more than 4 bytes shall be used.
  • David Spector
    David Spector about 2 years
    Klingon is a human language, meaning that it was designed by Marc Okrand and other humans to achieve human purposes. Klingon is not an extraterrestrial language, since the planet Klingon does not exist. As to your apparent defense of the common practice of using six-byte arrays for internal handling of characters, we will have to agree to disagree. Such limits are bugs.
  • user904963
    user904963 about 2 years
    With UTF-8 encoding, the max number of bytes is 4. Depending on the symbols used, you can get away with 1 byte (e.g. English with punctuation) or 2 bytes (if you know there aren't emoji, Chinese, Japanese, etc.). The advantage of preprocessing comes into play more strongly if you run algorithms on the text multiple times. Otherwise, you will have a bunch of branching each time you run an algorithm (although your CPU's branch predictor will help a lot if the symbols used result in predictable branching). I didn't say preprocessing is better, only that it can be and testing is needed.
  • David Spector
    David Spector about 2 years
    The minimum number of bytes needed when using a fixed-length array is 6 if you wish to encode emoji, which are quite popular these days. In my own coding, I have found that there is no need to program using fixed-length arrays at all. Whatever you are trying to do can probably be achieved using either byte-oriented programming or by obtaining the actual character length by scanning the UTF-8 bytes.
  • Mikko Rantalainen
    Mikko Rantalainen almost 2 years
    UTF-8 is limited to 4 bytes because the maximum allowed codepoint is U+10FFFF. UTF-8 would need 6 bytes to encode up to U+FFFFFFFF, but because codepoints above U+10FFFF cannot be expressed at all using UTF-16 (UTF-16LE or UTF-16BE), the Unicode Consortium has decided to stay within the limits of UTF-16 encoding for all codepoints.
  • Mikko Rantalainen
    Mikko Rantalainen almost 2 years
    There also exists CESU-8 which is UTF-8 with a twist that it uses surrogate bytes encoded in UTF-8. That results in some characters taking up to 6 bytes. I think some Java or Javascript programs use it as a workaround to support characters outside BMP. Officially it shouldn't be used in any wire format but in practice it leaks to other systems every now and then. Usually this happens when the original author just tried to output UTF-8 and the code was written without understanding what happens with characters outside BMP.