Java - what are characters, code points and surrogates? What difference is there between them?

Solution 1

To represent text in computers, you have to solve two problems: first, you have to map symbols to numbers; then, you have to represent a sequence of those numbers with bytes.

A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode currently defines 109384 symbols, which is far more than 2^16 (65536).

Furthermore, ASCII specifies that number sequences are represented one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.
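
To make the "several possibilities" concrete, here is a small sketch that measures how many bytes the same symbols take in UTF-8 and UTF-16. The class name `EncodingSizes`, the helper `byteLength`, and the sample symbols are made up for this illustration; only `String.getBytes(Charset)` and `StandardCharsets` come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {

    /** Number of bytes needed to store the text in the given encoding. */
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Escapes are used so the example does not depend on the source file encoding.
        String[] samples = {
            "A",            // U+0041, also an ASCII symbol
            "\u00E9",       // U+00E9, e with acute accent
            "\u4F60",       // U+4F60, a CJK ideograph
            "\uD834\uDD1E"  // U+1D11E, musical symbol G clef
        };
        for (String s : samples) {
            // UTF_16BE is used because it does not prepend a byte order mark.
            System.out.printf("U+%04X  UTF-8: %d bytes  UTF-16: %d bytes%n",
                    s.codePointAt(0),
                    byteLength(s, StandardCharsets.UTF_8),
                    byteLength(s, StandardCharsets.UTF_16BE));
        }
    }
}
```

For these four samples UTF-8 needs 1, 2, 3 and 4 bytes, while UTF-16 needs 2, 2, 2 and 4 bytes. UTF-32 would need 4 bytes for each of them.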

When you try to use an encoding which uses fewer bits per character than are needed to represent all possible values (such as UTF-16, which uses 16-bit units), you need some workaround.

Surrogates are that workaround in UTF-16: reserved 16-bit values that are used in pairs to represent symbols that do not fit into a single two-byte value.

Java uses UTF-16 internally to represent text.

In particular, a char (character) is an unsigned two-byte value that contains a UTF-16 value.
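
As a rough sketch of how a symbol that does not fit into one char is split into two surrogates, the arithmetic defined by UTF-16 can be written out by hand. The class `SurrogateMath` and its two methods are hypothetical names for this example; the JDK already offers the same computation as `Character.highSurrogate` and `Character.lowSurrogate`, which the example uses as a cross-check.

```java
class SurrogateMath {

    /** First (high) surrogate of a code point above U+FFFF, computed by hand. */
    static char highSurrogate(int codePoint) {
        int offset = codePoint - 0x10000;           // leaves a 20-bit value
        return (char) (0xD800 + (offset >> 10));    // upper 10 bits
    }

    /** Second (low) surrogate of a code point above U+FFFF, computed by hand. */
    static char lowSurrogate(int codePoint) {
        int offset = codePoint - 0x10000;
        return (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // musical symbol G clef
        System.out.printf("U+%X -> %04X %04X%n",
                clef, (int) highSurrogate(clef), (int) lowSurrogate(clef));

        // Cross-check against the JDK's own implementation.
        System.out.println(highSurrogate(clef) == Character.highSurrogate(clef));
        System.out.println(lowSurrogate(clef) == Character.lowSurrogate(clef));
    }
}
```

For U+1D11E this yields the pair D834 DD1E. Both methods assume the argument is a valid code point above U+FFFF and do no range checking.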

If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2

Solution 2

You can find a short explanation in the Javadoc for the class java.lang.Character:

Unicode Character Representations

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. [..]

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

In other words:

A code point usually represents a single character. Originally, the values of type char matched exactly the Unicode code points. This encoding was also known as UCS-2.

For that reason, char was defined as a 16-bit type. However, there are currently more than 2^16 characters in Unicode. To support the whole character set, the encoding was changed from the fixed-length encoding UCS-2 to the variable-length encoding UTF-16. Within this encoding, each code point is represented by a single char or by two chars. In the latter case, the two chars are called a surrogate pair.

UTF-16 was defined in such a way that there is no difference between text encoded with UTF-16 and UCS-2 if all code points are below 2^16 (that is, within the BMP). That means char can be used to represent some but not all characters. If a character cannot be represented within a single char, the term char is misleading, because it is just used as a 16-bit word.
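
A quick way to see where one char stops being enough is to ask the JDK how many chars a given code point needs. This is only a sketch: `CharsNeeded` and `describe` are names invented here, while `Character.charCount`, `Character.isBmpCodePoint` and `Character.isSupplementaryCodePoint` are standard methods.

```java
class CharsNeeded {

    /** Human-readable summary of how many chars UTF-16 needs for a code point. */
    static String describe(int codePoint) {
        return String.format("U+%04X needs %d char(s)",
                codePoint, Character.charCount(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0041));   // Latin capital letter A
        System.out.println(describe(0xFFFF));   // last code point of the BMP
        System.out.println(describe(0x10000));  // first supplementary code point
        System.out.println(describe(0x1D11E));  // musical symbol G clef

        // The same boundary, expressed as boolean checks.
        System.out.println(Character.isBmpCodePoint(0xFFFF));
        System.out.println(Character.isSupplementaryCodePoint(0x10000));
    }
}
```

Code points up to U+FFFF report one char and everything above reports two, which is exactly the point at which UTF-16 and UCS-2 stop being interchangeable.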

Solution 3

The term code point typically refers to Unicode codepoints. The Unicode glossary says this:

Codepoint (1): Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆.

In Java, a character (char) is an unsigned 16 bit value; i.e. 0 to FFFF (hexadecimal).

As you can see, there are more Unicode codepoints than can be represented as Java characters. And yet Java needs to be able to represent text using all valid Unicode codepoints.

The way that Java deals with this is to represent codepoints that are larger than FFFF as a pair of characters (code units); i.e. a surrogate pair. This uses the fact that a subrange of the Unicode code space (i.e. U+D800 to U+DFFF) is reserved for surrogates, so the two 16 bit values of a pair cannot be mistaken for ordinary characters. The technical details are here.
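
Going in the other direction, from a pair of chars back to the codepoint, might look like the sketch below. `DecodePair` and `decodeAt` are hypothetical names; the method is meant to mirror what the standard `String.codePointAt` already does, using the standard helpers `Character.isHighSurrogate`, `Character.isLowSurrogate` and `Character.toCodePoint`.

```java
class DecodePair {

    /**
     * Codepoint that starts at the given char index. A valid surrogate pair
     * is combined; anything else is returned as the single char value.
     */
    static int decodeAt(String text, int index) {
        char first = text.charAt(index);
        if (Character.isHighSurrogate(first) && index + 1 < text.length()) {
            char second = text.charAt(index + 1);
            if (Character.isLowSurrogate(second)) {
                return Character.toCodePoint(first, second);
            }
        }
        return first; // BMP character or an unpaired surrogate
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E as a surrogate pair
        System.out.printf("charAt(0)      = %04X%n", (int) clef.charAt(0));
        System.out.printf("decodeAt(0)    = %X%n", decodeAt(clef, 0));
        System.out.printf("codePointAt(0) = %X%n", clef.codePointAt(0));
    }
}
```

In everyday code there is no reason to write this by hand; calling `codePointAt` directly is the simpler choice.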


The proper term for the encoding that Java is using is the UTF-16 Encoding Form.

Another term that you might see is code unit which is the minimum representational unit used in a particular encoding. In UTF-16 the code unit is 16 bits, which corresponds to a Java char. Other encodings (e.g. UTF-8, ISO 8859-1, etc) have 8 bit code units, and UTF-32 has a 32 bit code unit.


The term character means different things in different contexts. The Unicode glossary gives 4 meanings for Character as follows:

Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding.

Character. (2) Synonym for abstract character. (Abstract Character. A unit of information used for the organization, control, or representation of textual data.)

Character. (3) The basic unit of encoding for the Unicode character encoding.

Character. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]

And then there is the Java specific meaning for character; i.e. a 16 bit unsigned number (of type char) that may represent a complete Unicode codepoint or only one half of a surrogate pair in UTF-16 encoding.
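
That char is unsigned (0 to FFFF, as stated earlier in this answer) can be checked with a few casts. This is a small sketch with made-up names (`CharIsUnsigned`, `widen`, `asShort`); it only relies on the language's ordinary conversion rules and on `Character.MAX_VALUE`.

```java
class CharIsUnsigned {

    /** Widens a char to int; the result is never negative. */
    static int widen(char c) {
        return c;
    }

    /** Reinterprets the same 16 bits as a signed short. */
    static short asShort(char c) {
        return (short) c;
    }

    public static void main(String[] args) {
        char max = Character.MAX_VALUE;     // U+FFFF, all 16 bits set
        System.out.println(widen(max));     // prints 65535
        System.out.println(asShort(max));   // prints -1
        System.out.println(widen((char) -1) == 65535); // prints true
    }
}
```

The comparison with short is the reason the distinction matters: both types are 16 bits wide, but only short treats the top bit as a sign.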

Solution 4

To begin with, Unicode is a standard which tries to define and map all individual characters from all languages, from English letters to Chinese characters, numbers, symbols etc.

Basically, Unicode has a long list of numbered characters, where the code point is a character's number in that list.

In short

  • Characters are the individual tokens in a text, whether letter, number or symbol.
  • A code point refers to the number of a token in the Unicode standard.
  • Unicode houses so many characters that, in the UTF-16 encoding scheme, not all of them fit in the allotted space of a single Java char.
  • Surrogate pair is the term used to say that one character is listed so high in the Unicode table that it needs the space of a pair of chars to represent it.

Solution 5

Simply put:

  • A code unit is a char. It takes 2 bytes and holds one UTF-16 value; a single char does not necessarily represent a complete real world character.
  • A code point identifies a real world character. It may consist of 1 or 2 code units; think of it as an int, which takes 4 bytes.

Let the code (test case) tell the truth:
(needs Java 8+, because the methods codePoints() and chars() used on String were added in that release)

// Assumption: JUnit 4 is on the classpath; the class name is arbitrary.
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class CodePointTest {

    @Test
    public void test() {
        String s = "Hi, 你好, おはよう, α-Ω\uD834\uDD1E"; // the last real character is "𝄞", which takes 2 code units
        assertEquals(s.length(), s.toCharArray().length); // length() counts chars (aka code units), not code points

        System.out.printf("input string:\t\"%s\"%n%n", s);

        System.out.println("------ as code point (aka. real character) ------");
        // iterate by code point
        s.codePoints().forEach(cp -> System.out.println(Character.toChars(cp)));
        assertEquals(s.codePoints().count(), s.length() - 1); // the last real character takes 2 code units
        assertEquals(s.codePoints().count(), s.codePointCount(0, s.length())); // codePointCount() counts the code points in a given char range

        System.out.println("\n------ as char (aka. code unit) ------");
        // iterate by char (aka code unit)
        s.chars().forEach(c -> System.out.println(Character.toChars(c)));
        assertEquals(s.chars().count(), s.length()); // string length is the count of code units, not code points
    }
}

Output:

input string:   "Hi, 你好, おはよう, α-Ω𝄞"

------ as code point (aka. real character) ------
H
i
,
 
你
好
,
 
お
は
よ
う
,
 
α
-
Ω
𝄞

------ as char (aka. code unit) ------
H
i
,
 
你
好
,
 
お
は
よ
う
,
 
α
-
Ω
?
?

The last real character is 𝄞. It is a single code point that takes 2 code units, \uD834\uDD1E. When the 2 code units are printed separately, neither of them is a valid character on its own, so each one shows up as ?.
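
If a classic index-based loop is preferred over the stream, one possible way to step through a string by code point rather than by char is sketched below. `CodePointWalk` and `codePointsOf` are names chosen for this example; `String.codePointAt` and `Character.charCount` are the standard methods doing the work.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointWalk {

    /** Collects the code points of a string, advancing 1 or 2 chars per step. */
    static List<Integer> codePointsOf(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            result.add(codePoint);
            i += Character.charCount(codePoint); // 2 for a surrogate pair, else 1
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 'a', U+1D11E, 'b'
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + codePointsOf(s).size());
        for (int codePoint : codePointsOf(s)) {
            System.out.printf("U+%04X%n", codePoint);
        }
    }
}
```

The sample string has 4 chars but only 3 code points. A loop over charAt(i) would instead hand out the two surrogates separately.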

Author by

Alium Britt

Updated on July 09, 2022

Comments

  • Alium Britt
    Alium Britt almost 2 years

    I'm trying to find an explanation of the terms "character", "code point" and "surrogate", and while these terms aren't limited to Java, if there are any language-specific differences I'd like the explanation as it relates to Java.

    I've found some information about the differences between characters and code points, characters being what is displayed for human users, and code points being a value encoding that specific character, but I have no idea about surrogates. What are surrogates, and how are they different from characters and code points? Do I have the right definitions for characters and code points?

    In another thread about stepping through a string as an array of characters, the specific comment that prompted this question was "Note that this technique gives you characters, not code points, meaning you may get surrogates." I didn't really understand, and rather than create a long series of comments on a 5-year-old question I thought it would be best to ask for clarification in a new question.

    • Alium Britt
      Alium Britt almost 10 years
      All of these answers so far have added more to my understanding of the terms in my question, so while I'm picking one "answer", I think they all helped me.
  • Alium Britt
    Alium Britt almost 10 years
    And if I remember correctly, 8 bits = 1 byte, so that would put UTF-8 as 1 byte per character, UTF-16 as 2 bytes, and UTF-32 as 4 bytes correct?
  • Alium Britt
    Alium Britt almost 10 years
    In that case, would "surrogate" as I phrased it be equivalent to "surrogate pair", since there would always be two if I wanted the representation of character?
  • Johan Sjöberg
    Johan Sjöberg almost 10 years
    @AliumBritt Not quite so easy. UTF-8/16 are roughly equivalent, different mechanics. UTF-8 is 1-4 bytes with UTF-16 being 2 bytes.
  • Cephalopod
    Cephalopod almost 10 years
    @AliumBritt UTF-8 and -16 use 1 or 2 bytes when possible, but for the higher code points, using 4 bytes is inevitable.
  • Stephen C
    Stephen C almost 10 years
    @Cephalopod - Nitpick: Strictly speaking a UTF-8 "code point" can be up to 6 bytes ... except that bytes 5 and 6 are only required for "planes" that are outside of the official Unicode codepoint space. (And they've said they will never go there ...)
  • Cephalopod
    Cephalopod almost 10 years
    @StephenC I think they could even go to seven bytes, given that there's still one prefix bit left. To clarify: with 4 bytes, UTF-8 can encode 2097151 code points, 20 times the number of code points that are currently defined. So 4 bytes won't be exceeded anytime soon.
  • Stephen C
    Stephen C almost 10 years
    I'm wrong. The definitive UTF-8 spec is Unicode 6.0.0, and it explicitly defines the encoding for the Unicode codepoint range only. The 5, 6 or even 7 byte forms are non-standard extensions. (And according to the Wikipedia page, extending to 7 bytes requires using one of the bytes of a BOM ... which would be a bad thing.)