Why does a Chinese character take one char (2 bytes) in Java but 3 bytes in UTF-8?


A Java char stores 16 bits of data in a two-byte value, using every bit to hold character data (strictly speaking, a char is one UTF-16 code unit). UTF-8 doesn't do this. A Chinese character falls in the range that UTF-8 encodes with three bytes: the lead byte carries only 4 data bits behind its 1110 prefix, and each of the two continuation bytes carries 6 data bits behind a 10 prefix; the remaining bits are control information that marks each byte's role in the sequence. (The split varies by character; for ASCII characters, UTF-8 packs 7 data bits into a single byte.) It's a more complicated encoding scheme, but it lets UTF-8 represent any Unicode code point, up to 21 bits, in at most four bytes (the original design went up to 31 bits). The advantage is that 7-bit (ASCII) characters take only one byte each, which makes UTF-8 backward compatible with ASCII. The cost is that it takes 3 bytes to store the 16 bits of data in a character like these. You can learn how it works by looking it up on Wikipedia.
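
To see those control bits concretely, here is a minimal sketch (the class name Utf8Breakdown is just a placeholder) that prints one of the characters from the question, 世 (U+4E16), first as the 16 bits of a Java char and then as its three UTF-8 bytes with their 1110xxxx / 10xxxxxx prefixes:

    import java.nio.charset.StandardCharsets;

    public class Utf8Breakdown {
        public static void main(String[] args) {
            char c = '世';  // U+4E16, one UTF-16 code unit = one 2-byte Java char

            // All 16 bits of the char hold character data.
            String charBits = String.format("%16s", Integer.toBinaryString(c)).replace(' ', '0');
            System.out.println("char U+4E16 : " + charBits);   // 0100111000010110

            // The same character as UTF-8: 3 bytes, each with a control prefix.
            byte[] utf8 = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
            for (byte b : utf8) {
                String byteBits = String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
                System.out.println("UTF-8 byte  : " + byteBits);
            }
            // Prints:
            //   11100100   lead byte     1110xxxx -> 4 data bits (0100)
            //   10111000   continuation  10xxxxxx -> 6 data bits (111000)
            //   10010110   continuation  10xxxxxx -> 6 data bits (010110)
            // Reassembled data bits: 0100 111000 010110 = 0100111000010110 = U+4E16
        }
    }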


Comments

  • peterboston over 1 year

    I have the following program to test how Java handles Chinese characters:

    // Requires: import java.nio.charset.Charset; import java.nio.charset.StandardCharsets;
    String s3 = "世界您好";
    char[] chs = s3.toCharArray();
    byte[] bs = s3.getBytes(StandardCharsets.UTF_8);
    byte[] bs2 = new String(chs).getBytes(StandardCharsets.UTF_8);
    
    System.out.println("encoding=" + Charset.defaultCharset().name() + ", " + s3 + " char[].length=" + chs.length
                    + ", byte[].length=" + bs.length + ", byte[]2.length=" + bs2.length);
    

    The output is this:

    encoding=UTF-8, 世界您好 char[].length=4, byte[].length=12, byte[]2.length=12

    The results are these:

    1. one Chinese character takes one char, which is 2 bytes in Java, if char[] is used to hold the Chinese characters;

    2. one Chinese character takes 3 bytes if byte[] is used to hold the Chinese characters;

    My questions are: if 2 bytes are enough, why do we use 3 bytes? And if 2 bytes are not enough, why do we use 2 bytes?

    EDIT:

    My JVM's default encoding is set to UTF-8.

  • MichaelHuelsen almost 3 years
    It is nicely described in the "Encoding" section of the UTF-8 Wikipedia article, including examples for specific characters and languages: en.wikipedia.org/wiki/UTF-8#Encoding
  • MiguelMunoz almost 3 years
    I should add that UTF-8 was also designed to be compatible with C and C++. In those languages, a byte value of zero marks the end of a string, so UTF-8 is designed so that no byte of a multi-byte character is ever zero.
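
A minimal sketch illustrating that last point (the class name NoZeroBytes is just a placeholder): it prints the UTF-8 bytes of the question's string in hex and confirms that none of them is zero, because every byte of a multi-byte UTF-8 sequence has its high bit set.

    import java.nio.charset.StandardCharsets;

    public class NoZeroBytes {
        public static void main(String[] args) {
            String s3 = "世界您好";
            byte[] utf8 = s3.getBytes(StandardCharsets.UTF_8);

            // 4 characters x 3 bytes each = 12 bytes, matching the question's output
            System.out.println("UTF-8 length: " + utf8.length);

            boolean anyZero = false;
            for (byte b : utf8) {
                System.out.printf("%02X ", b & 0xFF);   // E4 B8 96 E7 95 8C E6 82 A8 E5 A5 BD
                if (b == 0) {
                    anyZero = true;
                }
            }
            System.out.println();

            // Lead and continuation bytes all start with a 1 bit, so a C-style
            // zero terminator can never appear inside a multi-byte character.
            System.out.println("contains a zero byte: " + anyZero);   // false
        }
    }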