Which encoding does Java use: UTF-8 or UTF-16?


Solution 1

Characters are graphical entities that are part of human culture. When a computer needs to handle text, it uses a representation of those characters in bytes. The exact representation used is called an encoding.

There are many encodings that can represent the same character - encodings of the Unicode character set (such as UTF-8, UTF-16, and UTF-32), or encodings of other character sets such as the various ISO-8859 variants or JIS X 0208.
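As a small illustration of "same character, different bytes", the character é (U+00E9) exists in both Unicode encodings and ISO-8859-1, but its byte representation differs (the class name below is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // The same character é (U+00E9) in two different encodings:
        byte[] utf8   = "é".getBytes(StandardCharsets.UTF_8);      // 2 bytes: C3 A9
        byte[] latin1 = "é".getBytes(StandardCharsets.ISO_8859_1); // 1 byte:  E9
        System.out.println(utf8.length);   // 2
        System.out.println(latin1.length); // 1
    }
}
```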

Internally, Java uses UTF-16. This means that each character is represented by one or two 16-bit code units - that is, by two or four bytes. The character you were using, 最, has the code point U+6700, which is represented in UTF-16 (big-endian) as the byte 0x67 followed by the byte 0x00.
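You can observe those two bytes by asking explicitly for a big-endian UTF-16 encoding (a sketch; the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    public static void main(String[] args) {
        // 最 is U+6700; in big-endian UTF-16 that is the byte pair 67 00.
        byte[] b = "最".getBytes(StandardCharsets.UTF_16BE);
        System.out.printf("%02X %02X%n", b[0], b[1]); // 67 00
    }
}
```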

That's the internal encoding. You can't see it unless you dump your memory and look at the bytes in the dumped image.

But the method getBytes() does not return this internal representation. Its documentation says:

public byte[] getBytes()

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The "platform's default charset" is what your locale variables say it is - in your case, UTF-8. So getBytes() takes the UTF-16 internal representation and converts it into a different representation: UTF-8.
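That conversion is why you saw 3 bytes: 最 (U+6700) needs three bytes in UTF-8. Requesting UTF-8 explicitly makes the result independent of the platform default (class name illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // U+6700 falls in the three-byte range of UTF-8 (U+0800..U+FFFF):
        byte[] utf8 = "最".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 3
        for (byte b : utf8) {
            System.out.printf("%02X ", b); // E6 9C 80
        }
        System.out.println();
    }
}
```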

Note that

new String(bytes, StandardCharsets.UTF_16);

does not "convert it to UTF-16 explicitly" as you assumed it does. This String constructor takes a sequence of bytes that is supposed to be in the encoding given in the second argument, and converts it to the UTF-16 representation of whatever characters those bytes represent in that encoding.

But you have given it a sequence of bytes encoded in UTF-8, and told it to interpret that as UTF-16. This is wrong, and you do not get the character - or the bytes - that you expect.
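This mismatch is exactly where the puzzling 6-byte result comes from. A sketch, assuming the standard behavior that Java's UTF-16 decoder defaults to big-endian when no byte-order mark is present and substitutes U+FFFD for malformed input (class name illustrative):

```java
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "最".getBytes(StandardCharsets.UTF_8); // E6 9C 80 (3 bytes)
        // Misread those UTF-8 bytes as UTF-16: the pair E6 9C decodes to
        // U+E69C (a private-use character), and the lone trailing byte 80
        // is malformed, so it is replaced with U+FFFD (�).
        String wrong = new String(utf8, StandardCharsets.UTF_16);
        System.out.println(wrong.length()); // 2 - and neither character is 最
        // Re-encoding those two characters in UTF-8 takes 3 bytes each:
        System.out.println(wrong.getBytes(StandardCharsets.UTF_8).length); // 6
    }
}
```

Those 6 bytes are the answer to the "Why 6 bytes?" part of the question: they are the UTF-8 encoding of the two garbage characters, not any form of UTF-16.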

You can't tell Java how to internally store strings. It always stores them as UTF-16. The constructor String(byte[],Charset) tells Java to create a UTF-16 string from an array of bytes that is supposed to be in the given character set. The method getBytes(Charset) tells Java to give you a sequence of bytes that represent the string in the given encoding (charset). And the method getBytes() without an argument does the same - but uses your platform's default character set for the conversion.
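The relationship between the two directions is easiest to see in a round trip: encode with getBytes(Charset), then decode the same bytes with the matching charset (class name illustrative):

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    public static void main(String[] args) {
        String s = "最";
        // getBytes(Charset): String -> bytes in the given encoding.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // new String(byte[], Charset): bytes in the given encoding -> String.
        // When the charsets match, the round trip is lossless:
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(back.equals(s)); // true
    }
}
```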

So you misunderstood what getBytes() gives you. It's not the internal representation. You can't get that directly. Only getBytes(StandardCharsets.UTF_16) will give you that, and only because you know that UTF-16 is the internal representation in Java. If a future version of Java decided to represent the characters in a different encoding, then getBytes(StandardCharsets.UTF_16) would not show you the internal representation.
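One more wrinkle, which explains the "4 bytes" observation in the comments below: Java's "UTF-16" charset encoder prepends a byte-order mark (FE FF), so even getBytes(StandardCharsets.UTF_16) does not return the bare internal bytes (class name illustrative):

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // The UTF-16 encoder writes a byte-order mark before the data:
        byte[] withBom = "最".getBytes(StandardCharsets.UTF_16);   // FE FF 67 00
        byte[] noBom   = "最".getBytes(StandardCharsets.UTF_16BE); // 67 00
        System.out.println(withBom.length); // 4 (2-byte BOM + 2-byte character)
        System.out.println(noBom.length);   // 2
    }
}
```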

Edit: in fact, Java 9 introduced just such a change in internal representation of strings, where, by default, strings whose characters all fall in the ISO-8859-1 range are internally represented in ISO-8859-1, whereas strings with at least one character outside that range are internally represented in UTF-16 as before. So indeed, getBytes(StandardCharsets.UTF_16) no longer returns the internal representation.
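The internal change is invisible through the public API, which always speaks in UTF-16 code units regardless of how the bytes are stored. A sketch (internal-representation comments describe the default Java 9+ behavior, which can be disabled with the -XX:-CompactStrings JVM flag):

```java
public class CompactStringsDemo {
    public static void main(String[] args) {
        String latin = "abc"; // stored compactly (1 byte/char) on Java 9+ by default
        String cjk = "最";    // contains a non-Latin-1 character: stored as UTF-16
        // Both behave identically through the API:
        System.out.println(latin.charAt(0)); // 'a', returned as a 16-bit char
        System.out.println(cjk.length());    // 1 (one UTF-16 code unit)
    }
}
```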

Solution 2

As stated above, Java uses UTF-16 as the encoding for character data.

To which it may be added that the set of representable characters is limited to a proper subset of the entire Unicode character set. (I believe Java restricts its character set to the Unicode BMP, all of which fits in two bytes of UTF-16.)

So the encoding applied is indeed UTF-16, but the character set to which it is applied is a proper subset of the entire Unicode character set, and this guarantees that Java always uses two bytes per char in its internal String encoding.

Author: Nitin Bhardwaj
Updated on November 12, 2020

Comments

  • Nitin Bhardwaj
    Nitin Bhardwaj over 3 years

    I've already read the following posts:

    1. What is Java's internal representation for String? Modified UTF-8? UTF-16?
    2. https://docs.oracle.com/javase/8/docs/api/java/lang/String.html

    Now consider the code given below:

    public static void main(String[] args) {
        printCharacterDetails("最");
    }
    
    public static void printCharacterDetails(String character){
        System.out.println("Unicode Value for "+character+"="+Integer.toHexString(character.codePointAt(0)));
        byte[] bytes = character.getBytes();
        System.out.println("The UTF-8 Character="+character+"  | Default: Number of Bytes="+bytes.length);
        String stringUTF16 = new String(bytes, StandardCharsets.UTF_16);
        System.out.println("The corresponding UTF-16 Character="+stringUTF16+"  | UTF-16: Number of Bytes="+stringUTF16.getBytes().length);
        System.out.println("----------------------------------------------------------------------------------------");
    }
    

    When I tried to debug the line character.getBytes() in the code above, the debugger took me into the getBytes() method of String class and then subsequently into the static byte[] encode(char[] ca, int off, int len) method of StringCoding class. The first line of the encode method (String csn = Charset.defaultCharset().name();) returned "UTF-8" as the default encoding during the debugging. I expected it to be "UTF-16".

    The output of the program is:

    Unicode Value for 最=6700
    The UTF-8 Character=最  | Default: Number of Bytes=3
    The corresponding UTF-16 Character=� | UTF-16: Number of Bytes=6

    When I converted it to UTF-16 explicitly in the program, it took 6 bytes to represent the character. Shouldn't it use 2 or 4 bytes for UTF-16? Why were 6 bytes used?

    Where am I going wrong in my understanding? I use Ubuntu 14.04 and the locale command shows the following:

    LANG=en_US.UTF-8
    

    Does this mean that the JVM decides which encoding to use based on the underlying OS, or does it always use UTF-16? Please help me understand the concept.

    • Alohci
      Alohci over 7 years
      Don't confuse the default encoding of getBytes() with Java's internal encoding.
    • Robert
      Robert over 7 years
      There is no way to access the internal representation of a String in Java. Therefore you don't have to care about...
    • Andy Turner
      Andy Turner over 7 years
      character.getBytes(StandardCharsets.UTF_16), if you want the UTF-16 byte representation.
    • Nitin Bhardwaj
      Nitin Bhardwaj over 7 years
      Thanks Andy ! It gives 4 bytes using your line of code.
    • Nitin Bhardwaj
      Nitin Bhardwaj over 7 years
      Hi Alohci, what do we mean by "Java's internal encoding"? Can you please elaborate?
  • RealSkeptic
    RealSkeptic over 7 years
    This is not correct for the current Java versions. In String object, Java represents characters which are outside of the BMP using surrogate pairs (which is part of the definition of UTF-16). So the char type indeed cannot represent a character outside of the BMP, but a Java String most certainly can.
  • Erwin Smout
    Erwin Smout over 7 years
    Interesting. And what happens upon a charAt() or getChars() invocation for such a String ?
  • RealSkeptic
    RealSkeptic over 7 years
    When you write programs that need to be mindful of such characters, you use appropriate methods, such as codePointAt(int) instead of charAt(int), codePointCount(int,int) instead of length() and so on.
  • Mabsten
    Mabsten over 3 years
    However, if I'm not mistaken, the fact that Java uses UTF-16 for char and String encoding is not an internal implementation detail that can remain transparent to the developer, since some Unicode code points will occupy two positions in a String (as RealSkeptic said below), and Java 9 (compact strings) hasn't changed anything from that point of view.
  • RealSkeptic
    RealSkeptic over 3 years
    @Mabsten, the compact 8-bit representation can't represent those characters anyway. But it's still an internal representation. If you use charAt it returns a 16-bit UTF-16 char, regardless of the internal representation.
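RealSkeptic's advice in the thread above - use the code-point methods when characters may fall outside the BMP - can be sketched as follows (class name illustrative):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        String s = "😀"; // U+1F600, outside the BMP
        // length()/charAt() count UTF-16 code units, so the surrogate
        // pair for U+1F600 looks like two "characters":
        System.out.println(s.length()); // 2
        System.out.println(Integer.toHexString(s.charAt(0))); // d83d (high surrogate)
        // The code-point methods see one Unicode character:
        System.out.println(s.codePointCount(0, s.length())); // 1
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}
```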