UTF-8 and UTF-16 in Java


Solution 1

Although Java holds characters internally as UTF-16, when you convert to bytes using String.getBytes(), each character is converted using the default platform encoding, which will likely be something like windows-1252. The results I'm getting are:

utf16 = -30
utf16 = -126
utf16 = -84
utf8 = -30
utf8 = -126
utf8 = -84

This indicates that the default encoding is "UTF-8" on my system.

Also note that the documentation for String.getBytes() has this comment: The behavior of this method when this string cannot be encoded in the default charset is unspecified.

Generally, though, you'll avoid confusion if you always specify an encoding explicitly, as you do with a.getBytes("UTF-8").
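On Java 7 and later you can go a step further and use the java.nio.charset.StandardCharsets constants, which avoid both the checked UnsupportedEncodingException and charset-name typos. A minimal sketch (the class name is mine, just for illustration):

    import java.nio.charset.StandardCharsets;

    public class ExplicitCharsets {
        public static void main(String[] args) {
            String a = "\u20AC"; // the euro sign, as an ASCII-safe escape

            // Explicit charsets give the same bytes on every platform:
            byte[] utf8 = a.getBytes(StandardCharsets.UTF_8);     // E2 82 AC
            byte[] utf16 = a.getBytes(StandardCharsets.UTF_16BE); // 20 AC

            for (byte b : utf8)  System.out.println("utf8  = " + b);
            for (byte b : utf16) System.out.println("utf16 = " + b);
        }
    }

With an explicit UTF-16 charset the two arrays really do differ, which is the output the question was expecting.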

Also, another thing that can cause confusion is including Unicode characters directly in your source file: String a = "€";. That euro symbol has to be encoded to be stored as one or more bytes in a file. When Java compiles your program, it sees those bytes and decodes them back into the euro symbol. You hope. You have to be sure that the software that saves the euro symbol into the file (Notepad, Eclipse, etc.) encodes it the same way as Java expects when it reads it back in. UTF-8 is becoming more popular, but it is not universal, and many editors will not write files in UTF-8.
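If you can't control every editor that touches the file, one way to sidestep the problem is to keep the source pure ASCII and write non-ASCII characters as Unicode escapes, which javac processes regardless of the file's encoding:

    // Both create the same one-character string, but the second survives
    // any source-file encoding mix-up because it is plain ASCII:
    String direct  = "€";       // editor and compiler must agree on the encoding
    String escaped = "\u20AC";  // always the euro sign, whatever the file encoding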

Solution 2

One curiosity: I wonder how the JVM knows the original default charset ...

The mechanism that the JVM uses to determine the initial default charset is platform-specific. On UNIX / UNIX-like systems, it is determined by the LANG and LC_* environment variables; see man locale.
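If you want to see what your JVM actually picked up from the environment, a quick check (the output naturally varies from system to system):

    import java.nio.charset.Charset;

    public class ShowDefaultCharset {
        public static void main(String[] args) {
            // The system property the JVM derives from the locale at startup:
            System.out.println(System.getProperty("file.encoding"));
            // The same information, as a Charset object:
            System.out.println(Charset.defaultCharset());
        }
    }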


Ermmm... this command is used to check what the default charset is on a specific OS?

That is correct. But I told you about it because the manual entry describes how the default encoding is determined by the environment variables.

In retrospect, this may not have been what you meant by your original comment, but this IS how the platform default encoding is specified. (And the concept of a "default character set" for an individual file is meaningless; see below.)

What if, let's say, I have 10 Java source files, half of them saved as UTF-8 and the rest saved as UTF-16. After compiling, I move them (the class files) onto another OS platform; now how does the JVM know their default encoding? Will the default charset information be included in the Java class file?

That is a rather confused set of questions:

  1. A text file doesn't have a default character set. It has a character set / encoding.

  2. A non-text file doesn't have a character encoding at all. The concept is meaningless.

  3. There's no 100% reliable way to determine what a text file's character encoding is.

  4. If you don't tell the Java compiler what the file's encoding is, it will assume that it is the platform's default encoding. The compiler doesn't try to second-guess you. If you get the encoding wrong, the compiler may or may not even notice your mistake.

  5. Bytecode (".class") files are binary files (see 2).

  6. When Character and String literals are compiled into a ".class" file, they are NOW represented in a way that is not affected by the platform default encoding, or anything else that you can influence.

  7. If you made a mistake with the source file encoding when compiling, you can't fix it at the ".class" file level. Your only option is to go back and recompile the classes, telling the Java compiler the correct source file encoding (see the javac example after this list).

  8. "What if let say I have 10 Java source file, half of them save as UTF-8 and the rest save as UTF-16".
    Just don't do it!

    • Don't save your source files in a mixture of encodings. You will drive yourself nuts.
    • I can't thing of a good reason to store files in UTF-16 at all ...
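Point 7 in practice: the -encoding option is how you tell javac what encoding the source files were saved in (Main.java here is just a placeholder name):

    javac -encoding UTF-8 Main.java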

So, I am confused: when people say "platform dependent", is it related to the source file?

Platform dependent means that it potentially depends on the operating system, the JVM vendor and version, the hardware, and so on.

It is not necessarily related to the source file. (The encoding of any given source file could be different to the default character encoding.)

If it is not, how do I explain the phenomena above? Anyway, the confusion above extends my question into "so, what happens after I compile the source file into a class file? Because the class file might not contain the encoding information, is the result now really dependent on the 'platform' and not on the source file anymore?"

The platform-specific mechanism (e.g. the environment variables) determines what the Java compiler sees as the default character set. Unless you override this (e.g. by providing options to the Java compiler on the command line), that is what the Java compiler will use as the source file character set. However, this may not be the correct character encoding for the source files; e.g. if you created them on a different machine with a different default character set. And if the Java compiler uses the wrong character set to decode your source files, it is liable to put incorrect character codes into the ".class" files.

The ".class" files are no platform dependent. But if they are created incorrectly because you didn't tell the Java compiler the correct encoding for the source files, the ".class" files will contain the wrong characters.


What do you mean by "the concept of a 'default character set' for an individual file is meaningless"?

I say it because it is true!

The default character set MEANS the character set that is used when you don't specify one.
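The same pattern runs right through the Java I/O APIs: every constructor that takes no charset silently uses the default one. A sketch (the file name in.txt is only a placeholder):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;

    public class DefaultVsExplicit {
        public static void main(String[] args) throws IOException {
            // No charset given: the platform default is used, whatever it is.
            Reader implicitReader = new InputStreamReader(new FileInputStream("in.txt"));

            // Charset given: behaves identically on every platform.
            Reader explicitReader = new InputStreamReader(
                    new FileInputStream("in.txt"), StandardCharsets.UTF_8);

            implicitReader.close();
            explicitReader.close();
        }
    }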

But we can control how we want a text file to be stored, right? Even using Notepad, there is an option to choose between encodings.

That is correct. And that is you TELLING Notepad what character set to use for the file. If you don't TELL it, Notepad will use the default character set to write the file.

There is a little bit of black magic in Notepad to GUESS what the character encoding is when it reads a text file. Basically, it looks at the first few bytes of the file to see if it starts with a byte-order mark. If it sees one, it can heuristically distinguish between UTF-16, UTF-8 (generated by a Microsoft product), and "other". But it cannot distinguish between the different "other" character encodings, and it doesn't recognize a file as UTF-8 if it doesn't start with a BOM. (The BOM on a UTF-8 file is a Microsoft-specific convention ... and causes problems if a Java application reads the file and doesn't know to skip the BOM character.)
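For the curious, that kind of BOM sniffing is easy to reproduce; a rough sketch (this is illustrative, not Notepad's actual algorithm):

    import java.io.FileInputStream;
    import java.io.IOException;

    public class BomSniffer {
        static String guess(String path) throws IOException {
            try (FileInputStream in = new FileInputStream(path)) {
                int b0 = in.read(), b1 = in.read(), b2 = in.read();
                if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE (BOM)";
                if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE (BOM)";
                if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8 (BOM)";
                return "no BOM: could be UTF-8, windows-1252, ISO-8859-1, ...";
            }
        }

        public static void main(String[] args) throws IOException {
            System.out.println(guess(args[0]));
        }
    }

Note that the absence of a BOM proves nothing, which is exactly why there is no 100% reliable detection.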

Anyway, the problems are not in writing the source file. They happen when the Java compiler reads the source file with the incorrect character encoding.

Solution 3

You are working with a bad hypothesis. The getBytes() method does not use the UTF-16 encoding. It uses the platform default encoding.

You can query it with the java.nio.charset.Charset.defaultCharset() method. In my case, it's UTF-8 and should be the same for you too.

Solution 4

The default is either UTF-8 or ISO-8859-1 if a platform-specific encoding is not found; it is not UTF-16. So in the end you are doing the byte conversion in UTF-8 in both cases, and that is why your byte[] contents match. You can find the default encoding using:

 import java.nio.charset.Charset;

 System.out.println(Charset.defaultCharset().name());

Comments

  • Sam YC
    Sam YC over 1 year

I really expected that the byte data below would show differently, but in fact they are the same. According to the wiki http://en.wikipedia.org/wiki/UTF-8#Examples , the encodings look different at the byte level, so why does Java print them out as the same?

        String a = "€";
        byte[] utf16 = a.getBytes(); //Java default UTF-16
        byte[] utf8 = null;
    
        try {
            utf8 = a.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    
        for (int i = 0 ; i < utf16.length ; i ++){
            System.out.println("utf16 = " + utf16[i]);
        }
    
        for (int i = 0 ; i < utf8.length ; i ++){
            System.out.println("utf8 = " + utf8[i]);
        }
    
  • Sam YC
    Sam YC over 11 years
    It is getting clearer now, thanks. One curiosity: I wonder how the JVM knows the original default charset (I am using Eclipse, saving as UTF-8, in this case) after it has already stored the string as UTF-16 in the JVM? Especially once we have compiled the source file and generated the class file, there is no clue linking it back to the source file to know its original encoding, right? Just curious.
  • Sam YC
    Sam YC over 11 years
    Ermmm... this command is used to check what the default charset is on a specific OS? What if, let's say, I have 10 Java source files, half of them saved as UTF-8 and the rest saved as UTF-16. After compiling, I move them (the class files) onto another OS platform; now how does the JVM know their default encoding? Will the default charset information be included in the Java class file?
  • Sam YC
    Sam YC over 11 years
    Of course, I assume that I use some hardcoded string, like String a = "€"; in the example mentioned above.
  • Sam YC
    Sam YC over 11 years
    Pardon, my question is confusing because I am quite confused. Anyway, I followed the answer from AmitD and used "System.out.println(Charset.defaultCharset().name());" to print out the default charset. After that, I changed the file properties in Eclipse (right-click the project and select 'Resource'), and the default charset changed.
  • Sam YC
    Sam YC over 11 years
    So, I am confused: when people say "platform dependent", is it related to the source file? If it is not, how do I explain the phenomena above? Anyway, the confusion above extends my question into "so, what happens after I compile the source file into a class file? Because the class file might not contain the encoding information, is the result now really dependent on the 'platform' and not on the source file anymore?".
  • Sam YC
    Sam YC over 11 years
    What do you mean by "the concept of a 'default character set' for an individual file is meaningless"? But we can control how we want a text file to be stored, right? Even using Notepad, there is an option to choose between encodings.