UTF-8 encoding: only some Japanese characters are not getting converted


Solution 1

Try starting Tomcat (the JVM) with the system property file.encoding set to UTF-8, e.g.: -Dfile.encoding=UTF-8
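Before relying on this flag, it may help to confirm which default charset the JVM actually picked up. The following is a minimal sketch; the class name DefaultCharsetCheck is made up for illustration. Note that since JDK 18 the default charset is UTF-8 unless overridden, so the flag mainly matters on older JVMs.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Returns the charset the JVM uses when none is passed explicitly,
    // e.g. by String.getBytes() or new String(byte[]).
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Running this inside the Tomcat JVM (for example from a test servlet) shows whether the startup parameter took effect.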

Solution 2

You are mixing concepts here.

A String is just a sequence of characters (chars); a String in itself has no encoding at all. For what it's worth, replace characters in the above with carrier pigeons. Same thing. A carrier pigeon has no encoding. Neither does a char. (1)

What you are doing here:

new String(x.getBytes(), "UTF-8")

is a "poor man's encoding/decoding process". You will probably have noticed that there are two versions of .getBytes(): one where you pass a charset as an argument and the other where you don't.

If you don't, and that is what happens here, it means you will get the result of the encoding process using your default character set; and then you try and re-decode this byte sequence using UTF-8.

Don't do that. Just take in the string as it comes. If, however, you have trouble reading the original byte stream into a string, it means you use a Reader with the wrong charset. Fix that part.
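As a rough illustration of "fixing the Reader", the sketch below decodes a byte stream with an explicitly named charset instead of the platform default. It assumes the incoming bytes really are UTF-8; the class and method names (ExplicitReaderDemo, readUtf8) are hypothetical and not taken from the question.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ExplicitReaderDemo {
    // Decodes the whole stream as UTF-8, independent of the platform default.
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u3072\u307B" is "ひほ"; escapes avoid depending on the source file encoding.
        byte[] wire = "\u3072\u307B".getBytes(StandardCharsets.UTF_8);
        String text = readUtf8(new ByteArrayInputStream(wire));
        System.out.println(text.equals("\u3072\u307B")); // true
    }
}
```

Once the text has been decoded correctly at this boundary, no further getBytes()/new String(...) conversion is needed.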

For more information, read this link.

(1) The fact that a char is actually a UTF-16 code unit is irrelevant to this discussion.

Solution 3

I concur with @fge.

Clarification

In Java, String/char/Reader/Writer handle (Unicode) text and can combine all the scripts of the world.

byte[]/InputStream/OutputStream, on the other hand, hold binary data, which needs an explicitly stated encoding to be converted to a String.

In your case japaneseString should already be a correct String, or be substituted by the original byte[].

Traps in Java

The encoding is often an optional parameter, which then defaults to the platform encoding. You fell into that trap too:

import java.nio.charset.StandardCharsets;

String s = "...";
byte[] b1 = s.getBytes(); // Platform encoding, non-portable.
byte[] b2 = s.getBytes("UTF-8"); // Explicit, but throws the checked UnsupportedEncodingException
byte[] b3 = s.getBytes(StandardCharsets.UTF_8); // Explicit,
                         //  better (for UTF-8, ISO-8859-1)

In general, avoid the overloaded methods without an encoding parameter, as they are only suitable for data local to the current computer: non-portable. For completeness: the classes FileReader/FileWriter are best avoided too, as before Java 11 they provided no encoding parameter at all.
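For file I/O, one way to stay explicit is the java.nio.file.Files API, which accepts a charset. The sketch below writes and re-reads a temporary file as UTF-8; the class name ExplicitFileCharsetDemo, the method writeThenRead, and the temp file prefix are assumptions made for the example only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitFileCharsetDemo {
    // Writes a single line of text as UTF-8, then reads it back as UTF-8.
    static String writeThenRead(String text) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                try (BufferedReader reader =
                         Files.newBufferedReader(tmp, StandardCharsets.UTF_8)) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u5B58\u5728"; // "存在", written as escapes
        System.out.println(writeThenRead(text).equals(text)); // true
    }
}
```

Because both the write and the read name the charset, the result does not change when the program moves to a machine with a different platform encoding.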

Error

japaneseString is already wrong, so you have to read it correctly in the first place. It could have been read erroneously as Windows-1252 (Windows Latin-1) and suffered when recoding to UTF-8. Evidently only some cases get messed up.

Maybe you had:

String japaneseString = new String(bytes);

instead of:

String japaneseString = new String(bytes, StandardCharsets.UTF_8);

At the end:

String name = japaneseString;

Show the code for reading japaneseString for further help.
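The Windows-1252 hypothesis can be tried out with a small sketch. It assumes, without confirmation from the question, that the UTF-8 bytes were first decoded as windows-1252 and that the platform default used by getBytes() was also windows-1252. The class and method names are hypothetical. Java's windows-1252 decoder has no mapping for a few bytes such as 0x81, so characters whose UTF-8 form contains such a byte cannot survive the round trip, while others can.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates the suspected pipeline: UTF-8 bytes are wrongly decoded as
    // windows-1252, then "repaired" via getBytes(windows-1252) + UTF-8 decoding.
    static String misreadThenRepair(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, CP1252);    // the faulty read (assumed)
        byte[] reencoded = misread.getBytes(CP1252);  // what getBytes() does on such a platform
        return new String(reencoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String re = "\u308C"; // "れ" = E3 82 8C, every byte is mapped in windows-1252
        String hi = "\u3072"; // "ひ" = E3 81 B2, 0x81 has no mapping in windows-1252
        System.out.println(misreadThenRepair(re).equals(re)); // true
        System.out.println(misreadThenRepair(hi).equals(hi)); // false
    }
}
```

Under these assumptions the unmapped byte is lost during the faulty read and comes back as '?', which would be consistent with the question's output, where れ and よ survive but ひ, ほ and う do not.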

Author: Janak

Updated on August 12, 2022

Comments

  • Janak, over 1 year ago

    I am getting a parameter value from a Jersey web service, and the value is in Japanese characters.

    Here, 'japaneseString' is the web service parameter containing the Japanese text.

       String name = new String(japaneseString.getBytes(), "UTF-8");
    

    However, I am able to convert a few string literals successfully, while others cause problems.

    The following were successfully converted:

     1) アップル
     2) 赤
     3) 世丕且且世两上与丑万丣丕且丗丕
     4) 世世丗丈
    

    While these didn't:

     1) ひほわれよう
     2) 存在する
    

    When I investigated further, I found that these 2 strings are converted into junk characters.

     1) Input: ひほわれよう        Output : �?��?��?れよ�?�
     2) Input: 存在する            Output: 存在�?�る
    

    Any idea why some of the Japanese characters are not converted properly?

    Thanks.