Japanese Character Encoding in Java

25,225

Solution 1

Your changeCharset method seems strange. String objects in Java are best thought of as not having a specific character set. They use Unicode and so can represent all characters, not only one regional subset. Your method says: turn the string into bytes using my system's default character set (whatever that may be), and then try to interpret those bytes using some other character set (specified in newCharset), which therefore probably won't work. If you convert to bytes in an encoding, you should read those bytes with the same encoding.
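To make the point above concrete, here is a minimal sketch, not the asker's code. The class name RoundTripDemo and the helper recode are hypothetical, and it assumes the JDK in use ships the Shift_JIS charset (standard JDK builds normally do). It encodes a Japanese string and decodes it again, first with matching charsets and then with mismatched ones.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Hypothetical helper: encode with one charset, decode with another.
    static String recode(String s, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u65E5\u672C\u8A9E"; // the three kanji for "Japanese language"
        Charset sjis = Charset.forName("Shift_JIS"); // assumed to be available

        // Same charset in both directions: the text survives.
        System.out.println(original.equals(recode(original, sjis, sjis)));

        // Encode as UTF-8 but decode as Shift_JIS: the text is corrupted.
        System.out.println(original.equals(recode(original, StandardCharsets.UTF_8, sjis)));

        // Encode into a charset with no Japanese characters: each character
        // is replaced by '?' at encoding time, and nothing can bring it back.
        System.out.println(recode(original, StandardCharsets.ISO_8859_1,
                                  StandardCharsets.ISO_8859_1));
    }
}
```

The last case is one plausible source of "???" output: if the platform default charset cannot represent Japanese, str.getBytes() has already discarded the characters before new String(bs, newCharset) runs.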

Update:

To convert a String to Shift-JIS (a regional encoding commonly used in Japan) you can say:

byte[] jis = str.getBytes("Shift_JIS");

If you write those bytes into a file, and then open the file in Notepad on a Windows computer where the regional settings are all Japan-centric, Notepad will display it in Japanese (having nothing else to go on, it will assume the text is in the system's local encoding).

However, you could equally well save it as UTF-8 (prefixed with the 3-byte UTF-8 byte order mark, EF BB BF) and Notepad will also display it as Japanese. Shift-JIS is only one way of representing Japanese text as bytes.
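As a rough sketch of the two options just described, the following writes the same string once as Shift-JIS and once as UTF-8 with a leading byte order mark. The class name, helper names and file names are made up for illustration, and the Shift_JIS charset is assumed to be available in the JDK.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WriteJapaneseFiles {
    // UTF-8 byte order mark: EF BB BF
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Encode the text as Shift-JIS bytes.
    static byte[] toShiftJis(String s) {
        return s.getBytes(Charset.forName("Shift_JIS"));
    }

    // Encode the text as UTF-8 and put the BOM in front of it.
    static byte[] toUtf8WithBom(String s) {
        byte[] body = s.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        return out;
    }

    public static void main(String[] args) throws IOException {
        String text = "\u65E5\u672C\u8A9E";
        // Hypothetical file names, written to the current directory.
        Files.write(Paths.get("sample-sjis.txt"), toShiftJis(text));
        Files.write(Paths.get("sample-utf8.txt"), toUtf8WithBom(text));
    }
}
```

Both files hold the same text; only the byte representation differs (6 bytes for Shift-JIS here, 3 + 9 bytes for UTF-8 with the mark).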

Solution 2

I suspect you shouldn't be doing this in the first place. If it really is Apache POI's fault, then you'll need to get the original raw bytes from the data, not just use the system default encoding.

On the other hand, I think it's entirely likely that Apache POI has managed to do the right thing, and it's just an output problem. I suggest you dump the original string you've got (removing your encoding method entirely) in terms of its Unicode code points, e.g.

 for (int i = 0; i < text.length(); i++) {
     System.out.println("U+" + Integer.toHexString(text.charAt(i)));
 }

Then check those Unicode values against the ones at the Unicode web site.
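If looking each value up by hand is tedious, a small variation can print the JDK's own name for every code point. This is only a sketch: CodePointDump and describe are hypothetical names, and the names come from the Unicode data bundled with the running JDK, which may lag behind the Unicode web site.

```java
import java.util.Locale;

public class CodePointDump {
    // Build one line per code point: "U+XXXX NAME".
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp ->
            sb.append(String.format(Locale.ROOT, "U+%04X %s%n",
                                    cp, Character.getName(cp))));
        return sb.toString();
    }

    public static void main(String[] args) {
        // A real hiragana letter followed by a literal question mark.
        System.out.print(describe("\u3042?"));
    }
}
```

If the dump of the string from POI shows U+003F QUESTION MARK, the characters were already lost before printing. If it shows hiragana, katakana or CJK code points, the string is fine and the problem is on the output side.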

Author: Allan Jiang

Updated on October 13, 2020

Comments

  • Allan Jiang
    Allan Jiang over 3 years

    Here's my problem. I'm using Java Apache POI to read an Excel (.xls or .xlsx) file and display the contents. There are some Japanese characters in the spreadsheet, and all of them come out as "???" in my output. I tried Shift-JIS, UTF-8 and many other encodings, but none of them work. Here's my encoding code:

    public String encoding(String str) throws UnsupportedEncodingException{
      String Encoding = "Shift_JIS";
      return this.changeCharset(str, Encoding);
    }
    public String changeCharset(String str, String newCharset) throws UnsupportedEncodingException {
      if (str != null) {
        byte[] bs = str.getBytes();
        return new String(bs, newCharset);
      }
      return null;
    }
    

    I am passing every string I get to encoding(str). But when I print the return value, it is still something like "???" (as shown below) rather than Japanese characters (Hiragana, Katakana or Kanji).

    title-jp=???
    

    Can anyone help me with this? Thank you so much.

  • Allan Jiang
    Allan Jiang over 12 years
    So can you suggest how to convert a given String to a Japanese encoding? Many thanks.
  • Voo
    Voo over 12 years
    Yep, if he's using the windows cmdline to output the chars, that would explain the problems. If he's using eclipse or another IDE that shouldn't happen though.
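One way to test the console theory raised in the comments is to print through a stream with an explicit encoding instead of relying on the platform default. The following is a hedged sketch: Utf8Console and utf8Stream are hypothetical names, and whether the characters actually render still depends on the terminal's code page and font (on Windows, for example, the console would need to be set to UTF-8).

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    // Wrap any output stream so that text is encoded as UTF-8 on the way out.
    static PrintStream utf8Stream(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8Stream(System.out);
        // Sample value only; in the real program this would be the string read by POI.
        out.println("title-jp=\u65E5\u672C\u8A9E");
    }
}
```

If this prints correctly in an IDE console but not in the Windows command prompt, the data is intact and only the terminal is at fault.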