Java String Unicode Value

Solution 1

Some Unicode characters span two Java chars. To quote http://docs.oracle.com/javase/tutorial/i18n/text/unicode.html:

The characters with values that are outside of the 16-bit range, and within the range from 0x10000 to 0x10FFFF, are called supplementary characters and are defined as a pair of char values.
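A small sketch of what the quoted passage means in practice. The class name SupplementaryDemo, its two helper methods, and the sample code point U+1F61C are picked purely for illustration; the calls themselves (String.length, String.codePointCount, Character.toChars) are standard java.lang APIs.

```java
class SupplementaryDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F61C lies outside the 16-bit range, so it needs a surrogate pair.
        String face = new String(Character.toChars(0x1F61C));
        System.out.println(charLength(face));      // 2
        System.out.println(codePointLength(face)); // 1
        // The two chars are the high and low surrogate: d83d de1c
        System.out.printf("%04x %04x%n", (int) face.charAt(0), (int) face.charAt(1));
    }
}
```

So a loop over individual chars sees two values for this one character, while a loop over code points sees one.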

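A correct way to escape non-ASCII characters, iterating by code point and writing each one as four-digit \u escapes (a surrogate pair for supplementary characters):

private static String escapeNonAscii(String str) {

  StringBuilder retStr = new StringBuilder();
  // Advance by whole code points so a surrogate pair is read as one character.
  for (int i = 0; i < str.length(); ) {
    int cp = str.codePointAt(i);
    i += Character.charCount(cp); // 2 for supplementary characters, otherwise 1

    if (cp < 128) {
      retStr.appendCodePoint(cp);
    } else {
      // A Java unicode escape takes exactly four hex digits (one UTF-16 code unit),
      // so a supplementary code point is written as its two surrogates.
      for (char unit : Character.toChars(cp)) {
        retStr.append(String.format("\\u%04x", (int) unit));
      }
    }
  }
  return retStr.toString();
}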

Solution 2

This method converts an arbitrary String to an ASCII-safe representation to be used in Java source code (or properties files, for example):

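import java.util.Formatter;

public String escapeUnicode(String input) {
  StringBuilder b = new StringBuilder(input.length());
  Formatter f = new Formatter(b); // formats directly into b
  for (char c : input.toCharArray()) {
    if (c < 128) {
      b.append(c); // ASCII passes through unchanged
    } else {
      // One escape per UTF-16 code unit, so a supplementary character
      // comes out as two escapes (its surrogate pair).
      f.format("\\u%04x", (int) c);
    }
  }
  return b.toString();
}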
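The question quoted in the comments asks for "Hi" to come out as \uXXXX\uXXXX, that is, with the ASCII characters escaped as well. A minimal sketch of that variant, assuming every UTF-16 code unit should be escaped; the names EscapeAll and escapeAll are hypothetical and not part of either solution above.

```java
class EscapeAll {
    // Writes every UTF-16 code unit, ASCII included, as a backslash,
    // the letter u, and four lowercase hex digits.
    static String escapeAll(String input) {
        StringBuilder b = new StringBuilder(input.length() * 6);
        for (int i = 0; i < input.length(); i++) {
            b.append(String.format("\\u%04x", (int) input.charAt(i)));
        }
        return b.toString();
    }

    public static void main(String[] args) {
        // Prints the escapes for U+0048 and U+0069.
        System.out.println(escapeAll("Hi"));
    }
}
```

Dropping the c < 128 branch is the only difference from Solution 2; which of the two is wanted depends on whether the output should stay readable for ASCII text.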
Author: user489041

Updated on January 08, 2020

Comments

  • user489041
    user489041 over 4 years

    How can I get the Unicode value of a string in Java?

    For example, if the string is "Hi", I need something like \uXXXX\uXXXX

  • tchrist
    tchrist about 13 years
    @user489041: I disagree: The right way to do this is to compile with javac -encoding UTF-8. No mess, no fuss. This is especially because 20 years on, Java still has no standard way to talk about code points by their official names. That means you are trying to insert evil and mysterious magic numbers in your code. That is not a good thing! Sure, I might rather see "\N{GREEK SMALL LETTER ALPHA}" than "α", but I SURELY do not want to see "\u03B1"! That’s just wicked. How are you going to maintain that kind of crudola?
  • Martin
    Martin over 11 years
    Only 4 Digits? Unicode is a 32bit character set and the OP spoke of Japanese.
  • Joachim Sauer
    Joachim Sauer over 11 years
    @Martin: 1.) Strictly speaking, "Unicode" is not an n-bit character set for any value of n. 2.) Most Japanese characters fall into the Basic Multilingual Plane (the first 64k Unicode code points) and can be represented with just 4 hexadecimal digits. 3.) The Unicode escapes in Java use UTF-16, so if you have to represent anything outside the BMP, you'll have to use two \u escapes (with the correct surrogate values), which is incidentally what my code does, because a char is really a UTF-16 code unit and not a Unicode code point (the two coincide iff the character is in the BMP).
  • Josejulio
    Josejulio over 11 years
    I used this to obfuscate some strings (yeah, just making them hard to read)
  • Robin Royal
    Robin Royal over 9 years
    @JoachimSauer Thanks, man. It worked like a charm. Please tell me how to decode this message when I get it back from the server. This is what it returns: "\ud83d\ude1c".
  • silentsudo
    silentsudo over 5 years
    This worked perfectly without any change. Thank you so much.
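One of the comments above asks how to decode text such as \ud83d\ude1c once it comes back from a server. A rough sketch of one way to do that, assuming the input contains only well-formed escapes of a backslash, the letter u, and exactly four hex digits; the names UnescapeDemo and unescape are hypothetical, and escaped backslashes in front of a u are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnescapeDemo {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each escape with the UTF-16 code unit it names.
    // Two consecutive surrogate escapes naturally form one supplementary character.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());                  // text before the escape
            out.append((char) Integer.parseInt(m.group(1), 16)); // the decoded code unit
            last = m.end();
        }
        out.append(input, last, input.length());                 // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = unescape("\\ud83d\\ude1c");
        // Two chars that together form the single code point 1f61c.
        System.out.println(Integer.toHexString(decoded.codePointAt(0)));
    }
}
```

If the server actually sends JSON, a JSON parser will normally do this decoding already, so a hand-written step like this is only needed for raw text.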