How to convert a string with Unicode encoding to a string of letters

Solution 1

Technically doing:

String myString = "\u0048\u0065\u006C\u006C\u006F World";

automatically converts it to "Hello World", so I assume you are reading the string in from a file. To convert it to "Hello", you'll have to parse the text into the separate Unicode escapes (take each \uXXXX and keep just XXXX), then call Integer.parseInt(XXXX, 16) to get the numeric value of the hex digits, and then cast that to char to get the actual character.

Edit: Some code to accomplish this:

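// Assumes myString holds literal backslash-u escapes (e.g. read from a file).
String str = myString.split(" ")[0];   // keep only the escaped first word
str = str.replace("\\", "");           // drop the backslashes
String[] arr = str.split("u");         // arr[0] is empty; the rest are 4-digit hex codes
StringBuilder text = new StringBuilder();
for (int i = 1; i < arr.length; i++) {
    int hexVal = Integer.parseInt(arr[i], 16);
    text.append((char) hexVal);
}
// text.toString() is now "Hello"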

Solution 2

The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.

import org.apache.commons.lang.StringEscapeUtils;

@Test
public void testUnescapeJava() {
    String sJava="\\u0048\\u0065\\u006C\\u006C\\u006F";
    System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava));
}


 output:
 StringEscapeUtils.unescapeJava(sJava):
 Hello

Solution 3

You can use StringEscapeUtils from Apache Commons Lang, e.g.:

String title = StringEscapeUtils.unescapeJava("\\u0048\\u0065\\u006C\\u006C\\u006F");

Solution 4

This simple method will work for most cases, but it trips up on input such as "\u005Cu0048", which should decode to the string "\u0048" but actually decodes to "H": the first pass produces "\u0048" as the working string, which the while loop then processes again.

static final String decode(final String in)
{
    String working = in;
    int index;
    index = working.indexOf("\\u");
    while(index > -1)
    {
        int length = working.length();
        if(index > (length-6))break;
        int numStart = index + 2;
        int numFinish = numStart + 4;
        String substring = working.substring(numStart, numFinish);
        int number = Integer.parseInt(substring,16);
        String stringStart = working.substring(0, index);
        String stringEnd   = working.substring(numFinish);
        working = stringStart + ((char)number) + stringEnd;
        index = working.indexOf("\\u");
    }
    return working;
}
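
One way to avoid the double-decoding problem described above is to scan the input once, left to right, so that decoded output is never re-examined. The following is only a sketch using java.util.regex from the standard library. The class name UnicodeUnescaper and the method name unescape are made up for illustration, and it assumes every escape is a backslash, a "u", and exactly four hex digits. Anything else (including a truncated escape) is left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
public class UnicodeUnescaper {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    public static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Decode the four hex digits into one UTF-16 code unit.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded backslash or dollar sign literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        // Copy whatever follows the last escape unchanged.
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the matcher only ever moves forward through the original input, a decoded backslash is never combined with the text after it to form a new escape. Surrogate pairs written as two consecutive escapes come out as two adjacent chars, which is how Java strings store them anyway.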

Solution 5

Shorter version:

public static String unescapeJava(String escaped) {
    if(escaped.indexOf("\\u")==-1)
        return escaped;

    String processed="";

    int position=escaped.indexOf("\\u");
    while(position!=-1) {
        if(position!=0)
            processed+=escaped.substring(0,position);
        String token=escaped.substring(position+2,position+6);
        escaped=escaped.substring(position+6);
        processed+=(char)Integer.parseInt(token,16);
        position=escaped.indexOf("\\u");
    }
    processed+=escaped;

    return processed;
}

Author: SharonBL

Updated on September 17, 2021

Comments

  • SharonBL
    SharonBL almost 3 years

    I have a string with escaped Unicode characters, \uXXXX, and I want to convert it to regular Unicode letters. For example:

    "\u0048\u0065\u006C\u006C\u006F World"
    

    should become

    "Hello World"
    

    I know that when I print the first string it already shows Hello World. My problem is that I read file names from a file and then search for them. The file names in that file are escaped with Unicode escapes, and when I search for the files I can't find them, because the search looks for a file with \uXXXX in its name.

  • SharonBL
    SharonBL about 12 years
    Seems that might be the solution. Do you have an idea how I can do it in Java? Can I do it with String.replaceAll or something like that?
  • NominSim
    NominSim about 12 years
    @SharonBL I updated with some code, should at least give you an idea of where to start.
  • SharonBL
    SharonBL about 12 years
    Thank you very much for your help! I also found another solution for that: String s = StringEscapeUtils.unescapeJava("\\u20ac\\n"); it does the work!
  • Shreyansh Shah
    Shreyansh Shah about 9 years
    String sJava="\u0048\\u0065\u006C\u006C\u006F"; -----> Please do simple change.
  • Joseph Mekwan
    Joseph Mekwan over 8 years
    After adding the dependency in build.gradle: compile 'commons-lang:commons-lang:2.6', the above works fine.
  • Scott Carey
    Scott Carey over 7 years
    This does not work for surrogate pairs at all but is ok for ASCII or 'low' code points. Edit: now that I think about it a bit more, it will work OK with surrogate pairs too.
  • Eugene Lebedev
    Eugene Lebedev over 6 years
    This is an attempt to reinvent methods already provided by the standard Java library. Just check the pure implementation: stackoverflow.com/a/39265921/1511077
  • andrew pate
    andrew pate over 6 years
    Thanks @EvgenyLebedev ... the standard library way looks good and presumably has been thoroughly tested, much appreciated.
  • Pedro Lobito
    Pedro Lobito about 6 years
    I'm always amazed when a "reinvent the wheel" answer gets so many votes.
  • Mohsen Abasi
    Mohsen Abasi almost 5 years
    Does not work when there are non-Unicode characters inside the string, such as: href=\u0022\/en\/blog\/d-day-protecting-europe-its-demons\u0022\u003E\n
  • rustyx
    rustyx over 4 years
    @PedroLobito that's because the linked post does absolutely nothing. myString.getBytes("UTF8") and then back to String does nothing.
  • rustyx
    rustyx over 4 years
    String(string.toByteArray()) achieves literally nothing.
  • Eugene Lebedev
    Eugene Lebedev over 4 years
    @rustyx Method toByteArray() has default argument with Charsets.UTF_8. Then you create a string from bytearray with required encoding. I did test today with windows-1251 to utf-8, it works. Also i did comparison at byte level :)
  • Eugene Lebedev
    Eugene Lebedev over 4 years
    I agree with rustyx because getBytes should contain source encoding as argument, not utf-8. After that string should be created with required encoding from ByteArray.
  • Евгений Шевченко
    Евгений Шевченко over 2 years
    You've just saved my day!
  • Austin Haws
    Austin Haws about 2 years
    It appears StringEscapeUtils is now located in org.apache.commons.text.StringEscapeUtils