How to convert a string with Unicode encoding to a string of letters

java unicode encoding

277,609

Solution 1

Technically doing:

String myString = "\u0048\u0065\u006C\u006C\u006F World";

automatically converts it to "Hello World", so I assume you are reading the string in from some file. In that case the backslashes are literal characters, and to convert the text to "Hello" you'll have to parse it into the separate Unicode escapes: take each \uXXXX and keep just the XXXX, call Integer.parseInt(XXXX, 16) to get the numeric value, and then cast that to char to get the actual character.

Edit: Some code to accomplish this:

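// Assumes myString was read from a file, so it holds literal backslash-u escapes.
String str = myString.split(" ")[0];  // keep only the escaped first word
str = str.replace("\\","");           // drop the backslashes, leaving "u0048u0065..."
String[] arr = str.split("u");        // arr[0] is empty; the hex digits start at arr[1]
String text = "";
for(int i = 1; i < arr.length; i++){
    int hexVal = Integer.parseInt(arr[i], 16);
    text += (char)hexVal;
}
// text now holds "Hello"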
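
If the input may hold several escaped words, or ordinary text mixed in with the escapes, a regular expression avoids the split-based bookkeeping. What follows is only a sketch and not part of the original answer: the class name RegexUnescaper and the method name decode are made up for illustration, and it assumes every escape is a backslash, a u, and exactly four hex digits. Anything that does not match that shape is copied through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
final class RegexUnescaper {
    // A literal backslash, the letter u, then exactly four hex digits (captured).
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String decode(String in) {
        Matcher m = ESCAPE.matcher(in);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops a decoded backslash or dollar sign
            // from being interpreted as replacement syntax.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out); // copy whatever follows the last escape
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("\\u0048\\u0065\\u006C\\u006C\\u006F World"));
    }
}
```

Because each escape becomes one char, a surrogate pair written as two consecutive escapes comes out as two chars that together form a single code point.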

Solution 2

The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.

import org.apache.commons.lang.StringEscapeUtils;

@Test
public void testUnescapeJava() {
    String sJava="\\u0048\\u0065\\u006C\\u006C\\u006F";
    System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava));
}


 output:
 StringEscapeUtils.unescapeJava(sJava):
 Hello

Solution 3

You can use StringEscapeUtils from Apache Commons Lang, i.e.:

String title = StringEscapeUtils.unescapeJava("\\u0048\\u0065\\u006C\\u006C\\u006F");

Solution 4

This simple method will work for most cases, but it trips up on input like "\u005Cu0048", which should decode to the string "\u0048" but actually decodes to "H": the first pass produces "\u0048" as the working string, which then gets processed again by the while loop.

static final String decode(final String in)
{
    String working = in;
    int index;
    index = working.indexOf("\\u");
    while(index > -1)
    {
        int length = working.length();
        if(index > (length-6))break;
        int numStart = index + 2;
        int numFinish = numStart + 4;
        String substring = working.substring(numStart, numFinish);
        int number = Integer.parseInt(substring,16);
        String stringStart = working.substring(0, index);
        String stringEnd   = working.substring(numFinish);
        working = stringStart + ((char)number) + stringEnd;
        index = working.indexOf("\\u");
    }
    return working;
}
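
One way to avoid the double decoding described above is to walk the input once and append to a separate buffer, so that decoded characters are never scanned again. A minimal sketch under that assumption (the names SinglePassUnescaper, decode and isHex are hypothetical and not from this answer):

```java
// Hypothetical helper, named for illustration only.
final class SinglePassUnescaper {
    static String decode(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            // An escape is six characters: a backslash, 'u', and four hex digits.
            if (in.startsWith("\\u", i) && i + 6 <= in.length() && isHex(in, i + 2, i + 6)) {
                out.append((char) Integer.parseInt(in.substring(i + 2, i + 6), 16));
                i += 6; // skip the whole escape; the decoded char is never re-scanned
            } else {
                out.append(in.charAt(i)); // ordinary character, or a malformed escape
                i++;
            }
        }
        return out.toString();
    }

    // True if every character in s[from, to) is a hexadecimal digit.
    private static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) return false;
        }
        return true;
    }
}
```

With this approach the input \u005Cu0048 comes out as the six characters \u0048 rather than H, and a truncated or malformed escape is left as it is instead of throwing.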

Solution 5

Shorter version:

public static String unescapeJava(String escaped) {
    if(escaped.indexOf("\\u")==-1)
        return escaped;

    String processed="";

    int position=escaped.indexOf("\\u");
    while(position!=-1) {
        if(position!=0)
            processed+=escaped.substring(0,position);
        String token=escaped.substring(position+2,position+6);
        escaped=escaped.substring(position+6);
        processed+=(char)Integer.parseInt(token,16);
        position=escaped.indexOf("\\u");
    }
    processed+=escaped;

    return processed;
}

Author by

SharonBL

Updated on September 17, 2021

Comments

SharonBL almost 3 years
I have a string with escaped Unicode characters, \uXXXX, and I want to convert it to regular Unicode letters. For example:
```
"\u0048\u0065\u006C\u006C\u006F World"
```
should become
```
"Hello World"
```
I know that when I print the first string it already shows Hello World. My problem is that I read file names from a file and then search for them. The file names in the file are escaped with Unicode encoding, and when I search for the files I can't find them, because the search looks for a file with \uXXXX in its name.
SharonBL about 12 years

That seems like it might be the solution. Do you have an idea how I can do it in Java? Can I do it with String.replaceAll or something like that?
NominSim about 12 years

@SharonBL I updated with some code, should at least give you an idea of where to start.
SharonBL about 12 years

Thank you very much for your help! I also found another solution for that: String s = StringEscapeUtils.unescapeJava("\\u20ac\\n"); it does the job!
Shreyansh Shah about 9 years

String sJava="\u0048\\u0065\u006C\u006C\u006F"; -----> Please do simple change.
Joseph Mekwan over 8 years

After adding the dependency in build.gradle (compile 'commons-lang:commons-lang:2.6'), the above works fine.
Scott Carey over 7 years

This does not work for surrogate pairs at all but is ok for ASCII or 'low' code points. Edit: now that I think about it a bit more, it will work OK with surrogate pairs too.
Eugene Lebedev over 6 years

This is an attempt to reinvent methods provided by the standard Java library. Just check the pure implementation: stackoverflow.com/a/39265921/1511077
andrew pate over 6 years

Thanks @EvgenyLebedev ... the standard library way looks good and presumably has been thoroughly tested, much appreciated.
Pedro Lobito about 6 years

I'm always amazed when a "reinvent the wheel" answer gets so many votes.
Mohsen Abasi almost 5 years

Does not work when there are non-Unicode characters inside the string, such as: href=\u0022\/en\/blog\/d-day-protecting-europe-its-demons\u0022\u003E\n
rustyx over 4 years

@PedroLobito that's because the linked post does absolutely nothing. myString.getBytes("UTF8") and then back to String does nothing.
rustyx over 4 years

String(string.toByteArray()) achieves literally nothing.
Eugene Lebedev over 4 years

@rustyx The toByteArray() method has a default argument of Charsets.UTF_8. You then create a string from the byte array with the required encoding. I tested this today, converting windows-1251 to UTF-8, and it works. I also compared the results at the byte level :)
Eugene Lebedev over 4 years

@rustyx here is a gist for you - gist.github.com/lebe-dev/31e31a3399c7885e298ed86810504676
Eugene Lebedev over 4 years

I agree with rustyx, because getBytes should take the source encoding as its argument, not UTF-8. After that, the string should be created from the byte array with the required encoding.
Евгений Шевченко over 2 years

You've just saved my day!
Austin Haws about 2 years

It appears StringEscapeUtils is now located in org.apache.commons.text.StringEscapeUtils