How to convert a string with Unicode encoding to a string of letters
Solution 1
Technically, writing:
String myString = "\u0048\u0065\u006C\u006C\u006F World";
already gives you "Hello World", because the Java compiler translates the escapes in a string literal. So I assume you are reading the string in from some file. To convert it to "Hello", you'll have to parse the text into the separate Unicode escapes: take each \uXXXX, keep just the XXXX, call Integer.parseInt(XXXX, 16) to get its numeric value, and then cast that to char to get the actual character.
Edit: Some code to accomplish this:
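// myString holds the raw text read from the file, escapes still spelled out
String str = myString.split(" ")[0];   // keep only the first word (the escaped part)
str = str.replace("\\", "");           // drop the backslashes: u0048u0065...
String[] arr = str.split("u");         // arr[0] is empty, then one hex code per entry
String text = "";
for (int i = 1; i < arr.length; i++) {
    int hexVal = Integer.parseInt(arr[i], 16);
    text += (char) hexVal;
}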
// Text will now have Hello
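The snippet above only handles the first word and assumes it contains nothing but escapes. As a rough sketch of the same parseInt-and-cast idea applied to a whole string, a regular expression can pick out each escape and leave everything else alone. The class and method names here are made up for illustration, not taken from any library:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeRegexDemo {
    // Matches a backslash, the letter 'u', and exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Hypothetical helper: replaces every matched escape with the char it encodes.
    static String unescape(String input) {
        Matcher m = UNICODE_ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops a decoded '\' or '$' from being
            // interpreted as replacement syntax.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes simulate text read from a file.
        String fromFile = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
        System.out.println(unescape(fromFile)); // Hello World
    }
}
```

Text that is not part of an escape, such as " World" here, passes through unchanged.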
Solution 2
The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.
import org.apache.commons.lang.StringEscapeUtils;
@Test
public void testUnescapeJava() {
    String sJava="\\u0048\\u0065\\u006C\\u006C\\u006F";
    System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava));
}
output:
StringEscapeUtils.unescapeJava(sJava):
Hello
Solution 3
You can use StringEscapeUtils from Apache Commons Lang, i.e.:
String Title = StringEscapeUtils.unescapeJava("\\u0048\\u0065\\u006C\\u006C\\u006F");
Solution 4
This simple method works for most cases, but it trips up on input like "\u005Cu0048", which should decode to the string "\u0048" but actually decodes to "H": the first pass produces "\u0048" as the working string, which is then processed again by the while loop.
static final String decode(final String in)
{
    String working = in;
    int index;
    index = working.indexOf("\\u");
    while(index > -1)
    {
        int length = working.length();
        if(index > (length-6))break;
        int numStart = index + 2;
        int numFinish = numStart + 4;
        String substring = working.substring(numStart, numFinish);
        int number = Integer.parseInt(substring,16);
        String stringStart = working.substring(0, index);
        String stringEnd = working.substring(numFinish);
        working = stringStart + ((char)number) + stringEnd;
        index = working.indexOf("\\u");
    }
    return working;
}
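One way to avoid the double-decoding problem described above is to scan the input once and never look at the output again. The following is only a sketch of that idea; SinglePassDecoder and hex4 are names invented here, and it deliberately leaves malformed or truncated escapes as they are instead of throwing:

```java
class SinglePassDecoder {
    // Parses four hex digits starting at 'start'; returns -1 if any is not a hex digit.
    static int hex4(String s, int start) {
        int value = 0;
        for (int k = start; k < start + 4; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            if (digit < 0) {
                return -1;
            }
            value = value * 16 + digit;
        }
        return value;
    }

    static String decode(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            // An escape needs a backslash, 'u', and four more characters.
            if (in.startsWith("\\u", i) && i + 6 <= in.length()) {
                int code = hex4(in, i + 2);
                if (code >= 0) {
                    out.append((char) code);
                    i += 6; // skip the escape; the decoded char is never rescanned
                    continue;
                }
            }
            // Anything else, including other backslash sequences, is copied as-is.
            out.append(in.charAt(i));
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints a backslash followed by u0048, not "H".
        System.out.println(decode("\\u005Cu0048"));
        System.out.println(decode("\\u0048\\u0065\\u006C\\u006C\\u006F World")); // Hello World
    }
}
```

Because only the input is scanned, a decoded backslash cannot combine with the following characters into a new escape. Sequences such as \/ or \n are copied through untouched, since this sketch only knows about the four-digit Unicode form.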
Solution 5
Shorter version:
public static String unescapeJava(String escaped) {
    if(escaped.indexOf("\\u")==-1)
        return escaped;
    String processed="";
    int position=escaped.indexOf("\\u");
    while(position!=-1) {
        if(position!=0)
            processed+=escaped.substring(0,position);
        String token=escaped.substring(position+2,position+6);
        escaped=escaped.substring(position+6);
        processed+=(char)Integer.parseInt(token,16);
        position=escaped.indexOf("\\u");
    }
    processed+=escaped;
    return processed;
}
SharonBL
Updated on September 17, 2021
Comments
-
SharonBL almost 3 years
I have a string with escaped Unicode characters, \uXXXX, and I want to convert it to regular Unicode letters. For example: "\u0048\u0065\u006C\u006C\u006F World" should become "Hello World".
I know that when I print the first string it already shows Hello world. My problem is I read file names from a file, and then I search for them. The file names in the file are escaped with Unicode encoding, and when I search for the files, I can't find them, since it searches for a file with \uXXXX in its name.
-
SharonBL about 12 years: Seems that might be the solution. Do you have an idea how I can do it in Java? Can I do it with String.replaceAll or something like that?
-
NominSim about 12 years: @SharonBL I updated with some code, should at least give you an idea of where to start.
-
SharonBL about 12 years: Thank you very much for your help! I also found another solution for that: String s = StringEscapeUtils.unescapeJava("\\u20ac\\n"); it does the work!
-
Shreyansh Shah about 9 years: String sJava="\u0048\\u0065\u006C\u006C\u006F"; -----> Please do simple change.
-
Joseph Mekwan over 8 years: After adding the dependency in build.gradle (compile 'commons-lang:commons-lang:2.6'), the above works fine.
-
Scott Carey over 7 years: This does not work for surrogate pairs at all but is ok for ASCII or 'low' code points. Edit: now that I think about it a bit more, it will work OK with surrogate pairs too.
-
Eugene Lebedev over 6 years: attempt to reinvent standard methods provided by Standard Java Library. just check pure implementation stackoverflow.com/a/39265921/1511077
-
Eugene Lebedev over 6 years: attempt to reinvent methods provided by Standard Java Library. just check pure implementation stackoverflow.com/a/39265921/1511077
-
andrew pate over 6 years: Thanks @EvgenyLebedev ... the standard library way looks good and presumably has been thoroughly tested, much appreciated.
-
Pedro Lobito about 6 years: I'm always amazed when a "reinvent the wheel" answer gets so many votes.
-
Mohsen Abasi almost 5 years: Does not work when there are non-Unicode characters inside the string, such as: href=\u0022\/en\/blog\/d-day-protecting-europe-its-demons\u0022\u003E\n
-
rustyx over 4 years: @PedroLobito that's because the linked post does absolutely nothing. myString.getBytes("UTF8") and then back to String does nothing.
-
rustyx over 4 years: String(string.toByteArray()) achieves literally nothing.
-
Eugene Lebedev over 4 years: @rustyx Method toByteArray() has a default argument with Charsets.UTF_8. Then you create a string from the byte array with the required encoding. I did a test today with windows-1251 to utf-8, it works. Also I did a comparison at byte level :)
-
Eugene Lebedev over 4 years: @rustyx here is a gist for you - gist.github.com/lebe-dev/31e31a3399c7885e298ed86810504676
-
Eugene Lebedev over 4 years: I agree with rustyx because getBytes should contain the source encoding as its argument, not utf-8. After that the string should be created with the required encoding from the ByteArray.
-
Евгений Шевченко over 2 years: You've just saved my day!
-
Austin Haws about 2 years: It appears StringEscapeUtils is now located in org.apache.commons.text.StringEscapeUtils
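
For anyone following the getBytes discussion in the comments, here is a small check of rustyx's point, written as a sketch with made-up names (RoundTripDemo, roundTrip). Encoding a string to UTF-8 bytes and decoding those bytes as UTF-8 again gives back the same characters, so the backslash-u sequences stay spelled out:

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes to UTF-8 bytes and decodes them again with the same charset.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The doubled backslashes simulate escaped text read from a file.
        String escaped = "\\u0048\\u0065\\u006C\\u006C\\u006F";
        String result = roundTrip(escaped);
        System.out.println(result.equals(escaped)); // true
        System.out.println(result.equals("Hello")); // false
    }
}
```

A charset conversion only changes how characters are stored as bytes. Turning the escape text into letters still needs one of the unescaping approaches from the answers.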