How to parse UTF-8 representation to String in Java?
Solution 1
This works, but only with ASCII. If you use Unicode characters outside the ASCII range, you will have problems, because each code point is stuffed into a single byte, whereas UTF-8 encodes every non-ASCII character as a sequence of two to four bytes. The cast to byte below is safe only because you have guaranteed that the input is basically ASCII (as you mention in your comments), so every value fits in one byte.
package sample;

import java.io.UnsupportedEncodingException;

public class UnicodeSample {
    public static final int HEXADECIMAL = 16;

    public static void main(String[] args) {
        try {
            String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
            String[] arr = str.replaceAll("\\\\u", " ").trim().split(" ");
            byte[] utf8 = new byte[arr.length];
            int index = 0;
            for (String ch : arr) {
                utf8[index++] = (byte) Integer.parseInt(ch, HEXADECIMAL);
            }
            String newStr = new String(utf8, "UTF-8");
            System.out.println(newStr);
        }
        catch (UnsupportedEncodingException e) {
            // handle the UTF-8 conversion exception
        }
    }
}
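To make the one-byte limitation concrete, here is a small sketch that reports how many bytes UTF-8 needs for a few sample characters, and what the cast to byte does to a value above 0xFF. The class name Utf8Widths and the helper utf8Length are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {
    // Hypothetical helper: number of bytes in the UTF-8 encoding of s
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("h"));       // U+0068, plain ASCII
        System.out.println(utf8Length("\u00e9"));  // U+00E9, e with acute accent
        System.out.println(utf8Length("\u3fff"));  // U+3FFF, a CJK ideograph
        // Casting a value above 0xFF to byte silently keeps only the low 8 bits
        System.out.println((byte) 0x3fff);
    }
}
```

Run as written, this prints 1, 2, 3 and -1: only the ASCII character fits in a single byte, and the cast turns 0x3fff into the lone byte 0xff, which is not valid UTF-8 on its own.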
Here is another solution that is not limited to ASCII. Each \uXXXX escape denotes one UTF-16 code unit, which is exactly what a Java char holds, so the parsed value can be appended to a StringBuilder directly, with no byte array and no UTF-8 decoding step. (Splitting the parsed integer into raw bytes and decoding those as UTF-8 does not work: 0x3fff, for example, would yield the bytes 0x3f 0xff, which are not a valid UTF-8 sequence.) Thanks to deceze for the questions. You made me think more about the problem and solution.
package sample;

public class UnicodeSample {
    public static final int HEXADECIMAL = 16;

    public static void main(String[] args) {
        String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a\\u3fff\\uf34c";
        StringBuilder sb = new StringBuilder();
        // Drop the escape prefixes, leaving the hex digits separated by spaces
        String[] codes = str.replaceAll("\\\\u", " ").trim().split(" ");
        for (String c : codes) {
            // Each escape is one UTF-16 code unit, i.e. one Java char
            sb.append((char) Integer.parseInt(c, HEXADECIMAL));
        }
        str = sb.toString();
        System.out.println(str);
    }
}
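The splitting approach above assumes the input consists of nothing but escapes. As a rough sketch of a variant that also tolerates ordinary text between escapes, a regular expression can locate each escape and leave everything else untouched. The names UnescapeSketch and unescapeUnicode are invented for this example, and the sketch deliberately ignores other Java escapes as well as escaped backslashes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnescapeSketch {
    // Matches a backslash, one or more 'u', then exactly four hex digits
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each escape with the char it denotes
    static String unescapeUnicode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the literal text that precedes this escape
            out.append(input, last, m.start());
            // Each escape is one UTF-16 code unit, so a cast to char suffices
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        // Copy whatever follows the final escape
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
        System.out.println(unescapeUnicode(tmp).equals("hello\n"));
    }
}
```

Because the code units are appended in order, characters written as a surrogate pair of two escapes come out intact without special handling.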
Solution 2
Firstly, are you just trying to parse a string literal, or is tmp going to be some user-entered data?
If this is going to be a string literal (i.e. hard-coded string), it can be encoded using Unicode escapes. In your case, this just means using single backslashes instead of double backslashes:
String result = "\u0068\u0065\u006c\u006c\u006f\u000a";
If, however, you need to use Java's string parsing rules to parse user input, a good starting point might be Apache Commons Lang's StringEscapeUtils.unescapeJava() method.
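The difference between the two spellings can be checked directly. In this sketch (the class and field names are made up for illustration), the single-backslash literal is translated by the compiler, while the double-backslash literal keeps the six escape characters at run time.

```java
public class LiteralVsRuntime {
    // Single backslash: the compiler translates the escape into one char
    static final String COMPILED = "\u0068";
    // Double backslash: six chars at run time (backslash, 'u', four digits)
    static final String RAW = "\\u0068";

    public static void main(String[] args) {
        System.out.println(COMPILED.length());    // 1
        System.out.println(RAW.length());         // 6
        System.out.println(COMPILED.equals("h")); // true
    }
}
```

Only strings of the second kind, which is also what user input looks like, need an explicit unescaping step.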
Solution 3
I'm sure there must be a better way, but using just the JDK:
public static String handleEscapes(final String s)
{
    final java.util.Properties props = new java.util.Properties();
    props.setProperty("foo", s);
    final java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
    try
    {
        props.store(baos, null);
        final String tmp = baos.toString().replace("\\\\", "\\");
        props.load(new java.io.StringReader(tmp));
    }
    catch(final java.io.IOException ioe) // shouldn't happen
    { throw new RuntimeException(ioe); }
    return props.getProperty("foo");
}
This uses java.util.Properties.load(java.io.Reader) to process the backslash-escapes (after first using java.util.Properties.store(java.io.OutputStream, java.lang.String) to backslash-escape anything that would cause problems in a properties-file, and then using replace("\\\\", "\\") to reverse the backslash-escaping of the original backslashes).
(Disclaimer: even though I tested all the cases I could think of, there are still probably some that I didn't think of.)
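To isolate the part of this trick that does the real work, here is a stripped-down sketch that feeds a single key=value line straight to Properties.load. It skips the store() round trip, so it assumes the value contains no line breaks, leading whitespace or other characters that are special in a properties file. The class name, the helper name and the key "k" are all invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class PropertiesEscapeDemo {
    // Hypothetical helper: lets Properties.load decode the escapes in value.
    // Assumption: value is a single line with no properties-file specials.
    static String decodeViaProperties(String value) {
        Properties props = new Properties();
        try {
            props.load(new StringReader("k=" + value));
        } catch (IOException e) {
            // A StringReader should not fail with an I/O error
            throw new RuntimeException(e);
        }
        return props.getProperty("k");
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f";
        System.out.println(decodeViaProperties(tmp).equals("hello"));
    }
}
```

Note that Properties.load throws an IllegalArgumentException when it meets a malformed Unicode escape, so untrusted input needs a guard around the call.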
Stephan
Updated on June 11, 2022
Comments
-
Stephan almost 2 years
Given the following code:
String tmp = new String("\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a");
String result = convertToEffectiveString(tmp); // result now contains "hello\n"
Does the JDK already provide some classes for doing this? Is there a library that does this (preferably available through Maven)?
I have tried with ByteArrayOutputStream with no success.
-
Gromski about 12 years
What are "Unicode characters other than UTF-8"? How can a Unicode/UTF-8 character be "stuffed into a byte"? I don't know if you mean the right thing and are not expressing it clearly enough, but that reads mostly wrong.
-
jmq about 12 years
If you use a different unicode character set in the string "str" other than UTF-8, this code may not work. UTF-8 is still using 8 bits, where other unicode character sets may (probably) use more than 8 bits (all 16 bits instead). joelonsoftware.com/articles/Unicode.html
-
Stephan about 12 years
Obviously, in the general case, this code is not enough. But in my case, the input is guaranteed to be fully translatable into ASCII.
-
Gromski about 12 years
@jmq Do you mean if the source code is encoded in a different character set than UTF-8 (which I don't think matters in Java)? Because, while I don't really know Java, those look like Unicode code points, not UTF-8 specific bytes. kunststube.net/encoding
-
Gromski about 12 years
@jmq Hmm, your corrected statement makes more sense, but UTF-8 will use more than one byte for non-ASCII characters. This one happens to work because the text is basically just ASCII, but it'll fail for cases that actually contain "Unicode characters" (i.e. non-ASCII characters).