How to parse UTF-8 representation to String in Java?

java utf-8 ascii

10,750

Solution 1

This works, but only for ASCII. Each \uXXXX value is cast to a single byte, whereas UTF-8 encodes every code point above U+007F as a sequence of two to four bytes, so any non-ASCII escape will be decoded incorrectly. The typecast below is safe only because you have guaranteed that the input is effectively ASCII (as you mention in your comments).

package sample;
import java.io.UnsupportedEncodingException;
public class UnicodeSample {
    public static final int HEXADECIMAL = 16;
    public static void main(String[] args) {
        try {
            String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
            String arr[] = str.replaceAll("\\\\u"," ").trim().split(" ");
            byte[] utf8 = new byte[arr.length];
            int index=0;
            for (String ch : arr) {
                utf8[index++] = (byte)Integer.parseInt(ch,HEXADECIMAL);
            }
            String newStr = new String(utf8, "UTF-8");
            System.out.println(newStr);
        }
        catch (UnsupportedEncodingException e) {
            // handle the UTF-8 conversion exception
        }
    }
}
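
To make the ASCII-only limitation concrete, here is a minimal sketch that compares the single-byte cast with what UTF-8 actually requires. The class name `ByteCastDemo` and its two helper methods are made up for this illustration and are not part of the answer's code.

```java
import java.nio.charset.StandardCharsets;

final class ByteCastDemo {
    // Mimics the cast used above: one code point squeezed into one byte, then decoded as UTF-8.
    static String decodeViaByteCast(int codePoint) {
        byte[] single = { (byte) codePoint };
        return new String(single, StandardCharsets.UTF_8);
    }

    // Number of bytes UTF-8 really needs for the given code point.
    static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(decodeViaByteCast(0x68));                    // h
        System.out.println(utf8Length(0xE9));                           // 2
        // The lone byte 0xE9 is not valid UTF-8, so decoding yields the replacement character U+FFFD.
        System.out.println(decodeViaByteCast(0xE9).charAt(0) == 0xFFFD); // true
    }
}
```

For U+0068 the cast and the UTF-8 encoding coincide, which is why the ASCII case works; for U+00E9 they do not.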

Here is another solution that also handles non-ASCII characters. Each \uXXXX escape denotes one UTF-16 code unit, which is exactly what a Java char holds, so the parsed values can be appended to a StringBuilder directly; no byte array or charset decoding is needed. (Splitting the value into its raw bytes and decoding those as UTF-8 does not work, because the raw bytes of a code point are not its UTF-8 encoding.) Thanks to deceze for the questions. You made me think more about the problem and solution.

package sample;
public class UnicodeSample {
    public static final int HEXADECIMAL = 16;
    public static void main(String[] args) {
        String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a\\u3fff\\uf34c";
        // Strip the backslash-u prefixes, leaving one four-digit hex value per element
        String[] codes = str.replaceAll("\\\\u", " ").trim().split(" ");
        StringBuilder sb = new StringBuilder(codes.length);
        for (String c : codes) {
            // Each value is a single UTF-16 code unit, so it fits in a char
            sb.append((char) Integer.parseInt(c, HEXADECIMAL));
        }
        str = sb.toString();
        System.out.println(str);
    }
}

Solution 2

Firstly, are you just trying to parse a string literal, or is tmp going to be some user-entered data?

If this is going to be a string literal (i.e. hard-coded string), it can be encoded using Unicode escapes. In your case, this just means using single backslashes instead of double backslashes:

String result = "\u0068\u0065\u006c\u006c\u006f\u000a";

If, however, you need to use Java's string parsing rules to parse user input, a good starting point might be Apache Commons Lang's StringEscapeUtils.unescapeJava() method.
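
If pulling in a library is not an option, the \uXXXX part of that parsing can be approximated with the JDK's regex classes. The following is only a sketch under stated assumptions: the class name `UnicodeUnescaper` and the method `unescape` are invented here, and it handles Unicode escapes only, not other Java escapes such as \n or \\.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeUnescaper {
    // Matches a backslash, one or more 'u' characters, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous escape and this one unchanged.
            out.append(input, last, m.start());
            // The four hex digits are one UTF-16 code unit.
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
        System.out.print(unescape(tmp)); // prints "hello" followed by a newline
    }
}
```

Because each escape is appended as a char, surrogate pairs written as two consecutive escapes are reassembled correctly as well.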

Solution 3

I'm sure there must be a better way, but using just the JDK:

public static String handleEscapes(final String s)
{
    final java.util.Properties props = new java.util.Properties();
    props.setProperty("foo", s);
    final java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
    try
    {
        props.store(baos, null);
        final String tmp = baos.toString().replace("\\\\", "\\");
        props.load(new java.io.StringReader(tmp));
    }
    catch(final java.io.IOException ioe) // shouldn't happen
        { throw new RuntimeException(ioe); }
    return props.getProperty("foo");
}

This uses java.util.Properties.load(java.io.Reader) to process the backslash escapes. It first calls java.util.Properties.store(java.io.OutputStream, java.lang.String) to backslash-escape anything that would cause problems in a properties file, and then uses replace("\\\\", "\\") to undo the escaping that store applied to the original backslashes.

(Disclaimer: even though I tested all the cases I could think of, there are still probably some that I didn't think of.)
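
A quick way to try a few inputs is sketched below. The method body is repeated from above so that the snippet compiles on its own; the wrapper class name `HandleEscapesDemo` is hypothetical, and the two inputs are just examples, not an exhaustive test.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class HandleEscapesDemo {
    // Same logic as handleEscapes above, repeated so this snippet is self-contained.
    static String handleEscapes(final String s) {
        final Properties props = new Properties();
        props.setProperty("foo", s);
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try {
            props.store(baos, null);
            final String tmp = baos.toString().replace("\\\\", "\\");
            props.load(new StringReader(tmp));
        } catch (final IOException ioe) { // shouldn't happen
            throw new RuntimeException(ioe);
        }
        return props.getProperty("foo");
    }

    public static void main(String[] args) {
        // Unicode escapes written as literal backslash sequences, as in the question.
        System.out.println(handleEscapes("\\u0068\\u0065\\u006c\\u006c\\u006f").equals("hello")); // true
        // Text without backslashes should come back unchanged.
        System.out.println(handleEscapes("plain text").equals("plain text")); // true
    }
}
```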

Author by

Stephan

Updated on June 11, 2022

Comments

Stephan 4 months
Given the following code:
```
String tmp = new String("\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a");
String result = convertToEffectiveString(tmp); // result now contains "hello\n"
```
Does the JDK already provide classes for doing this? Is there a library that does this (preferably one available through Maven)?

I have tried with ByteArrayOutputStream with no success.
deceze over 10 years

What are "Unicode characters other than UTF-8"? How can a Unicode/UTF-8 character be "stuffed into a byte"? I don't know if you mean the right thing and are not expressing it clearly enough, but that reads mostly wrong.
jmq over 10 years

If you use a different unicode character set in the string "str" other than UTF-8, this code may not work. UTF-8 is still using 8 bits, where other unicode character sets may (probably) use more than 8 bits (all 16 bits instead). joelonsoftware.com/articles/Unicode.html
Stephan over 10 years

Obviously, in the general case, this code is not enough. But in my case, the input is guaranteed to be fully translatable into ASCII.
deceze over 10 years

@jmq Do you mean if the source code is encoded in a different character set than UTF-8 (which I don't think matters in Java)? Because, while I don't really know Java, those look like Unicode code points, not UTF-8 specific bytes. kunststube.net/encoding
deceze over 10 years

@jmq Hmm, your corrected statement makes more sense, but UTF-8 will use more than one byte for non-ASCII characters. This one happens to work because the text is basically just ASCII, but it'll fail for cases that actually contain "Unicode characters" (i.e. non-ASCII characters).