How to parse UTF-8 representation to String in Java?

10,750

Solution 1

This works, but only with ASCII. If the input contains Unicode characters outside the ASCII range, you will have problems, because each escape value is cast to a single byte, whereas UTF-8 encodes every non-ASCII character as two or more bytes. The typecast below is safe only because the input is guaranteed to be basically ASCII (as you mention in your comments), so every value fits in one byte.

package sample;

import java.io.UnsupportedEncodingException;

public class UnicodeSample {
    public static final int HEXADECIMAL = 16;

    public static void main(String[] args) {

        try {
            String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";

            String arr[] = str.replaceAll("\\\\u"," ").trim().split(" ");
            byte[] utf8 = new byte[arr.length];

            int index=0;
            for (String ch : arr) {
                utf8[index++] = (byte)Integer.parseInt(ch,HEXADECIMAL);
            }

            String newStr = new String(utf8, "UTF-8");
            System.out.println(newStr);

        }
        catch (UnsupportedEncodingException e) {
            // handle the UTF-8 conversion exception
        }
    }
}
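
To see why the single-byte cast only holds for ASCII, it helps to check how many bytes UTF-8 actually uses per character. This is just an illustrative sketch; the class name Utf8ByteLengths and the helper utf8Length are made up for this example and are not part of the code above.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteLengths {
    // Number of bytes UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("h"));      // ASCII letter: prints 1
        System.out.println(utf8Length("\u00e9")); // U+00E9: prints 2
        System.out.println(utf8Length("\u3fff")); // U+3FFF: prints 3
    }
}
```

Anything above U+007F needs more than one byte, so a value such as 0x3fff cannot survive a cast to byte.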

Here is another solution that is meant to fix the ASCII-only limitation, so that it handles any Unicode character rather than only values that fit in a single byte. Thanks to deceze for the questions. You made me think more about the problem and the solution.

package sample;

public class UnicodeSample {
    public static final int HEXADECIMAL = 16;

    public static void main(String[] args) {
        String str = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a\\u3fff\\uf34c";

        // Split the input into its four-digit hex values
        String[] codes = str.replaceAll("\\\\u", " ").trim().split(" ");

        // Each value is a UTF-16 code unit (a Java char), not a UTF-8 byte,
        // so append it as a char instead of packing its raw bytes into a
        // byte array and decoding that array as UTF-8
        StringBuilder sb = new StringBuilder(codes.length);
        for (String c : codes) {
            sb.append((char) Integer.parseInt(c, HEXADECIMAL));
        }

        str = sb.toString();
        System.out.println(str);
    }
}

Solution 2

Firstly, are you just trying to parse a string literal, or is tmp going to be some user-entered data?

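If this is going to be a string literal (i.e. a hard-coded string), it can be written using Unicode escapes. In your case, this just means using single backslashes instead of double backslashes. The one exception is \u000a: the compiler translates Unicode escapes before it parses the literal, so \u000a would put a raw line break inside the string and the code would not compile. Write \n for that character instead:

String result = "\u0068\u0065\u006c\u006c\u006f\n";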

If, however, you need to use Java's string parsing rules to parse user input, a good starting point might be Apache Commons Lang's StringEscapeUtils.unescapeJava() method.
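
If adding a dependency is not an option, something along these lines might be enough for the narrow case in the question. It is a minimal sketch using only java.util.regex, and it decodes only \uXXXX sequences: it does not handle \n, \t, octal escapes, or escaped backslashes the way a full Java unescaper would. The class name UnicodeUnescaper and the method unescape are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescaper {
    // A backslash, the letter u, then exactly four hex digits (captured).
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous escape and this one unchanged.
            out.append(input, last, m.start());
            // The four hex digits are one UTF-16 code unit, i.e. one char.
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a";
        System.out.print(unescape(tmp)); // prints "hello" followed by a newline
    }
}
```

Because each escape is appended as a char, a surrogate pair written as two consecutive escapes ends up as one supplementary character in the resulting String.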

Solution 3

I'm sure there must be a better way, but using just the JDK:

public static String handleEscapes(final String s)
{
    final java.util.Properties props = new java.util.Properties();
    props.setProperty("foo", s);
    final java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
    try
    {
        props.store(baos, null);
        final String tmp = baos.toString().replace("\\\\", "\\");
        props.load(new java.io.StringReader(tmp));
    }
    catch(final java.io.IOException ioe) // shouldn't happen
        { throw new RuntimeException(ioe); }
    return props.getProperty("foo");
}

This uses java.util.Properties.load(java.io.Reader) to process the backslash escapes. Before that, it calls java.util.Properties.store(java.io.OutputStream, java.lang.String) to backslash-escape anything that would cause problems in a properties file, and then replace("\\\\", "\\") to undo the escaping of the original backslashes.

(Disclaimer: even though I tested all the cases I could think of, there are still probably some that I didn't think of.)
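
The step that does the real work is Properties.load(Reader), which decodes \uXXXX sequences in property values. Stripped down, the idea looks roughly like the sketch below. It skips the store step, so it is only safe for input that contains nothing else a properties file treats specially (line breaks, other backslash sequences, and so on). The class name PropertiesEscapeDemo and the method decode are hypothetical names for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesEscapeDemo {
    // Lets Properties.load decode the escapes in a single "key=value" line.
    static String decode(String escapedValue) {
        Properties props = new Properties();
        try {
            props.load(new StringReader("foo=" + escapedValue));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return props.getProperty("foo");
    }

    public static void main(String[] args) {
        String tmp = "\\u0068\\u0065\\u006c\\u006c\\u006f";
        System.out.println(decode(tmp)); // prints "hello"
    }
}
```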

Author: Stephan


Updated on June 11, 2022

Comments

  • Stephan
    Stephan 4 months

    Given the following code:

    String tmp = new String("\\u0068\\u0065\\u006c\\u006c\\u006f\\u000a");
    
    String result = convertToEffectiveString(tmp); // result now contains "hello\n"
    

    Does the JDK already provide classes for doing this? Is there a library that does this (preferably available through Maven)?

    I have tried with ByteArrayOutputStream with no success.

  • deceze
    deceze over 10 years
    What are "Unicode characters other than UTF-8"? How can a Unicode/UTF-8 character be "stuffed into a byte"? I don't know if you mean the right thing and are not expressing it clearly enough, but that reads mostly wrong.
  • jmq
    jmq over 10 years
    If you use a different unicode character set in the string "str" other than UTF-8, this code may not work. UTF-8 is still using 8 bits, where other unicode character sets may (probably) use more than 8 bits (all 16 bits instead). joelonsoftware.com/articles/Unicode.html
  • Stephan
    Stephan over 10 years
    Obviously, in the general case, this code is not enough. But in my case, the input is guaranteed to be fully translatable into ASCII.
  • deceze
    deceze over 10 years
    @jmq Do you mean if the source code is encoded in a different character set than UTF-8 (which I don't think matters in Java)? Because, while I don't really know Java, those look like Unicode code points, not UTF-8 specific bytes. kunststube.net/encoding
  • deceze
    deceze over 10 years
    @jmq Hmm, your corrected statement makes more sense, but UTF-8 will use more than one byte for non-ASCII characters. This one happens to work because the text is basically just ASCII, but it'll fail for cases that actually contain "Unicode characters" (i.e. non-ASCII characters).