How can I check whether a byte array contains a Unicode string in Java?


Solution 1

It's not possible to make that decision with full accuracy in all cases, because a UTF-8 encoded string is one kind of arbitrary binary data, but you can look for byte sequences that are invalid in UTF-8. If you find any, you know that the data is not UTF-8.

If your array is large enough, this should work out well, since such invalid sequences are very likely to appear in "random" binary data such as compressed data or image files.

However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of different scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.
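The "reject on invalid byte sequences" check described here can be sketched with the JDK's `CharsetDecoder`, configured to report (rather than replace) malformed input. This is only a sketch; the class and method names are my own:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    // Note: true means "could be UTF-8 text", not "is text".
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For example, the two-byte sequence `C0 80` (an overlong encoding of NUL) is rejected, while any ASCII text passes.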

Solution 2

Here's a way to use the UTF-8 "binary" regex from the W3C site:

static boolean looksLikeUTF8(byte[] utf8) throws UnsupportedEncodingException 
{
  Pattern p = Pattern.compile("\\A(\n" +
    "  [\\x09\\x0A\\x0D\\x20-\\x7E]             # ASCII\\n" +
    "| [\\xC2-\\xDF][\\x80-\\xBF]               # non-overlong 2-byte\n" +
    "|  \\xE0[\\xA0-\\xBF][\\x80-\\xBF]         # excluding overlongs\n" +
    "| [\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2}  # straight 3-byte\n" +
    "|  \\xED[\\x80-\\x9F][\\x80-\\xBF]         # excluding surrogates\n" +
    "|  \\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}      # planes 1-3\n" +
    "| [\\xF1-\\xF3][\\x80-\\xBF]{3}            # planes 4-15\n" +
    "|  \\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}      # plane 16\n" +
    ")*\\z", Pattern.COMMENTS);

  String phonyString = new String(utf8, "ISO-8859-1");
  return p.matcher(phonyString).matches();
}

As originally written, the regex is meant to be used on a byte array, but you can't do that with Java's regexes; the target has to be something that implements the CharSequence interface (so a char[] is out, too). By decoding the byte[] as ISO-8859-1, you create a String in which each char has the same unsigned numeric value as the corresponding byte in the original array.

As others have pointed out, tests like this can only tell you the byte[] could contain UTF-8 text, not that it does. But the regex is so exhaustive, it seems extremely unlikely that raw binary data could slip past it. Even an array of all zeroes wouldn't match, since the regex never matches NUL. If the only possibilities are UTF-8 and binary, I'd be willing to trust this test.

And while you're at it, you could strip the UTF-8 BOM if there is one; otherwise, the UTF-8 CharsetDecoder will pass it through as if it were text.
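Stripping the UTF-8 BOM (the three bytes `EF BB BF`) is straightforward; a minimal sketch (the helper name is my own):

```java
import java.util.Arrays;

public class BomStripper {
    // Removes a leading UTF-8 BOM (EF BB BF) if present;
    // otherwise returns the array unchanged.
    public static byte[] stripUtf8Bom(byte[] bytes) {
        if (bytes.length >= 3
                && bytes[0] == (byte) 0xEF
                && bytes[1] == (byte) 0xBB
                && bytes[2] == (byte) 0xBF) {
            return Arrays.copyOfRange(bytes, 3, bytes.length);
        }
        return bytes;
    }
}
```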

UTF-16 would be much more difficult, because there are very few byte sequences that are always invalid. The only ones I can think of offhand are high-surrogate characters that are missing their low-surrogate companions, or vice versa. Beyond that, you would need some context to decide whether a given sequence is valid. You might have a Cyrillic letter followed by a Chinese ideogram followed by a smiley-face dingbat, but it would be perfectly valid UTF-16.
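The unpaired-surrogate check mentioned above can be sketched over already-decoded UTF-16 data (the helper name is my own):

```java
public class SurrogateCheck {
    // Returns true if every high surrogate is immediately followed by a
    // low surrogate, and no low surrogate appears on its own.
    public static boolean hasWellFormedSurrogates(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // unpaired high surrogate
                }
                i++; // skip the paired low surrogate
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high surrogate
            }
        }
        return true;
    }
}
```

As the text notes, passing this check says nothing about whether the content is meaningful; a jumble of scripts is still well-formed UTF-16.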

Solution 3

The question assumes that there is a fundamental difference between a string and binary data. While this is intuitively so, it is next to impossible to define precisely what that difference is.

A Java String is a sequence of 16-bit quantities that correspond to one of the (almost) 2**16 Unicode basic codepoints. But if you look at those 16-bit 'characters', each one could equally represent an integer, a pair of bytes, a pixel, and so on. The bit patterns don't have anything intrinsic about them that says what they represent.

Now suppose that you rephrased your question as asking for a way to distinguish UTF-8 encoded TEXT from arbitrary binary data. Does this help? In theory, no, because the bit patterns that encode any written text can also be a sequence of numbers. (It is hard to say what "arbitrary" really means here. Can you tell me how to test whether a number is "arbitrary"?)

The best we can do here is the following:

  1. Test if the bytes are a valid UTF-8 encoding.
  2. Test if the decoded 16-bit quantities are all legal, "assigned" Unicode code points. (Some 16-bit quantities are illegal, e.g. 0xFFFF, and others are not currently assigned to correspond to any character.) But what if a text document really uses an unassigned codepoint?
  3. Test if the Unicode codepoints belong to the "planes" that you expect based on the assumed language of the document. But what if you don't know what language to expect, or if a document uses multiple languages?
  4. Test if the sequences of codepoints look like words, sentences, or whatever. But what if some "binary data" happened to include embedded text sequences?

In summary, you can tell that a byte sequence is definitely not UTF-8 if the decode fails. Beyond that, if you make assumptions about language, you can say that a byte sequence is probably or probably not a UTF-8 encoded text document.
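Putting this together for the `getString` function from the question, a minimal sketch of the "decode or Base64" policy (the Base64 fallback follows the comments in the question's function, and all the caveats above still apply):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DataClassifier {
    // Return the decoded text if the bytes are well-formed UTF-8;
    // otherwise return the Base64 encoding of the raw bytes.
    public static String getString(byte[] dataToProcess) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(dataToProcess))
                    .toString();
        } catch (CharacterCodingException e) {
            return Base64.getEncoder().encodeToString(dataToProcess);
        }
    }
}
```

Remember that binary data which happens to be well-formed UTF-8 will come back as (possibly nonsensical) text, not Base64; the decision is a heuristic, not a guarantee.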

IMO, the best thing you can do is to avoid getting into a situation where your program needs to make this decision. And if you cannot avoid it, recognize that your program may get it wrong. With thought and hard work, you can make that unlikely, but the probability will never be zero.

Solution 4

In the original question, How can I check whether a byte array contains a Unicode string in Java?, the term "Java Unicode" essentially refers to UTF-16 code units. I went through this problem myself and created some code that could help anyone with this type of question find some answers.

I have created two main methods: one will display UTF-8 code units and the other will display UTF-16 code units. UTF-16 code units are what you will encounter with Java and JavaScript, commonly seen in the form "\ud83d".

For more help with code units and conversion, try this website:

https://r12a.github.io/apps/conversion/

Here is code...

    // 'text' is any existing CharSequence/String you want to inspect
    byte[] array_bytes = text.toString().getBytes();
    char[] array_chars = text.toString().toCharArray();
    System.out.println();
    byteArrayToUtf8CodeUnits(array_bytes);
    System.out.println();
    charArrayToUtf16CodeUnits(array_chars);


public static void byteArrayToUtf8CodeUnits(byte[] byte_array)
{
    System.out.println("byte_array.length = " + byte_array.length);
    //------------------------------------------------------------------------------------------
    for (int k = 0; k < byte_array.length; k++)
    {
        System.out.println("array byte: " + "[" + k + "]" + " converted to hex" + " = " + byteToHex(byte_array[k]));
    }
    //------------------------------------------------------------------------------------------
}
public static void charArrayToUtf16CodeUnits(char[] char_array)
{
    /*Utf16 code units are also known as Java Unicode*/
    System.out.println("char_array.length = " + char_array.length);
    //------------------------------------------------------------------------------------------
    for (int i = 0; i < char_array.length; i++)
    {
        System.out.println("array char: " + "[" + i + "]" + " converted to hex" + " = " + charToHex(char_array[i]));
    }
    //------------------------------------------------------------------------------------------
}
public static String byteToHex(byte b)
{
    //Returns hex String representation of byte b
    char hexDigit[] =
            {
                    '0', '1', '2', '3', '4', '5', '6', '7',
                    '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
            };
    char[] array = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
    return new String(array);
}
public static String charToHex(char c)
{
    //Returns hex String representation of char c
    byte hi = (byte) (c >>> 8);
    byte lo = (byte) (c & 0xff);

    return byteToHex(hi) + byteToHex(lo);
}
Author: Iain

Updated on August 04, 2022

Comments

  • Iain
    Iain almost 2 years

    Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?

    The array may be generated by code similar to:

    byte[] utf8 = "Hello World".getBytes("UTF-8");
    

    Alternatively it may have been generated by code similar to:

    byte[] messageContent = new byte[256];
    for (int i = 0; i < messageContent.length; i++) {
        messageContent[i] = (byte) i;
    }
    

    The key point is that we don't know what the array contains but need to find out in order to fill in the following function:

    public final String getString(final byte[] dataToProcess) {
        // Determine whether dataToProcess contains arbitrary data or a UTF-8 encoded string
        // If dataToProcess contains arbitrary data then we will BASE64 encode it and return.
        // If dataToProcess contains an encoded string then we will decode it and return.
    }
    

    How would this be extended to also cover UTF-16 or other encoding mechanisms?

  • Daniel Fortunov
    Daniel Fortunov almost 15 years
    -1: Factual error. It is possible for a non-textual binary stream to be decoded as a valid UTF-8 string. If the UTF-8 decoding fails, that implies that your binary data is not UTF-8; but if the UTF-8 decoding doesn't fail, that does not guarantee that the binary data is UTF-8.
  • McDowell
    McDowell almost 15 years
    Java does not insert a BOM automatically and will not remove it on decode.
  • Michael Borgwardt
    Michael Borgwardt almost 15 years
    +1 Absolutely correct. If it decodes without error, it's valid UTF-8 textual data. It may be textual data that makes absolutely no sense, such as a wild mix of Latin, Chinese, Thai and Greek characters, but that's a semantic distinction, not a technical one.
  • McDowell
    McDowell almost 15 years
    Erk, I should say that Java does not handle BOMs for UTF-8. Whether it does or not for UTF-16/UTF-32 depends on the encoding mechanism chosen: java.sun.com/javase/6/docs/technotes/guides/intl/…
  • Daniel Fortunov
    Daniel Fortunov almost 15 years
    Fair point Michael. I guess in that case I should have said: -1 Not answering the question. Asserting that it is a valid UTF-8 string is not answering the question, which was trying to find out if it was a string or binary data. Just because it is a valid UTF-8 representation doesn't tell you much about whether the original data is binary (which just happens to be valid UTF-8 by coincidence) or whether the original was genuine textual data.
  • james.garriss
    james.garriss almost 9 years
    "what approaches can be used in Java"