Getting true UTF-8 characters in Java JNI


This is clearly explained in the Java documentation:

JNI Functions

GetStringUTFChars

const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);

Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding. This array is valid until it is released by ReleaseStringUTFChars().

Modified UTF-8

The JNI uses modified UTF-8 strings to represent various string types. Modified UTF-8 strings are the same as those used by the Java VM. Modified UTF-8 strings are encoded so that character sequences that contain only non-null ASCII characters can be represented using only one byte per character, but all Unicode characters can be represented.

All characters in the range \u0001 to \u007F are represented by a single byte, as follows:

    | 0 | bits 6-0 |

The seven bits of data in the byte give the value of the character represented.

The null character ('\u0000') and characters in the range '\u0080' to '\u07FF' are represented by a pair of bytes x and y:

    x: | 1 | 1 | 0 | bits 10-6 |
    y: | 1 | 0 | bits 5-0 |

The bytes represent the character with the value ((x & 0x1f) << 6) + (y & 0x3f).

Characters in the range '\u0800' to '\uFFFF' are represented by 3 bytes x, y, and z:

    x: | 1 | 1 | 1 | 0 | bits 15-12 |
    y: | 1 | 0 | bits 11-6 |
    z: | 1 | 0 | bits 5-0 |

The character with the value ((x & 0xf) << 12) + ((y & 0x3f) << 6) + (z & 0x3f) is represented by the bytes.

Characters with code points above U+FFFF (so-called supplementary characters) are represented by separately encoding the two surrogate code units of their UTF-16 representation. Each of the surrogate code units is represented by three bytes. This means supplementary characters are represented by six bytes, u, v, w, x, y, and z:

    u: | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
    v: | 1 | 0 | 1 | 0 | (bits 20-16) - 1 |
    w: | 1 | 0 | bits 15-10 |
    x: | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
    y: | 1 | 0 | 1 | 1 | bits 9-6 |
    z: | 1 | 0 | bits 5-0 |

The character with the value 0x10000 + ((v & 0x0f) << 16) + ((w & 0x3f) << 10) + ((y & 0x0f) << 6) + (z & 0x3f) is represented by the six bytes.

The bytes of multibyte characters are stored in the class file in big-endian (high byte first) order.

There are two differences between this format and the standard UTF-8 format. First, the null character (char)0 is encoded using the two-byte format rather than the one-byte format. This means that modified UTF-8 strings never have embedded nulls. Second, only the one-byte, two-byte, and three-byte formats of standard UTF-8 are used. The Java VM does not recognize the four-byte format of standard UTF-8; it uses its own two-times-three-byte format instead.

For more information regarding the standard UTF-8 format, see section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 4.0.

Since U+1F604 is a supplementary character, and modified UTF-8 does not use standard UTF-8's 4-byte encoding format, U+1F604 is represented in modified UTF-8 by encoding the UTF-16 surrogate pair U+D83D U+DE04 using 3 bytes per surrogate, thus 6 bytes total.
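Applying the 3-byte format above to each surrogate by hand gives exactly those 6 bytes (they match the GetStringUTFChars() output shown in the question below):

    U+D83D = 1101 1000 0011 1101
        x = 1110 1101 = 0xED   (1110 + bits 15-12)
        y = 1010 0000 = 0xA0   (10 + bits 11-6)
        z = 1011 1101 = 0xBD   (10 + bits 5-0)

    U+DE04 = 1101 1110 0000 0100
        x = 1110 1101 = 0xED
        y = 1011 1000 = 0xB8
        z = 1000 0100 = 0x84

    modified UTF-8: ED A0 BD ED B8 84   (standard UTF-8: F0 9F 98 84)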

So, to answer your question...

Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?

You can either:

  1. Use GetStringChars() to get the original UTF-16 encoded characters, and then create your own UTF-8 byte array from that. The conversion from UTF-16 to UTF-8 is a very simple algorithm to implement by hand, or you can use any pre-existing implementation provided by your platform or third-party libraries (a minimal sketch of a hand-rolled conversion appears after the example below).

  2. Have your JNI code call back into Java to invoke the String.getBytes(String charsetName) method to encode the jstring object to a UTF-8 byte array, e.g.:

    JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s)
    {
        // Call String.getBytes("UTF-8") to obtain a true UTF-8 encoding of the string
        const jclass stringClass = env->GetObjectClass(_s);
        const jmethodID getBytes = env->GetMethodID(stringClass, "getBytes", "(Ljava/lang/String;)[B");

        const jstring charsetName = env->NewStringUTF("UTF-8");
        const jbyteArray stringJbytes = (jbyteArray) env->CallObjectMethod(_s, getBytes, charsetName);
        env->DeleteLocalRef(charsetName);
        env->DeleteLocalRef(stringClass);

        const jsize length = env->GetArrayLength(stringJbytes);
        const jbyte* pBytes = env->GetByteArrayElements(stringJbytes, NULL);

        // jbyte is signed, so mask to 0..255 before printing as hex
        for (int i = 0; i < length; ++i)
            fprintf(stderr, "%d: %02x\n", i, pBytes[i] & 0xff);

        // JNI_ABORT: the buffer was not modified, so don't copy it back
        env->ReleaseByteArrayElements(stringJbytes, pBytes, JNI_ABORT);
        env->DeleteLocalRef(stringJbytes);
    }
    

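For option 1, a minimal sketch of a hand-rolled UTF-16-to-UTF-8 conversion that can be paired with GetStringChars() might look like this (the helper name and the U+FFFD replacement policy are illustrative choices, not part of any library):

    #include <cstdint>
    #include <string>
    #include <jni.h>

    // Convert a UTF-16 jchar buffer to standard UTF-8. Unpaired surrogates are
    // replaced with U+FFFD. Illustrative only; not tuned for performance.
    static std::string Utf16ToUtf8(const jchar* src, jsize len)
    {
        std::string out;
        out.reserve(len * 3);
        for (jsize i = 0; i < len; ++i)
        {
            uint32_t cp = src[i];
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len &&
                src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF)
            {
                // Combine a high/low surrogate pair into one supplementary code point
                cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
                ++i;
            }
            else if (cp >= 0xD800 && cp <= 0xDFFF)
            {
                cp = 0xFFFD; // unpaired surrogate
            }

            if (cp < 0x80) {
                out += (char) cp;
            } else if (cp < 0x800) {
                out += (char) (0xC0 | (cp >> 6));
                out += (char) (0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out += (char) (0xE0 | (cp >> 12));
                out += (char) (0x80 | ((cp >> 6) & 0x3F));
                out += (char) (0x80 | (cp & 0x3F));
            } else {
                out += (char) (0xF0 | (cp >> 18));
                out += (char) (0x80 | ((cp >> 12) & 0x3F));
                out += (char) (0x80 | ((cp >> 6) & 0x3F));
                out += (char) (0x80 | (cp & 0x3F));
            }
        }
        return out;
    }

    // Usage inside a JNI method:
    //   const jchar* chars = env->GetStringChars(_s, NULL);
    //   std::string utf8 = Utf16ToUtf8(chars, env->GetStringLength(_s));
    //   env->ReleaseStringChars(_s, chars);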
The Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8

Java's Modified UTF-8 is not exactly the same as CESU-8:

CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).
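To make the distinction concrete, a string consisting of just U+0000 encodes as follows (the bytes follow directly from the rules quoted above):

    standard UTF-8:   00
    CESU-8:           00
    modified UTF-8:   C0 80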

Comments

  • Rolf
    Rolf over 1 year

    Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?

    Unfortunately GetStringUTFChars() almost does what's required, but not quite: it returns a "modified" UTF-8 byte sequence. The main difference is that modified UTF-8 doesn't contain any embedded null characters (so you can treat it as an ANSI C null-terminated string), but another difference seems to be how Unicode supplementary characters such as emoji are treated.

    A character such as U+1F604 "SMILING FACE WITH OPEN MOUTH AND SMILING EYES" is stored as a surrogate pair (two UTF-16 characters U+D83D U+DE04) and has a 4-byte UTF-8 equivalent of F0 9F 98 84, and that is the byte sequence that I get if I convert the string to UTF-8 in Java:

        char[] c = Character.toChars(0x1F604);
        String s = new String(c);
        System.out.println(s);
        for (int i=0; i<c.length; ++i)
            System.out.println("c["+i+"] = 0x"+Integer.toHexString(c[i]));
        byte[] b = s.getBytes("UTF-8");
        for (int i=0; i<b.length; ++i)
            System.out.println("b["+i+"] = 0x"+Integer.toHexString(b[i] & 0xFF));
    

    The code above prints the following:

    😄
    c[0] = 0xd83d
    c[1] = 0xde04
    b[0] = 0xf0
    b[1] = 0x9f
    b[2] = 0x98
    b[3] = 0x84

    However, if I pass 's' into a native JNI method and call GetStringUTFChars() I get 6 bytes. Each of the surrogate pair characters is being converted to a 3-byte sequence independently:

    JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s)
    {
        const char* sBytes = env->GetStringUTFChars(_s, NULL);
        for (int i=0; sBytes[i]!=0; ++i)
            fprintf(stderr, "%d: %02x\n", i, sBytes[i] & 0xff);
        env->ReleaseStringUTFChars(_s, sBytes);
    }
    

    0: ed
    1: a0
    2: bd
    3: ed
    4: b8
    5: 84

    The Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8. That in turn causes my native Mac code to crash because it's not a valid UTF-8 sequence:

    CFStringRef str = CFStringCreateWithCString(NULL, path, kCFStringEncodingUTF8);
    CFURLRef url = CFURLCreateWithFileSystemPath(NULL, str, kCFURLPOSIXPathStyle, false);
    

    I suppose I could change all my JNI methods to take a byte[] rather than a String and do the UTF-8 conversion in Java, but that seems a bit ugly. Is there a better solution?

  • skboro
    skboro about 4 years
    Any pointers on how to achieve the reverse? Converting native char* (say "Hello 😄") to Java string?
  • Remy Lebeau
    Remy Lebeau about 4 years
    @skboro assuming the char* points to true UTF-8 data and not "modified" UTF-8 data, either 1) manually decode the UTF-8 to UTF-16 and then pass that to the JNI NewString() function, or 2) use JNI to copy the char data as-is to a Java byte[] array and then pass that to the String constructor that takes a byte[] and charset name as input, specifying "UTF-8" as the charset. (A minimal sketch of option 2 appears after these comments.)
  • Remy Lebeau
    Remy Lebeau about 4 years
    @skboro If the char* points to "modified" UTF-8 data, then you can simply use the JNI NewStringUTF() function by itself.
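A minimal sketch of option 2 from the comment above, assuming the char* points to true UTF-8 data (the function name is illustrative and error checking is omitted):

    #include <cstring>
    #include <jni.h>

    // Convert a true UTF-8 C string to a jstring by routing it through
    // new String(byte[], "UTF-8"), so supplementary characters survive intact.
    static jstring Utf8ToJString(JNIEnv *env, const char *utf8)
    {
        const jsize length = (jsize) std::strlen(utf8);

        // Copy the raw UTF-8 bytes into a Java byte[]
        jbyteArray bytes = env->NewByteArray(length);
        env->SetByteArrayRegion(bytes, 0, length, (const jbyte*) utf8);

        // Invoke the String(byte[] bytes, String charsetName) constructor
        jclass stringClass = env->FindClass("java/lang/String");
        jmethodID ctor = env->GetMethodID(stringClass, "<init>", "([BLjava/lang/String;)V");
        jstring charsetName = env->NewStringUTF("UTF-8");

        jstring result = (jstring) env->NewObject(stringClass, ctor, bytes, charsetName);

        env->DeleteLocalRef(charsetName);
        env->DeleteLocalRef(stringClass);
        env->DeleteLocalRef(bytes);
        return result;
    }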