Checking whether UTF-8 data is 3-byte or 4-byte Unicode

java mysql unicode utf-8 character-encoding

11,237

Solution 1

UTF-8 encodes everything in the Basic Multilingual Plane (BMP, i.e. U+0000 to U+FFFF inclusive) in 1-3 bytes; only code points above U+FFFF need 4 bytes. Therefore, you just need to check whether everything in your string is in the BMP.

In Java, that means checking whether any char (which is a UTF-16 code unit) is a high or low surrogate character, as Java will use surrogate pairs to encode non-BMP characters:

public static boolean isEntirelyInBasicMultilingualPlane(String text) {
    for (int i = 0; i < text.length(); i++) {
        if (Character.isSurrogate(text.charAt(i))) {
            return false;
        }
    }
    return true;
}
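
If you also want to know where the offending character is, or how many UTF-8 bytes a given code point needs, something along these lines may help. This is only a sketch: the class and method names (Utf8Width, utf8Length, indexOfFirstFourByteCodePoint) are made up for illustration, and it assumes the string contains no unpaired surrogates.

```java
final class Utf8Width {
    /** Number of UTF-8 bytes needed to encode one Unicode code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;      // ASCII
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;   // rest of the BMP
        return 4;                            // supplementary planes
    }

    /**
     * Returns the char index of the first code point that needs 4 bytes
     * in UTF-8, or -1 if the whole string fits in 1-3 bytes per code point.
     * Assumption: unpaired surrogates are not treated specially here.
     */
    static int indexOfFirstFourByteCodePoint(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return i;
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for a surrogate pair
        }
        return -1;
    }

    public static void main(String[] args) {
        String sample = "a\uD83D\uDE00"; // 'a' followed by U+1F600
        System.out.println(indexOfFirstFourByteCodePoint(sample));
    }
}
```

Logging the returned index next to the failing row can make it easier to see which input triggered the MySQL error.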

Solution 2

If you do not need to support characters beyond the BMP, you can just strip them before handing the string to MySQL:

public static String withNonBmpStripped( String input ) {
    if( input == null ) throw new IllegalArgumentException("input");
    return input.replaceAll("[^\\u0000-\\uFFFF]", "");
}

If you want to support characters beyond the BMP, you need MySQL 5.5+ and you need to change everything that's utf8 to utf8mb4 (collations, charsets, ...). You also need support for this in the JDBC driver, which I am not familiar with. Handling these characters in Java is also a pain because each one is spread over 2 chars (a surrogate pair) and thus needs special handling in many operations.
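
To illustrate the kind of special handling meant here, below is a rough sketch of truncating a string by code points instead of chars, so a surrogate pair is never cut in half. The class and method names (CodePointDemo, truncateByCodePoints) are hypothetical, and maxCodePoints is assumed to be non-negative.

```java
final class CodePointDemo {
    /**
     * Returns at most maxCodePoints code points from the start of text,
     * never splitting a surrogate pair.
     * Assumption: maxCodePoints >= 0.
     */
    static String truncateByCodePoints(String text, int maxCodePoints) {
        int count = text.codePointCount(0, text.length());
        if (count <= maxCodePoints) {
            return text;
        }
        // Translate "N code points from the start" into a char index.
        int end = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(s.length());                      // counts chars
        System.out.println(s.codePointCount(0, s.length())); // counts code points
        System.out.println(truncateByCodePoints(s, 2).length());
    }
}
```

A plain s.substring(0, 2) on the same input would keep only the high surrogate of the pair, which is the sort of bug that makes non-BMP text awkward to work with.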

Solution 3

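The best approach I have found for getting rid of non-BMP characters in Java is the following, which replaces each of them with U+FFFD (the replacement character) rather than deleting it: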

inputString.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
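
The regex leaves unpaired surrogates alone, since they fall inside \u0000-\uFFFF. If you would rather replace those too, a loop over code points is one option. This is just a sketch with invented names (BmpSanitizer, replaceNonBmp), and treating unpaired surrogates as invalid is my own assumption, not something MySQL requires.

```java
final class BmpSanitizer {
    /**
     * Replaces every supplementary code point, and (by assumption) every
     * unpaired surrogate, with U+FFFD. Everything else is copied as-is.
     */
    static String replaceNonBmp(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            // codePointAt returns the surrogate value itself when it is unpaired.
            boolean loneSurrogate =
                    cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
            if (Character.isSupplementaryCodePoint(cp) || loneSurrogate) {
                out.append('\uFFFD');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceNonBmp("a\uD83D\uDE00b");
        System.out.println(cleaned.length());
    }
}
```

Either way, remember that Java strings are immutable, so the result has to be assigned to a variable; calling the method (or replaceAll) without using the return value changes nothing.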


Author by

akuzma

Updated on July 27, 2022

Comments

akuzma over 1 year
In my database I get the error
```
com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column
```
I use Java and MySQL 5. As far as I know, 4-byte Unicode is legal in Java but illegal in MySQL 5. I think that may be causing my problem, and I want to check the type of my data, so here's my question: how can I check whether my UTF-8 data is 3-byte or 4-byte Unicode?
verglor over 10 years

This actually doesn't work well because regexps are evaluated at the level of codepoints, not codeunits. You need to match chars outside range \u0000-\uFFFF (see my answer).
Esailija over 10 years

@jako512 That is surprising since everything else deals with code units :I I have edited it to work with full non-BMP characters, but the intent behind the original version was to remove unpaired surrogates as well
DOOManiac over 10 years

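Note that the regex may need slight tweaks for your language. For PHP, use preg_replace('/[^\x{0000}-\x{FFFF}]/u', "\xEF\xBF\xBD", $input); where the double-quoted replacement is the UTF-8 encoding of U+FFFD (a single-quoted '\x{FFFD}' would be inserted literally).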
nrc about 3 years

\uF000 - \uFFFF utf8 sequences are still accepted by your regex, but are only used to compose 4 byte chars. So I use the smaller range \u0000 - \uEFFF to remove all 4 byte chars.