Checking whether UTF-8 data is 3-byte or 4-byte Unicode

Solution 1

UTF-8 encodes every code point in the Basic Multilingual Plane (BMP, i.e. U+0000 to U+FFFF inclusive) in 1 to 3 bytes; only code points above U+FFFF take 4 bytes. Therefore, you just need to check whether every character in your string is in the BMP.
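
To make the byte counts concrete, here is a minimal sketch that encodes a few sample strings as UTF-8 and reports their lengths. The class name `Utf8ByteLengthDemo` and the helper `utf8Length` are made up for this illustration; only `String.getBytes` and `StandardCharsets.UTF_8` come from the JDK.

```java
import java.nio.charset.StandardCharsets;

final class Utf8ByteLengthDemo {
    // Number of bytes the given string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // U+0041, BMP: 1 byte
        System.out.println(utf8Length("\u00E9"));       // U+00E9, BMP: 2 bytes
        System.out.println(utf8Length("\u20AC"));       // U+20AC, BMP: 3 bytes
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, outside the BMP: 4 bytes
    }
}
```

The last sample is written as a surrogate pair because a Java `char` cannot hold a code point above U+FFFF on its own.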

In Java, that means checking whether any char (which is a UTF-16 code unit) is a high or low surrogate character, as Java will use surrogate pairs to encode non-BMP characters:

public static boolean isEntirelyInBasicMultilingualPlane(String text) {
    for (int i = 0; i < text.length(); i++) {
        if (Character.isSurrogate(text.charAt(i))) {
            return false;
        }
    }
    return true;
}
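
If you also want to know where the offending character is (for example, to log the value that MySQL rejects), a small variation of the same loop can return its position. This is only a sketch; `NonBmpLocator` and `firstNonBmpIndex` are hypothetical names, not part of any library.

```java
final class NonBmpLocator {
    // Returns the index of the first UTF-16 code unit that is a surrogate
    // (i.e. part of a non-BMP character, or an unpaired surrogate),
    // or -1 if the whole string is in the BMP.
    static int firstNonBmpIndex(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (Character.isSurrogate(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstNonBmpIndex("abc"));
        System.out.println(firstNonBmpIndex("ab\uD83D\uDE00"));
    }
}
```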

Solution 2

If you do not want to support characters beyond the BMP, you can just strip them before handing the string to MySQL:

public static String withNonBmpStripped( String input ) {
    if( input == null ) throw new IllegalArgumentException("input");
    return input.replaceAll("[^\\u0000-\\uFFFF]", "");
}

If you want to support characters beyond the BMP, you need MySQL 5.5+ and you need to change everything that's utf8 to utf8mb4 (collations, charsets, ...). You also need support for this in the driver, which I am not familiar with. Handling these characters in Java is also a pain because each one is spread over 2 chars (a surrogate pair) and thus needs special handling in many operations.
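
As an illustration of that special handling, the sketch below shows how `length()` and `substring()` count UTF-16 code units, while `codePointCount()` and `offsetByCodePoints()` work on whole characters. The class `CodePointDemo` and the helper `truncateByCodePoints` are hypothetical names chosen for this example, and the helper assumes `maxCodePoints` is not negative.

```java
final class CodePointDemo {
    // Truncates to at most maxCodePoints code points without splitting a
    // surrogate pair. Assumes maxCodePoints >= 0.
    static String truncateByCodePoints(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // offsetByCodePoints converts a code point count into a char index.
        return s.substring(0, s.offsetByCodePoints(0, maxCodePoints));
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(s.length());                      // counts chars (code units)
        System.out.println(s.codePointCount(0, s.length())); // counts code points
        // A naive s.substring(0, 2) would cut the surrogate pair in half;
        // the helper keeps the pair intact.
        System.out.println(truncateByCodePoints(s, 2).length());
    }
}
```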

Solution 3

The best approach I found to strip non-BMP characters in Java is the following, which replaces each of them with U+FFFD (the replacement character):

inputString.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
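
For comparison, here is a regex-free sketch of the same idea that walks the string code point by code point. It goes one step further than the regex, as an assumption about what you may want: it also replaces unpaired surrogates, which the regex leaves untouched because they fall inside U+0000 to U+FFFF. `NonBmpReplacer` and `replaceNonBmp` are hypothetical names.

```java
final class NonBmpReplacer {
    // Replaces every code point above U+FFFF, and every unpaired surrogate,
    // with U+FFFD. Everything else is copied unchanged.
    static String replaceNonBmp(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
            boolean unpairedSurrogate =
                    Character.isBmpCodePoint(cp) && Character.isSurrogate((char) cp);
            if (Character.isSupplementaryCodePoint(cp) || unpairedSurrogate) {
                out.append('\uFFFD');
            } else {
                out.append((char) cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceNonBmp("a\uD83D\uDE00b");
        System.out.println(cleaned.length());
    }
}
```

Either way, remember that Java strings are immutable, so the result has to be assigned to a variable or returned; calling the method alone does not change the original string.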
Author by

akuzma

Updated on July 27, 2022

Comments

  • akuzma
    akuzma over 1 year

    In my database I get the error

    com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column
    

    I use Java and MySQL 5. As far as I know, 4-byte Unicode is legal in Java but illegal in MySQL 5. I think this may be causing my problem, so I want to check my data. Here's my question: how can I check whether my UTF-8 data is 3-byte or 4-byte Unicode?

  • verglor
    verglor over 10 years
    This actually doesn't work well because regexps are evaluated at the level of codepoints, not codeunits. You need to match chars outside range \u0000-\uFFFF (see my answer).
  • Esailija
    Esailija over 10 years
    @jako512 That is surprising since everything else deals with code units :I I have edited it to work with full non-BMP characters, but the intent behind the original version was to remove unpaired surrogates as well
  • DOOManiac
    DOOManiac over 10 years
    Note that the REGEX may be slightly tweaked for your language. For PHP, use preg_replace('/[^\x{0000}-\x{FFFF}]/u', '\x{FFFD}', $input);
  • nrc
    nrc about 3 years
    \uF000 - \uFFFF utf8 sequences are still accepted by your regex, but are only used to compose 4 byte chars. So I use the smaller range \u0000 - \uEFFF to remove all 4 byte chars.