Checking whether UTF-8 data is 3-byte or 4-byte Unicode
Solution 1
UTF-8 encodes everything in the basic multilingual plane (i.e. U+0000 to U+FFFF inclusive) in 1-3 bytes. Therefore, you just need to check whether everything in your string is in the BMP.
In Java, that means checking whether any char (which is a UTF-16 code unit) is a high or low surrogate, as Java uses surrogate pairs to encode non-BMP characters:
public static boolean isEntirelyInBasicMultilingualPlane(String text) {
    for (int i = 0; i < text.length(); i++) {
        // Any surrogate code unit means the text has a non-BMP character (or a stray surrogate).
        if (Character.isSurrogate(text.charAt(i))) {
            return false;
        }
    }
    return true;
}
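If you would rather get the byte width itself than a yes/no answer, something along these lines might work. This is only a sketch: the names Utf8WidthCheck, utf8Length and maxUtf8Width are made up for illustration, it assumes Java 8+ for String.codePoints(), and it counts a lone surrogate as 3 bytes even though a lone surrogate is not valid UTF-8.

```java
final class Utf8WidthCheck {
    // Number of bytes UTF-8 needs for one Unicode code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;      // ASCII
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;   // rest of the BMP
        return 4;                            // supplementary planes
    }

    // Widest UTF-8 sequence needed by any code point in the text (0 for an empty string).
    // Assumption: lone surrogates are reported as 3 bytes rather than rejected.
    static int maxUtf8Width(String text) {
        return text.codePoints().map(Utf8WidthCheck::utf8Length).max().orElse(0);
    }

    public static void main(String[] args) {
        System.out.println(maxUtf8Width("abc"));            // prints 1
        System.out.println(maxUtf8Width("\u20AC"));         // euro sign, prints 3
        System.out.println(maxUtf8Width("\uD83D\uDE00"));   // U+1F600, prints 4
    }
}
```

A result of 4 means the string contains at least one character that a 3-byte utf8 column in MySQL cannot hold.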
Solution 2
If you do not want to support characters beyond the BMP, you can just strip them before handing the string to MySQL:
public static String withNonBmpStripped(String input) {
    if (input == null) throw new IllegalArgumentException("input");
    // Java regexes match whole code points, so this removes complete surrogate pairs.
    return input.replaceAll("[^\\u0000-\\uFFFF]", "");
}
If you want to support characters beyond the BMP, you need MySQL 5.5+ and you need to change everything that's utf8 to utf8mb4 (collations, charsets ...). You also need support for this in the driver, which I am not familiar with. Handling these characters in Java is also a pain because each one is spread over 2 chars and thus needs special handling in many operations.
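To illustrate the kind of special handling meant here, below is a rough sketch of truncating a string without cutting a surrogate pair in half. The class and method names (CodePointTruncate, truncateCodePoints) are hypothetical; the String methods used (codePointCount, offsetByCodePoints) are standard JDK.

```java
final class CodePointTruncate {
    // Keeps at most maxCodePoints code points, never splitting a surrogate pair.
    static String truncateCodePoints(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // Translate a code point count into a char index.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";                           // 'a', U+1F600, 'b'
        System.out.println(s.length());                        // prints 4 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));   // prints 3 (code points)
        // Naive s.substring(0, 2) would end on a dangling high surrogate.
        System.out.println(truncateCodePoints(s, 2).length()); // prints 3 ('a' plus the full pair)
    }
}
```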
Solution 3
The best approach I found for getting rid of non-BMP characters in Java is to replace each one with U+FFFD (the replacement character):
inputString.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
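For reference, here is a small sketch of how that replacement behaves, plus a variant that also replaces unpaired surrogates (which the regex above leaves alone, since a lone surrogate is itself a code point in the U+0000 to U+FFFF range). The class name NonBmpReplace, both method names, and the second pattern are my own assumptions, not something from the answers above.

```java
import java.util.regex.Pattern;

final class NonBmpReplace {
    // Same regex as above: any code point outside U+0000..U+FFFF.
    private static final Pattern NON_BMP = Pattern.compile("[^\\u0000-\\uFFFF]");

    // Assumed variant: additionally matches surrogates that are not part of a valid pair.
    private static final Pattern NON_BMP_OR_LONE_SURROGATE =
            Pattern.compile("[^\\u0000-\\uFFFF]|[\\uD800-\\uDFFF]");

    static String replaceNonBmp(String input) {
        return NON_BMP.matcher(input).replaceAll("\uFFFD");
    }

    static String replaceNonBmpAndLoneSurrogates(String input) {
        return NON_BMP_OR_LONE_SURROGATE.matcher(input).replaceAll("\uFFFD");
    }

    public static void main(String[] args) {
        String emoji = "a\uD83D\uDE00b";   // contains U+1F600
        String broken = "a\uD83Db";        // lone high surrogate
        System.out.println(replaceNonBmp(emoji).equals("a\uFFFDb"));                   // prints true
        System.out.println(replaceNonBmp(broken).equals(broken));                      // prints true (unchanged)
        System.out.println(replaceNonBmpAndLoneSurrogates(broken).equals("a\uFFFDb")); // prints true
    }
}
```

Precompiling the Pattern avoids recompiling the regex on every call, which String.replaceAll would otherwise do.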
akuzma
Updated on July 27, 2022
Comments
-
akuzma over 1 year
In my database I get the error
com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column
I use Java and MySQL 5. As far as I know, 4-byte Unicode is legal in Java but illegal in MySQL 5. I think that may be causing my problem, so I want to check what kind of data I have. Here's my question: how can I check whether my UTF-8 data is 3-byte or 4-byte Unicode?
-
verglor over 10 years
This actually doesn't work well because regexps are evaluated at the level of code points, not code units. You need to match chars outside the range \u0000-\uFFFF (see my answer).
-
Esailija over 10 years
@jako512 That is surprising, since everything else deals with code units :I I have edited it to work with full non-BMP characters, but the intent behind the original version was to remove unpaired surrogates as well.
-
DOOManiac over 10 years
Note that the regex may need slight tweaking for your language. For PHP, use
preg_replace('/[^\x{0000}-\x{FFFF}]/u', "\xEF\xBF\xBD", $input);
-
nrc about 3 years
\uF000 - \uFFFF UTF-8 sequences are still accepted by your regex, but are only used to compose 4-byte chars. So I use the smaller range \u0000 - \uEFFF to remove all 4-byte chars.