Check to see if a string is encoded as UTF-8
Solution 1
I use two ways to check whether a string is UTF-8, depending on the case:
mb_internal_encoding('UTF-8'); // always needed before mb_ functions, check note below
if (mb_strlen($string) != strlen($string)) {
    // not single byte: the string contains multi-byte characters
}
-- OR --
if (preg_match('!\S!u', $string)) {
    // valid UTF-8 with at least one non-whitespace character
}
About mb_internal_encoding: because of a PHP bug I have not tracked down (in versions before 5.3; I haven't tested 5.3 itself), passing the encoding as a parameter to the mb_ functions doesn't work, so the internal encoding has to be set before any mb_ function is used.
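As a variation on the second check, an empty pattern with the u modifier validates the whole string without requiring a non-whitespace character. This is only a minimal sketch; the helper name is_valid_utf8_pcre is made up for illustration and is not part of the original answer:

```php
<?php
// Hypothetical helper (the name is an assumption, not from the answer).
// With the u modifier, PCRE first validates the subject as UTF-8.
// preg_match() returns false for invalid input and 1 when the empty
// pattern matches, which it does for any valid string.
function is_valid_utf8_pcre(string $string): bool
{
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8_pcre("caf\xC3\xA9")); // well-formed two-byte sequence
var_dump(is_valid_utf8_pcre("caf\xE9"));     // lone Latin-1 byte
var_dump(is_valid_utf8_pcre(''));            // empty string
```

Unlike the '!\S!u' pattern, this version also reports empty and whitespace-only strings as valid.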
Solution 2
That algorithm is basically checking if the byte sequence conforms to the pattern that you can see in the Wikipedia article.
The for loop goes through all bytes in $str. ord gets the decimal value of the current byte. That number is then tested for certain properties.
If the number is less than 128 (0x80), it’s a single-byte character. If it’s equal to or larger than 128, the length of the multi-byte character is checked. That can be done with the first byte of a multi-byte character sequence. If the first byte begins with 110xxxxx, it’s a two-byte character; 1110xxxx, it’s a three-byte character, etc.
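The classification by first byte described above could be sketched as a small lookup. The function name utf8_sequence_length is hypothetical, and the sketch only covers the one- to four-byte forms:

```php
<?php
// Hypothetical helper: number of bytes in the sequence that starts with
// lead byte $c (0-255), or 0 if $c cannot start a sequence here.
function utf8_sequence_length(int $c): int
{
    if ($c < 0x80)            return 1; // 0xxxxxxx
    if (($c & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($c & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($c & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0;                           // 10xxxxxx or an unsupported lead byte
}

var_dump(utf8_sequence_length(ord('A')));    // int(1)
var_dump(utf8_sequence_length(ord("\xC3"))); // int(2)
var_dump(utf8_sequence_length(ord("\xE2"))); // int(3)
var_dump(utf8_sequence_length(ord("\xF0"))); // int(4)
```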
I think the most cryptic parts are the expressions like ($c & 0xE0) == 0xC0. They check whether the number, written in binary, matches a specific pattern. I’ll try to explain how that works using this same expression.
Since all numbers we test for that pattern are equal to or greater than 0x80, the first bit is always 1, so the pattern is restricted to at least 1xxxxxxx. If we then do a bit-wise AND with 11100000 (0xE0), we get this result:
  1xxxxxxx
& 11100000
= 1xx00000
So the bits at positions 5 and 6 (counted from the right, with the index starting at 0) depend on what our current number is. For the result to equal 11000000, the 5th bit must be 0 and the 6th bit must be 1:
  1xxxxxxx
& 11100000
≟ 11000000
   ↓↓
→ 110xxxxx
That means the other bits of our number can be arbitrary: 110xxxxx. And that’s exactly the pattern the Wikipedia article gives for the first byte of a two-byte character.
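To make the masking concrete, here is a small sketch with two real lead bytes. The sample bytes are my own choice for illustration: 0xC3 starts the two-byte "é" and 0xE2 starts the three-byte "€".

```php
<?php
$c = 0xC3; // first byte of "é" in UTF-8 (0xC3 0xA9)
printf("%08b\n", $c);          // 11000011
printf("%08b\n", $c & 0xE0);   // 11000000
var_dump(($c & 0xE0) == 0xC0); // bool(true): two-byte lead byte

$c = 0xE2; // first byte of "€" in UTF-8 (0xE2 0x82 0xAC)
printf("%08b\n", $c & 0xE0);   // 11100000
var_dump(($c & 0xE0) == 0xC0); // bool(false): bit 5 is set
```

The second byte fails the test because its third-highest bit is 1, which is the case the next branch, ($c & 0xF0) == 0xE0, picks up.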
Finally, the inner for loop checks the validity of the following bytes of a multi-byte character. Those must all begin with 10xxxxxx.
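The continuation-byte test uses the same masking idea. A minimal sketch, with a hypothetical helper name and "€" as an arbitrary sample character:

```php
<?php
// Hypothetical helper: true if byte value $c (0-255) has the form 10xxxxxx.
function is_continuation_byte(int $c): bool
{
    // 0xC0 = 11000000 keeps the top two bits; they must equal 10000000 (0x80).
    return ($c & 0xC0) === 0x80;
}

$euro = "\xE2\x82\xAC"; // "€": one lead byte followed by two continuation bytes
var_dump(is_continuation_byte(ord($euro[1]))); // bool(true),  0x82 = 10000010
var_dump(is_continuation_byte(ord($euro[2]))); // bool(true),  0xAC = 10101100
var_dump(is_continuation_byte(ord('A')));      // bool(false), 0x41 = 01000001
```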
Solution 3
If you know a little about UTF-8 it's a pretty simple implementation.
function seems_utf8($str) {
    # get length, for utf8 this means bytes and not characters
    $length = strlen($str);
    # we need to check each byte in the string
    for ($i=0; $i < $length; $i++) {
        # get the byte code 0-255 of the i-th byte
        $c = ord($str[$i]);
        # In the original design, utf8 characters can take 1-6 bytes
        # (RFC 3629 later restricted this to 4). How many exactly
        # is encoded in the first byte if it has a
        # character code >= 128 (highest bit set).
        # For all <= 127 the ASCII is the same as UTF8.
        # The number of bytes per character is stored in
        # the highest bits of the first byte of the UTF8
        # character. The bit patterns that must be matched
        # for the different lengths are shown as comments.
        #
        # So $n will hold the number of additional bytes
        if ($c < 0x80) $n = 0; # 0bbbbbbb
        elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
        elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
        elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
        elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
        elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
        else return false; # Does not match any model
        # the code now checks the following additional bytes
        # The first expression checks that the byte is really inside the
        # string and not running over the string end.
        # The second expression checks that the highest two bits of all
        # additional bytes are always 1 and 0 (masked value 0x80),
        # which is a requirement for all additional UTF-8 bytes
        for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}
By the way, I assume that in PHP this is 50-100 times slower than an equivalent C function, so you shouldn't really use it on long strings or in production systems.
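If speed matters, one option is to hand the check to a function implemented in C. The sketch below assumes the mbstring extension is loaded and wraps mb_check_encoding() in a helper whose name, is_utf8_mb, is made up for illustration:

```php
<?php
// Assumes the mbstring extension is available. The wrapper name is hypothetical.
function is_utf8_mb(string $string): bool
{
    // The encoding is passed explicitly, so on current PHP versions the
    // result does not depend on mb_internal_encoding().
    return mb_check_encoding($string, 'UTF-8');
}

var_dump(is_utf8_mb("caf\xC3\xA9")); // well-formed
var_dump(is_utf8_mb("caf\xE9"));     // 0xE9 is a lead byte with nothing after it
var_dump(is_utf8_mb("\xE2\x82"));    // truncated three-byte sequence
```

I have not benchmarked this against seems_utf8, so treat the speed advantage as an expectation rather than a measurement.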
Comments
-
Bojan Muvrin almost 2 years
function seems_utf8($str) {
    $length = strlen($str);
    for ($i=0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) $n = 0; # 0bbbbbbb
        elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
        elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
        elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
        elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
        elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
        else return false; # Does not match any model
        for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}
I got this code from WordPress. I don't know much about this, but I would like to know what exactly is happening in that function.
If anyone knows, please help me out.
I need a clear idea of the code above. A line-by-line explanation would be most helpful.
-
Seiji Manoan almost 9 years
So just do mb_strlen($string, 'UTF-8') then.