Check to see if a string is encoded as UTF-8

php string encoding utf-8

13,224

Solution 1

I use two ways to check whether a string is UTF-8 (depending on the case):

mb_internal_encoding('UTF-8'); // always needed before mb_ functions, check note below
if (mb_strlen($string) != strlen($string)) {
 // contains at least one multi-byte character
}

-- OR --

if (preg_match('!\S!u', $string)) {
 // valid UTF-8 containing at least one non-whitespace character
}

About mb_internal_encoding: because of a PHP bug I can't explain (in versions before 5.3; I haven't tested 5.3 itself), passing the encoding as a parameter to the mb_ functions doesn't work, so the internal encoding has to be set before any mb_ function is used.
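As a rough sketch of how the regex check behaves, the snippet below wraps it in a helper and runs it on a few sample strings. The helper name `is_valid_utf8_pcre` and the sample values are made up for illustration. It uses the empty pattern `//u` instead of `\S`, a variation that also accepts empty and whitespace-only strings; it relies on `preg_match` returning false when the subject is not valid UTF-8 under the `u` modifier.

```php
<?php
// Hypothetical helper: true when $s is valid UTF-8 (empty strings included).
// With the /u modifier, preg_match() returns false for an invalid UTF-8 subject.
function is_valid_utf8_pcre(string $s): bool {
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8_pcre("caf\xC3\xA9")); // bool(true):  "é" encoded as two bytes
var_dump(is_valid_utf8_pcre("caf\xE9"));     // bool(false): Latin-1 "é", truncated sequence
var_dump(is_valid_utf8_pcre("   "));         // bool(true):  whitespace is valid UTF-8
var_dump(preg_match('!\S!u', "   "));        // int(0):      the \S pattern finds no match here
```

The last line shows why the `\S` version reports "not UTF-8" for a whitespace-only string even though the bytes are valid.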

Solution 2

That algorithm is basically checking if the byte sequence conforms to the pattern that you can see in the Wikipedia article.

The for loop goes through all bytes in $str. ord returns the numeric value (0-255) of the current byte. That number is then tested for some properties.

If the number is less than 128 (0x80), it’s a single-byte character. If it’s equal to or larger than 128, the length of the multi-byte character is checked. That can be done with the first byte of a multi-byte character sequence. If the first byte begins with 110xxxxx, it’s a two-byte character; 1110xxxx, it’s a three-byte character, etc.
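The classification of the first byte can be sketched on its own. The helper name `utf8_sequence_length` is hypothetical and not part of the function under discussion. It uses the same masks but stops at four bytes, the limit in the current UTF-8 definition (RFC 3629), whereas the function being explained also accepts the older five- and six-byte forms.

```php
<?php
// Hypothetical helper: given the first byte of a character, return the total
// number of bytes in the sequence, or 0 if the byte cannot start a sequence.
function utf8_sequence_length(int $byte): int {
    if ($byte < 0x80) return 1;             // 0xxxxxxx: plain ASCII
    if (($byte & 0xE0) === 0xC0) return 2;  // 110xxxxx
    if (($byte & 0xF0) === 0xE0) return 3;  // 1110xxxx
    if (($byte & 0xF8) === 0xF0) return 4;  // 11110xxx
    return 0;                               // continuation byte or unsupported lead byte
}

echo utf8_sequence_length(ord('A')), "\n";    // 1
echo utf8_sequence_length(ord("\xC3")), "\n"; // 2
echo utf8_sequence_length(ord("\xE2")), "\n"; // 3
echo utf8_sequence_length(ord("\xF0")), "\n"; // 4
echo utf8_sequence_length(0x80), "\n";        // 0 (10000000 is a continuation byte)
```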

I think the most cryptic parts are the expressions like ($c & 0xE0) == 0xC0. They check whether the number, written in binary, matches a specific pattern. I’ll try to explain how that works using this example.

Since all numbers we test for that pattern are equal to or greater than 0x80, the first bit is always 1, so the pattern is restricted to at least 1xxxxxxx. If we then do a bitwise AND with 11100000 (0xE0), we get this result:

  1xxxxxxx
& 11100000
= 1xx00000

So the bits at positions 5 and 6 (counted from the right, starting at 0) depend on what our current number is. For the result to equal 11000000, bit 5 must be 0 and bit 6 must be 1:

  1xxxxxxx
& 11100000
≟ 11000000
   ↓↓
→ 110xxxxx

That means the other bits of our number can be arbitrary: 110xxxxx. And that’s exactly what the pattern in the Wikipedia article predicts for the first byte of a two-byte character.
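To see the mask at work on concrete values, here is a small sketch. The helper name `is_two_byte_lead` and the sample bytes are chosen for illustration only; 0xC3 is the first byte of "é" in UTF-8 and 0xE2 is the first byte of "€".

```php
<?php
// Hypothetical helper: true when $c has the form 110xxxxx.
function is_two_byte_lead(int $c): bool {
    return ($c & 0xE0) === 0xC0;
}

printf("%08b\n", 0xC3);            // 11000011
printf("%08b\n", 0xC3 & 0xE0);     // 11000000, equal to 0xC0
var_dump(is_two_byte_lead(0xC3));  // bool(true)

printf("%08b\n", 0xE2);            // 11100010
printf("%08b\n", 0xE2 & 0xE0);     // 11100000, not equal to 0xC0
var_dump(is_two_byte_lead(0xE2));  // bool(false)
```

The AND keeps only the top three bits, so the comparison ignores whatever the lower five bits contain.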

Finally, the inner for loop checks the bytes that follow the first byte of a multi-byte character. Each of them must have the form 10xxxxxx.
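The same masking idea applies to the following bytes, this time keeping only the top two bits. A minimal sketch, with a hypothetical helper name `is_continuation_byte` and "€" (bytes E2 82 AC) as the sample character:

```php
<?php
// Hypothetical helper: true when $byte has the form 10xxxxxx.
function is_continuation_byte(int $byte): bool {
    return ($byte & 0xC0) === 0x80;
}

$euro = "\xE2\x82\xAC"; // "€" in UTF-8: one lead byte, two continuation bytes

var_dump(is_continuation_byte(ord($euro[1]))); // bool(true):  0x82 = 10000010
var_dump(is_continuation_byte(ord($euro[2]))); // bool(true):  0xAC = 10101100
var_dump(is_continuation_byte(ord($euro[0]))); // bool(false): 0xE2 = 11100010
var_dump(is_continuation_byte(ord('A')));      // bool(false): 0x41 = 01000001
```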

Solution 3

If you know a little about UTF-8 it's a pretty simple implementation.

function seems_utf8($str) {
 # get length, for utf8 this means bytes and not characters
 $length = strlen($str);  

 # we need to check each byte in the string
 for ($i=0; $i < $length; $i++) {

  # get the byte code 0-255 of the i-th byte
  $c = ord($str[$i]);

  # utf8 characters can take 1-6 bytes in this scheme; how many
  # exactly is encoded in the first byte if
  # it has a byte code >= 128 (highest bit set).
  # For all <= 127 the ASCII is the same as UTF8.
  # The number of bytes per character is stored in 
  # the highest bits of the first byte of the UTF8 
  # character. The bit patterns that must be matched
  # for the different lengths are shown as comments.
  # Note: current UTF-8 (RFC 3629) allows at most 4 bytes;
  # the 5- and 6-byte forms come from the older definition.
  #
  # So $n will hold the number of additional bytes

  if ($c < 0x80) $n = 0; # 0bbbbbbb
  elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
  elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
  elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
  elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
  elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
  else return false; # Does not match any model

  # the code now checks the following additional bytes
  # The first expression checks that the byte is really inside the
  # string and not running over the string end.
  # The second expression checks that the highest two bits of each
  # additional byte are 1 and 0 (masked value 0x80),
  # which is a requirement for all additional UTF-8 bytes

  for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
   if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
    return false;
  }
 }
 return true;
}

By the way, I assume that in PHP this is a factor of 50-100 slower than a C function, so you shouldn't really use it on long strings or in production systems.
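If speed is the concern, one option is to hand the check to code that is implemented in C. The sketch below is only an illustration: the wrapper name `is_utf8_native` is hypothetical, while `mb_check_encoding` (mbstring extension) and `preg_match` are standard PHP functions. Both follow the current UTF-8 definition, so they are expected to be stricter than seems_utf8, which accepts the older five- and six-byte forms.

```php
<?php
// Hypothetical wrapper: prefer mbstring's native check, fall back to PCRE
// when the mbstring extension is not loaded.
function is_utf8_native(string $s): bool {
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($s, 'UTF-8');
    }
    // With the /u modifier, preg_match() returns false for invalid UTF-8 subjects.
    return preg_match('//u', $s) === 1;
}

var_dump(is_utf8_native("na\xC3\xAFve")); // bool(true):  "ï" as a two-byte sequence
var_dump(is_utf8_native("na\xEFve"));     // bool(false): lead byte without continuation bytes
```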

Author by

Bojan Muvrin

Updated on July 03, 2022

Comments

Bojan Muvrin almost 2 years

function seems_utf8($str) {
 $length = strlen($str);
 for ($i=0; $i < $length; $i++) {
  $c = ord($str[$i]);
  if ($c < 0x80) $n = 0; # 0bbbbbbb
  elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
  elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
  elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
  elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
  elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
  else return false; # Does not match any model
  for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
   if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
    return false;
  }
 }
 return true;
}

I got this code from WordPress. I don't know much about this, but I would like to know what exactly is happening in that function.

If anyone knows, please help me out.

I need a clear idea of what the above code does. A line-by-line explanation would be most helpful.