How do I find the number of bytes within UTF-8 string with PHP?


Solution 1

"I am asking this as I need to shorten a UTF-8 string to a certain number of bytes."

mb_strcut() does exactly this, though you might not be able to tell from the barely comprehensible documentation.
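
As a rough illustration (the sample string and the 6-byte limit are made up for this sketch), mb_strcut() takes its offset and length in bytes but backs off to the nearest character boundary, whereas substr() cuts wherever the byte count lands:

```php
<?php
// Sample text: 'По' is 4 bytes, the space is 1 byte, 'с' is 2 bytes.
$text = 'По своей природе';

// A 6-byte limit would fall in the middle of 'с'.
// mb_strcut() backs off to the previous character boundary.
$cut = mb_strcut($text, 0, 6, 'UTF-8');

// substr() cuts at exactly 6 bytes and leaves half a character behind.
$raw = substr($text, 0, 6);

echo strlen($cut), "\n";                              // bytes kept by mb_strcut()
var_dump(mb_check_encoding($cut, 'UTF-8'));           // still valid UTF-8
var_dump(mb_check_encoding($raw, 'UTF-8'));           // no longer valid UTF-8
```

The result may be shorter than the requested length, which is usually what you want when the limit is a hard byte cap.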

Solution 2

strlen() returns the number of bytes.

Shortening a multibyte string to a certain number of bytes is a separate task. You will need to take care not to chop the string off in the middle of a multibyte sequence as you shorten it.

The other thing you need to handle is that a string can need more bytes once it is represented as JSON. For example, if your string contains a double quote character, it has to be escaped, and the backslash adds one byte. Other characters need to be escaped too. The point is that the string can get larger. I assume the byte limit applies to the total JSON payload, so you need to account for the JSON syntax itself as well as any escaping that JSON imposes on your string.
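
To see how much the encoding can add, a quick check along these lines may help. It is only a sketch with made-up sample strings. Note that json_encode() escapes non-ASCII characters as \uXXXX sequences by default, so Cyrillic text can grow far beyond its raw byte count unless JSON_UNESCAPED_UNICODE (available since PHP 5.4) is passed:

```php
<?php
// Made-up sample strings for illustration.
$quoted  = 'say "hi"';   // 8 bytes raw
$russian = 'число';      // 10 bytes raw (5 characters, 2 bytes each)

// Each double quote gains a backslash, plus the two surrounding quotes.
$quotedJson = json_encode($quoted);

// Default behaviour: every Cyrillic letter becomes a 6-byte \uXXXX escape.
$escapedJson = json_encode($russian);

// With the flag, the UTF-8 bytes are written as they are.
$plainJson = json_encode($russian, JSON_UNESCAPED_UNICODE);

echo strlen($quotedJson), "\n";
echo strlen($escapedJson), "\n";
echo strlen($plainJson), "\n";
```

Whether the receiving side accepts unescaped UTF-8 is something to confirm for your own payload format.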

An unoptimized, kinda hacky way to do it is to chop the string at, say, 5 bytes more than your limit using substr(). That cut may land in the middle of a multibyte sequence, so use mb_strlen() to get the number of characters and mb_substr() to remove the last character. Then encode the result as JSON and measure the bytes with strlen(). From there, enter a loop that keeps chopping off the last character with mb_substr(), encodes as JSON, and measures the bytes again with strlen(). The loop terminates when the number of bytes is acceptable.
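
A minimal sketch of that loop might look like the following. The function name truncateForJson() is hypothetical, and the sketch makes two substitutions that are not part of the description above: it starts from mb_strcut() rather than substr(), so the first cut is already on a character boundary, and it passes JSON_UNESCAPED_UNICODE so that non-ASCII characters are not expanded to \uXXXX escapes. It also measures only the encoded string, not a whole payload.

```php
<?php
// Hypothetical helper: shorten $text until its JSON encoding fits in $maxBytes.
function truncateForJson(string $text, int $maxBytes): string
{
    // First cut by bytes without splitting a multibyte sequence.
    $text = mb_strcut($text, 0, $maxBytes, 'UTF-8');

    // Drop one character at a time until the encoded form is small enough.
    while ($text !== ''
        && strlen(json_encode($text, JSON_UNESCAPED_UNICODE)) > $maxBytes) {
        $text = mb_substr($text, 0, mb_strlen($text, 'UTF-8') - 1, 'UTF-8');
    }

    return $text;
}

// Usage: the quotes and surrounding JSON syntax count against the limit.
$short = truncateForJson('Привет "мир"', 20);
echo $short, "\n";
echo strlen(json_encode($short, JSON_UNESCAPED_UNICODE)), "\n";
```

For a real payload you would encode the full structure inside the loop instead of the bare string, since keys and braces use up bytes too.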

Solution 3

If you wish to find the byte length of a multi-byte string when you are using mbstring.func_overload 2 and UTF-8 strings, then you can use the following:

mb_strlen($utf8_string, 'latin1');
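
As a small sketch of the same idea: mbstring also accepts the encoding name '8bit', which treats every byte as one character and so reads a little more clearly than 'latin1'. The sample string is made up. Note too that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, so on current versions strlen() cannot be overloaded.

```php
<?php
// Made-up sample: 5 Cyrillic letters, 2 bytes each in UTF-8.
$utf8_string = 'число';

// Byte count: each byte is treated as a single-byte character.
$bytes = mb_strlen($utf8_string, '8bit');

// Character count, for comparison.
$chars = mb_strlen($utf8_string, 'UTF-8');

echo $bytes, "\n";
echo $chars, "\n";
```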

Solution 4

In PHP 5, mb_strlen should return the number of characters, and strlen should return the number of bytes.

For instance, this portion of code:

$string = 'По своей природе компьютеры могут работать лишь с числами. И для того, чтобы они могли хранить в памяти буквы или другие символы, каждому такому символу должно быть поставлено в соответствие число';
echo mb_strlen($string, 'UTF-8') . '<br />';
echo strlen($string);

should give you the following output:

196
359


As a side note: this was one of the things that PHP 6 was planned to change. PHP 6 was meant to use Unicode by default, which would have made strlen return the number of characters rather than the number of bytes. PHP 6 was never released, however, and strlen still returns the number of bytes in PHP 7 and PHP 8.

Author by

Luke

Updated on July 19, 2022

Comments

  • Luke
    Luke almost 2 years

    I have the following function from the php.net site to determine the # of bytes in an ASCII and UTF-8 string:

    <?php 
    /** 
     * Count the number of bytes of a given string. 
     * Input string is expected to be ASCII or UTF-8 encoded. 
     * Warning: the function doesn't return the number of chars 
     * in the string, but the number of bytes. 
     * 
     * @param string $str The string to compute number of bytes 
     * 
     * @return The length in bytes of the given string. 
     */ 
    function strBytes($str) 
    { 
      // STRINGS ARE EXPECTED TO BE IN ASCII OR UTF-8 FORMAT 
    
      // Number of characters in string 
      $strlen_var = strlen($str); 
    
      // string bytes counter 
      $d = 0; 
    
     /* 
      * Iterate over every character in the string, 
      * escaping with a slash or encoding to UTF-8 where necessary 
      */ 
      for ($c = 0; $c < $strlen_var; ++$c) { 
    
          $ord_var_c = ord($str[$d]); // square brackets: the {} offset syntax was removed in PHP 8
    
          switch (true) { 
              case (($ord_var_c >= 0x20) && ($ord_var_c <= 0x7F)): 
                  // characters U-00000000 - U-0000007F (same as ASCII) 
                  $d++; 
                  break; 
    
              case (($ord_var_c & 0xE0) == 0xC0): 
                  // characters U-00000080 - U-000007FF, mask 110XXXXX 
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
                  $d+=2; 
                  break; 
    
              case (($ord_var_c & 0xF0) == 0xE0): 
                  // characters U-00000800 - U-0000FFFF, mask 1110XXXX 
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
                  $d+=3; 
                  break; 
    
              case (($ord_var_c & 0xF8) == 0xF0): 
                  // characters U-00010000 - U-001FFFFF, mask 11110XXX 
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
                  $d+=4; 
                  break; 
    
              case (($ord_var_c & 0xFC) == 0xF8): 
                  // characters U-00200000 - U-03FFFFFF, mask 111110XX 
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
                  $d+=5; 
                  break; 
    
              case (($ord_var_c & 0xFE) == 0xFC): 
                  // characters U-04000000 - U-7FFFFFFF, mask 1111110X 
                  // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 
                  $d+=6; 
                  break; 
              default: 
                $d++;    
          } 
      } 
    
      return $d; 
    } 
    ?> 
    

    However, when I try this with Russian (e.g. По своей природе компьютеры могут работать лишь с числами. И для того, чтобы они могли хранить в памяти буквы или другие символы, каждому такому символу должно быть поставлено в соответствие число.), it doesn't seem to return the correct number of bytes.

    The switch statement is using the default condition. Any ideas why Russian characters would not be working as expected? Or are there better options for this?

    I am asking this as I need to shorten a UTF-8 string to a certain number of bytes. i.e. I can only send a max. of 169 bytes of JSON data to the iPhone APNS in my situation (excluding the other packet data).

    Reference: PHP strlen - Manual (Paolo Comment on 10-Jan-2007 03:58)

  • Luke
    Luke about 14 years
    Doesn't this just give the string length in the # of characters? I need to know the actual number of bytes that is being used. Within utf-8 a character can be more than one byte, correct?
  • Xorlev
    Xorlev about 14 years
    Even with PHP5 that's not an assumption you can make. strlen() may or may not be overloaded by mb_strlen(). It's safer just to call mb_strlen($string, 'latin1');
  • Luke
    Luke about 14 years
    The function I have provided in the question seems to work fine for utf-8. I believe the issue to my problem is somewhere else in the iPhone PUSH APNS code. I seem to be able to PUSH around 160 bytes of Japanese, English text etc. However I can only PUSH around 110 bytes of Cyrillic (Russian) characters.
  • Luke
    Luke about 14 years
    I still believe that strlen and mb_strlen cannot be relied on to determine the actual bytes.
  • Luke
    Luke about 14 years
    I already have a while loop that keeps chopping 1 character at a time using mb_substr until the byte count falls below the limit. strlen doesn't seem to return the same # of bytes as the function in my question. strlen() may or may not be overloaded by mb_strlen() as per other comments, so it shouldn't be relied on.
  • goat
    goat about 14 years
    So don't overload strlen. If you don't control it, then there's other ways. Eg while (isset($str[$i])) $i++; will do the trick. Or fwrite() it to a stream or something...
  • Phil Rykoff
    Phil Rykoff about 14 years
    According to the comments section of php.net/manual/en/function.mb-strlen.php (very bottom), it's widely agreed that this function, called in the way described, will count the BYTES. When you tell the function that your input string contains latin1 (single-byte) chars, it counts every byte as a character, even if that byte is not a valid character in the ASCII sense. Could you try this out? Sorry, I don't have an mb-enabled environment...
  • Luke
    Luke about 14 years
    Thank you, using mb_strcut() is better than mb_substr() for my situation.
  • David Spector
    David Spector over 4 years
    PHP 6? PHP 6? It looks unlikely that PHP will ever "use Unicode by default".