PHP: Is it possible to correctly SUBSTR a UTF-8 string?

Solution 1

As usual, the answer turns out to have been here all along. (Honestly, I searched for about an hour.)

An answer at "string functions and UTF8 in php" reads:

Make sure you set the proper internal encoding: mb_internal_encoding('utf-8');

With mb_internal_encoding('utf-8'); set, everything works fine. Sorry to bother you guys, and thanks for the help.
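
A minimal sketch of that fix, assuming the mbstring extension is loaded and the string really is UTF-8 (the variable names are only illustrative):

```php
<?php
// Assumes the mbstring extension is available.
// Tell the mb_* functions to treat strings as UTF-8 by default.
mb_internal_encoding('UTF-8');

$str = "Лампа в вытяжке на кухне меняется, начиная с вытаскивания белого штырька справа.";

// With the internal encoding set, mb_substr() counts characters, not bytes.
$first50 = mb_substr($str, 0, 50);

echo $first50, "\n";             // Лампа в вытяжке на кухне меняется, начиная с вытас
echo mb_strlen($first50), "\n";  // number of characters
echo strlen($first50), "\n";     // number of bytes (larger, since Cyrillic letters take 2 bytes each)
```

Passing 'UTF-8' as the fourth argument of mb_substr() has the same effect for a single call; setting the internal encoding once is just more convenient when mb_ functions are used in many places.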

Solution 2

See the URLs below:

Extracting a substring from a UTF-8 string in PHP

http://osc.co.cr/extracting-a-substring-from-a-utf-8-string-in-php/

PHP substring with UTF-8

http://greekgeekz.blogspot.in/2010/11/php-substring-with-utf-8.html

Or try the following:

Example #1

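// Assumes this script file is saved as ISO-8859-1, so utf8_encode() yields a UTF-8 string.
// (utf8_encode() and utf8_decode() are deprecated as of PHP 8.2.)
$str1 = utf8_encode("Feliz día");

// substr() counts bytes: "Feliz d" is 7 bytes, so the 8th byte is only the first half of the two-byte "í"
$str2 = substr($str1, 0, 8);

echo utf8_decode($str2);

// will output Feliz d�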

Example #2

// mb_substr() counts characters, so the first 8 characters include the whole "í"
$str3 = mb_substr($str1, 0, 8, 'UTF-8');

echo utf8_decode($str3);

// will output Feliz dí

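As of PHP 5.3 you can also declare the script encoding with declare(encoding=...), but this does not make substr() multibyte-aware: substr() still counts bytes. The example below prints the expected text only because the first 9 bytes of the string happen to end on a character boundary.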

Example #3

declare(encoding='UTF-8');

$str4 = "Feliz día";

$str5 = substr($str4, 0, 9);

echo $str5;

// will output Feliz dí (the first 9 bytes happen to end exactly after the two-byte "í")

Solution 3

Try mb_strcut().
It behaves like substr() (it counts bytes, not characters), except that it never leaves a broken character at the end.
If the position where you cut falls inside a multibyte character of 2 or more bytes, mb_strcut() will not split the character into pieces; it leaves that character out of the result.

For instance, if you are trying to cut 50 bytes out of the string Лампа в вытяжке на кухне меняется, начиная с вытаскивания белого штырька справа., mb_strcut() will not cut the character н in half, but will drop it from the result.

$str = "Лампа в вытяжке на кухне меняется, начиная с вытаскивания белого штырька справа.";

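// At most 50 bytes, cut on a character boundary
echo mb_strcut($str, 0, 50, 'UTF-8');
// Prints: Лампа в вытяжке на кухне ме

// Exactly 50 bytes, the last character is split in half
echo substr($str, 0, 50);
// Prints: Лампа в вытяжке на кухне ме�

// Exactly 50 characters
echo mb_substr($str, 0, 50, 'UTF-8');
// Prints: Лампа в вытяжке на кухне меняется, начиная с вытас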

Hope it helps.
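
If you would rather check the difference programmatically than by eye, here is a small sketch that tests whether each result is still valid UTF-8. It assumes the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
$str = "Лампа в вытяжке на кухне меняется, начиная с вытаскивания белого штырька справа.";

$byBytes   = substr($str, 0, 50);              // exactly 50 bytes, may split a character
$bySafeCut = mb_strcut($str, 0, 50, 'UTF-8');  // at most 50 bytes, whole characters only
$byChars   = mb_substr($str, 0, 50, 'UTF-8');  // exactly 50 characters

// A string that ends in half a character is not valid UTF-8.
var_dump(mb_check_encoding($byBytes, 'UTF-8'));    // bool(false)
var_dump(mb_check_encoding($bySafeCut, 'UTF-8'));  // bool(true)

var_dump(strlen($bySafeCut));                      // int(49): the split "н" was dropped
var_dump(mb_strlen($byChars, 'UTF-8'));            // int(50)
```

So mb_strcut() is the one to reach for when the limit is in bytes (a fixed-size database column, for example), and mb_substr() when the limit is in characters.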

Author by

texnic

Updated on July 09, 2022

Comments

  • texnic
    texnic almost 2 years

    I have (in an SQLite database) the following string:

    Лампа в вытяжке на кухне меняется, начиная с вытаскивания белого штырька справа.

    The string is correctly shown by PHP using print. I would like to obtain just the first 50 chars of this string, i.e.

    Лампа в вытяжке на кухне меняется, начиная с вытас.

    I have tried using both the substr and mb_substr, and get

    Лампа в вытяжке на кухне ме�, i.e. only 28 chars.

    After reading here and elsewhere about the problems of mbstring, I realise that this is actually a 50-byte string (22 Russian chars = 44 bytes, plus 5 spaces, plus 1 byte of a cut-off character that is displayed as a question mark).

    Is there any nice solution to this? All my strings are UTF-8, so I could of course program a substr-function myself, by checking the first bit of every byte etc. But this should surely have been done before, right?

    UPDATE: I believe mb_substr does not work properly because mb_detect_encoding() does not work properly.

  • h2ooooooo
    h2ooooooo over 11 years
    As mentioned in my comment on the OP, I'm sure that mb_substr($string, 0, 50, "UTF-8") would also have worked, but I'm glad you found your solution (and hey, it's a much better solution if you're using mb_substr in a lot of different places!)
  • texnic
    texnic over 11 years
    Though everything works, I like Example #3 most of all: it's better to use a single function. However, the declare manual says: "The encoding declare value is ignored in PHP 5.3 unless php is compiled with --enable-zend-multibyte. Note that PHP does not expose whether --enable-zend-multibyte was used to compile PHP other than by phpinfo()." I believe I'll stick to the mb_ functions for now.
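
For completeness, the do-it-yourself approach mentioned in the question (looking at the leading bits of every byte) could be sketched roughly as below. The function name utf8_substr_sketch is hypothetical, not a PHP built-in, and the sketch assumes valid UTF-8 input and a non-negative start and length. In practice mb_substr() is the better choice.

```php
<?php
// Hypothetical helper, not part of PHP. Assumes $s is valid UTF-8 and that
// $start and $length are non-negative character counts.
function utf8_substr_sketch(string $s, int $start, int $length): string
{
    $n = strlen($s);
    $charIndex = -1;  // index of the character the current byte belongs to
    $from = $n;       // byte offset where the result starts
    $to = $n;         // byte offset where the result ends (exclusive)

    for ($i = 0; $i < $n; $i++) {
        // Continuation bytes have the bit pattern 10xxxxxx;
        // any other byte starts a new character.
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $charIndex++;
            if ($charIndex === $start) {
                $from = $i;
            }
            if ($charIndex === $start + $length) {
                $to = $i;
                break;
            }
        }
    }

    // Both offsets sit on character boundaries, so a plain byte-wise substr() is safe here.
    return substr($s, $from, $to - $from);
}

echo utf8_substr_sketch("Feliz día", 0, 8), "\n";  // Feliz dí
```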