UTF8 Encoding problem - With good examples


Solution 1

This may be a job for the mb_detect_encoding() function.

In my limited experience, it's not 100% reliable when used as a generic "encoding sniffer": it checks for the presence of certain characters and byte values to make an educated guess. In this narrow case, though (it only needs to distinguish between UTF-8 and ISO-8859-1), it should work.

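<?php
$text = $entity['Entity']['title'];

echo 'Original : ', $text."<br />";
// List UTF-8 first; strict mode (third argument) only reports an encoding
// the bytes are actually valid in.
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1", true);

echo 'Detected encoding '.$enc."<br />";

// Converting UTF-8 to UTF-8 is a no-op, so already-correct strings pass through.
echo 'Fixed result: '.iconv($enc, "UTF-8", $text)."<br />";

?>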

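You may get incorrect results for strings that contain no special characters, but that is not a problem: a pure ASCII string is byte-for-byte identical in UTF-8 and ISO-8859-1, so the conversion leaves it unchanged either way.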
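A related way to make the same UTF-8 vs. ISO-8859-1 decision is to test whether the bytes are well-formed UTF-8 and only convert when they are not. This is just a minimal sketch: ensure_utf8() is a made-up name, and it assumes that anything which is not valid UTF-8 is ISO-8859-1.

```php
<?php
// Hypothetical helper (not from the answer above).
// Assumption: input is either well-formed UTF-8 or ISO-8859-1.
function ensure_utf8($text)
{
    // Well-formed UTF-8 is returned untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Everything else is treated as ISO-8859-1 and converted.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// Byte escapes are used so the example does not depend on this file's encoding.
echo ensure_utf8("France T\xC3\xA9l\xC3\xA9com"), "\n";   // already UTF-8
echo ensure_utf8("Cond\xE9 Nast Publications"), "\n";     // ISO-8859-1 input
```

Like the detection approach, this cannot help with a single string that mixes both encodings.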

Solution 2

I made a function that addresses all these issues. It's called Encoding::toUTF8().

<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'Encoding::toUTF8 : ', Encoding::toUTF8($text)."<br />";
?>

Output:

Original : France Télécom
Encoding::toUTF8 : France Télécom

Original : Cond� Nast Publications
Encoding::toUTF8 : Condé Nast Publications

You don't need to know what the encoding of your strings is, as long as you know it is Latin1 (ISO 8859-1), Windows-1252 or UTF8. The string can contain a mix of them, too.

Encoding::toUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.

Usage:

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
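To give an idea of how a mixed string can be handled at all, here is a rough sketch. It is not the code of Encoding::toUTF8(); mixed_to_utf8() is a hypothetical name, and it assumes the stray bytes are ISO-8859-1 (not Windows-1252). It walks the string, keeps every well-formed UTF-8 sequence as it is, and converts any leftover byte on its own. It uses a closure, so it needs PHP 5.3 or later.

```php
<?php
// Hypothetical helper, not part of the Encoding class mentioned above.
function mixed_to_utf8($bytes)
{
    // Each match is either one well-formed UTF-8 sequence or, as a
    // fallback (group 1), a single byte that fits none of them.
    $pattern = '%
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
        | (.)
    %xs';

    return preg_replace_callback($pattern, function ($m) {
        // Group 1 exists only when the fallback branch matched a stray byte.
        // Assumption: stray bytes are ISO-8859-1.
        return isset($m[1])
            ? mb_convert_encoding($m[1], 'UTF-8', 'ISO-8859-1')
            : $m[0];
    }, $bytes);
}

// First "é" is UTF-8 (C3 A9), second one is Latin1 (E9).
echo mixed_to_utf8("T\xC3\xA9l\xE9com"), "\n";
```

The obvious limit of this approach: Latin1 text that happens to form a valid UTF-8 sequence (for example the two characters "Ã©") is indistinguishable from real UTF-8 and is left alone.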

Download:

http://dl.dropbox.com/u/186012/PHP/forceUTF8.zip

I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
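For anyone curious how this kind of repair can work, here is a small sketch of undoing repeated "UTF-8 read as Latin1 and encoded again" damage. It is not the implementation of Encoding::fixUTF8(); undo_double_encoding() and its $maxRounds parameter are made up for illustration, and it assumes the damage went through ISO-8859-1 (Windows-1252 garbling would need a different mapping).

```php
<?php
// Hypothetical helper, not the library's code.
function undo_double_encoding($text, $maxRounds = 5)
{
    for ($i = 0; $i < $maxRounds; $i++) {
        // Try to take one layer of encoding off.
        $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

        // Nothing changed (pure ASCII): nothing to repair.
        if ($candidate === $text) {
            break;
        }
        // The step must be lossless; otherwise characters were replaced by "?".
        if (mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1') !== $text) {
            break;
        }
        // Accept the step only if the result is still well-formed UTF-8.
        if (!mb_check_encoding($candidate, 'UTF-8')) {
            break;
        }
        $text = $candidate;
    }
    return $text;
}

// "Fédération" encoded twice: each "é" became the bytes C3 83 C2 A9.
echo undo_double_encoding("F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration"), "\n";
```

A correctly encoded string stops at the first round, because taking off one layer would leave bytes that are no longer valid UTF-8.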

Solution 3

Another way, maybe faster and less unreliable:

echo (strlen($str)!==strlen(utf8_decode($str)))
  ? $str                //is multibyte, leave as is
  : utf8_encode($str);  //encode

It compares the byte length of the original string with that of the utf8_decoded string. strlen() counts bytes, so a UTF-8 string that contains a multibyte character is longer than the same text in a single-byte encoding.

For example:

strlen('Télécom') 

should return 7 in Latin1 and 9 in UTF8
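To see those numbers without depending on the encoding of the source file, the bytes can be written out explicitly. A quick sketch (the variable names are only for illustration):

```php
<?php
// "Télécom" spelled out byte by byte in both encodings.
$latin1 = "T\xE9l\xE9com";          // é = E9 (1 byte)
$utf8   = "T\xC3\xA9l\xC3\xA9com";  // é = C3 A9 (2 bytes)

echo strlen($latin1), "\n";            // 7 bytes
echo strlen($utf8), "\n";              // 9 bytes
echo mb_strlen($utf8, 'UTF-8'), "\n";  // 7 characters
```

strlen() always reports bytes; mb_strlen() with an explicit encoding reports characters.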

Solution 4

I made these two little functions that work well for UTF-8 and ISO-8859-1 detection / conversion...

function detect_encoding($string)
{
    //http://w3.org/International/questions/qa-forms-utf-8.html
    if (preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )*$%xs', $string))
        return 'UTF-8';

    //If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list.
    //if you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
    return mb_detect_encoding($string, array('UTF-8', 'ASCII', 'ISO-8859-1', 'JIS', 'EUC-JP', 'SJIS'));
}

function convert_encoding($string, $to_encoding, $from_encoding = '')
{
    if ($from_encoding == '')
        $from_encoding = detect_encoding($string);

    if ($from_encoding == $to_encoding)
        return $string;

    return mb_convert_encoding($string, $to_encoding, $from_encoding);
}

If your database contains strings in 2 different charsets, then instead of plaguing all your application code with charset detection / conversion, I would write a "one shot" script that reads all the records of your tables and updates their strings to the correct format (I would pick UTF-8 if I were you). This way your code will be cleaner and simpler to maintain.

Just loop over the records in every table of your database and convert the strings like this:

//if the 3rd param is not specified the "from encoding" is detected automatically
$newString = convert_encoding($oldString, 'UTF-8');
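The loop itself could look roughly like the sketch below. To keep it self-contained it does not touch a real database: the rows are a plain array standing in for the result of a SELECT, rows_needing_update() is a hypothetical helper, and it assumes every string that is not well-formed UTF-8 is ISO-8859-1.

```php
<?php
// Hypothetical helper: returns only the rows whose $column had to be changed.
function rows_needing_update(array $rows, $column)
{
    $changed = array();
    foreach ($rows as $row) {
        $old = $row[$column];
        // Assumption: anything that is not well-formed UTF-8 is ISO-8859-1.
        $new = mb_check_encoding($old, 'UTF-8')
            ? $old
            : mb_convert_encoding($old, 'UTF-8', 'ISO-8859-1');
        if ($new !== $old) {
            $row[$column] = $new;
            $changed[] = $row;
        }
    }
    return $changed;
}

// In-memory rows standing in for something like: SELECT id, title FROM entities
$rows = array(
    array('id' => 1, 'title' => "France T\xC3\xA9l\xC3\xA9com"),  // already UTF-8
    array('id' => 2, 'title' => "Cond\xE9 Nast Publications"),    // ISO-8859-1
);

foreach (rows_needing_update($rows, 'title') as $row) {
    // A real script would run a prepared UPDATE for this id here.
    echo $row['id'], ': ', $row['title'], "\n";
}
```

Only touching the rows that actually change keeps the script safe to run more than once. Back up the database before running anything like this against real data.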

Author: Lizard

I am a PHP Web Developer

Updated on July 09, 2022

Comments

  • Lizard
    Lizard almost 2 years

    I have the following character encoding issue: somehow I have managed to save data with different character encodings into my database (UTF8). The code and outputs below show 2 sample strings and how they are output. One of them needs to be converted to UTF8 and the other already is UTF8.

    How do/should I go about checking if I should encode the string or not? e.g. I need each string to be outputted correctly, so how do I check if it is already utf8 or whether it needs to be converted?

    I am using PHP 5.2, mysql myisam tables:

    CREATE TABLE IF NOT EXISTS `entities` (
      ....
      `title` varchar(255) NOT NULL
      ....
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
    
    <?php
    $text = $entity['Entity']['title'];
    echo 'Original : ', $text."<br />";
    echo 'UTF8 Encode : ', utf8_encode($text)."<br />";
    echo 'UTF8 Decode : ', utf8_decode($text)."<br />";
    echo 'TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//TRANSLIT", $text)."<br />";
    echo 'IGNORE TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//IGNORE//TRANSLIT", $text)."<br />";
    echo 'IGNORE   : ', iconv("ISO-8859-1", "UTF-8//IGNORE", $text)."<br />";
    echo 'Plain    : ', iconv("ISO-8859-1", "UTF-8", $text)."<br />";
    ?>
    

    Output 1:

    Original : France Télécom
    UTF8 Encode : France Télécom
    UTF8 Decode : France T�l�com
    TRANSLIT : France Télécom
    IGNORE TRANSLIT : France Télécom
    IGNORE : France Télécom
    Plain : France Télécom
    

    Output 2:

    Original : Cond� Nast Publications
    UTF8 Encode : Condé Nast Publications
    UTF8 Decode : Cond?ast Publications
    TRANSLIT : Condé Nast Publications
    IGNORE TRANSLIT : Condé Nast Publications
    IGNORE : Condé Nast Publications
    Plain : Condé Nast Publications
    

    Thanks for your time on this one. Character encoding and I don't get on very well!

    UPDATE:

    echo strlen($string)."|".strlen(utf8_encode($string))."|";
    echo (strlen($string)!==strlen(utf8_encode($string))) ? $string : utf8_encode($string);
    echo "<br />";
    echo strlen($string)."|".strlen(utf8_decode($string))."|";
    echo (strlen($string)!==strlen(utf8_decode($string))) ? $string : utf8_decode($string);
    echo "<br />";
    
    23|24|Cond� Nast Publications
    23|21|Cond� Nast Publications
    
    16|20|France Télécom
    16|14|France Télécom
    
    • Pekka
      Pekka over 13 years
      From the look of it, the first string is already UTF-8, and the second one is ISO-8859-1. But what is your question?
    • Lizard
      Lizard over 13 years
      I need each string to be output correctly, so how do I check if it is already utf8 or whether it needs to be converted?
    • Richard Knop
      Richard Knop over 13 years
      Not sure but have a look here - dev.mysql.com/doc/refman/5.0/en/… - with a good combination of mysql functions you could do what you want just with a single update query.
    • Dr.Molle
      Dr.Molle over 13 years
      I also think that fixing the DB once is better than re-encoding the string on every request.
  • Richard Knop
    Richard Knop over 13 years
    According to my experience mb_detect_encoding() is not reliable at all. I tried to use it in the past but it returns completely wrong encodings for so many strings.
  • Pekka
    Pekka over 13 years
    @Richard it should work with such a narrow set of possible encodings (UTF-8 should be relatively easy to tell apart from ISO)... We'll see how it works out
  • Richard Knop
    Richard Knop over 13 years
    Yeah this seems to be the best option. Still he should backup his database before doing anything :)
  • Pekka
    Pekka over 13 years
    This method should be possible to apply in the database directly, too, by converting the character set on the fly and comparing the byte length (I think mySQL has a function for that) ... Just as an idea to fix the database more quickly
  • Dr.Molle
    Dr.Molle over 13 years
    In my experience the order of encoding_list matters. "UTF-8,ISO-8859-1" will give other results than "ISO-8859-1,UTF-8"
  • Pekka
    Pekka over 13 years
    @Lizard I think you implemented it wrongly. You need to output a utf8_decode to see whether it worked out (you're outputting an encoded version twice)
  • Pekka
    Pekka over 13 years
    Doesn't address the OP's issue. He has a data set with two mixed encodings, and it's unknown to him which row is which.
  • Davis Peixoto
    Davis Peixoto over 13 years
    I see... I gave an overall answer. Not so good to the case, and don't really address @Lizard's issue. @Pekka's and @Dr.Molle's are on right track. Need a function to detect and convert as needed.
  • CIRCLE
    CIRCLE over 10 years
    I was going crazy with this, thank so much @Pekka 웃 for this solution
  • danielpopa
    danielpopa over 7 years
    Thanks! can you create a github with this, I will gladly want to make some upgrades.
  • Mário Rodrigues
    Mário Rodrigues over 6 years
    For me, this was the only solution. Had some issues with SQL Server Database. Thank you @Pekka웃 for sharing!