UTF8 Encoding problem - With good examples

php mysql utf-8 character-encoding

81,952

Solution 1

This may be a job for the mb_detect_encoding() function.

In my limited experience, it is not 100% reliable as a generic "encoding sniffer": it checks for the presence of certain characters and byte values to make an educated guess. In this narrow case, though, where it only needs to distinguish between UTF-8 and ISO-8859-1, it should work.

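<?php
$text = $entity['Entity']['title'];

echo 'Original : ', $text, "<br />";

// List UTF-8 first. The third argument turns on strict detection.
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1", true);

echo 'Detected encoding ', $enc, "<br />";

// Convert from the detected encoding to UTF-8.
echo 'Fixed result: ', iconv($enc, "UTF-8", $text), "<br />";

?>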

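You may get incorrect detection results for strings that contain no special characters, but that is not a problem: plain ASCII text is byte-for-byte identical in UTF-8 and ISO-8859-1, so the conversion leaves it unchanged.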
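Since the choice here is only between UTF-8 and ISO-8859-1, the same decision can also be sketched as a plain validity check. This is a minimal sketch rather than the snippet above: it swaps mb_detect_encoding() for mb_check_encoding(), the helper name ensure_utf8() is made up for illustration, and it assumes PHP 7 or later with mbstring and that anything which is not valid UTF-8 is ISO-8859-1.

```php
<?php
// Hypothetical helper (the name is an assumption, not part of any library).
// Assumes the input is either valid UTF-8 or ISO-8859-1, nothing else.
function ensure_utf8(string $text): string
{
    // Valid UTF-8 (including plain ASCII) is returned untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise treat every byte as one ISO-8859-1 character.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// Byte escapes keep the example independent of this file's own encoding.
$alreadyUtf8 = "France T\xC3\xA9l\xC3\xA9com";   // é stored as 0xC3 0xA9
$latin1      = "Cond\xE9 Nast Publications";     // é stored as 0xE9

echo ensure_utf8($alreadyUtf8), "\n"; // left as is
echo ensure_utf8($latin1), "\n";      // converted to UTF-8
```

The caveat is the same as for any sniffing approach: an ISO-8859-1 string whose bytes happen to form valid UTF-8 sequences will be left unconverted.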

Solution 2

I made a function that addresses all these issues. It's called Encoding::toUTF8().

<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'Encoding::toUTF8 : ', Encoding::toUTF8($text)."<br />";
?>

Output:

Original : France Télécom
Encoding::toUTF8 : France Télécom

Original : Cond� Nast Publications
Encoding::toUTF8 : Condé Nast Publications

You don't need to know what the encoding of your strings is, as long as you know it is Latin1 (ISO 8859-1), Windows-1252 or UTF8. The string can contain a mix of them too.

Encoding::toUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.
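To illustrate how a string that mixes UTF8 and Latin1 can be normalized at all, here is a minimal sketch. It is not the code of Encoding::toUTF8(); the function name mixed_to_utf8() is hypothetical, it only handles ISO-8859-1 (not Windows-1252), and it assumes PHP 7 or later with mbstring. The idea is to leave well-formed UTF-8 sequences alone and convert only the stray high bytes.

```php
<?php
// Hypothetical sketch, NOT the forceUTF8 implementation.
// Keeps well-formed UTF-8 multibyte sequences and converts every other
// byte >= 0x80 from ISO-8859-1 to UTF-8. ASCII bytes are never matched.
function mixed_to_utf8(string $s): string
{
    $pattern = '/[\xC2-\xDF][\x80-\xBF]'          // 2-byte sequence
             . '|\xE0[\xA0-\xBF][\x80-\xBF]'      // 3-byte, no overlongs
             . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
             . '|\xED[\x80-\x9F][\x80-\xBF]'      // 3-byte, no surrogates
             . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'   // 4-byte sequences
             . '|[\xF1-\xF3][\x80-\xBF]{3}'
             . '|\xF4[\x80-\x8F][\x80-\xBF]{2}'
             . '|[\x80-\xFF]/';                   // any leftover high byte

    $result = preg_replace_callback($pattern, function (array $m): string {
        // A one-byte match can only come from the last alternative.
        return strlen($m[0]) === 1
            ? mb_convert_encoding($m[0], 'UTF-8', 'ISO-8859-1')
            : $m[0];
    }, $s);

    return $result ?? $s; // fall back to the input if the regex engine fails
}

// First é is Latin1 (0xE9), second é is already UTF-8 (0xC3 0xA9).
echo mixed_to_utf8("T\xE9l\xC3\xA9com"), "\n";
```

A known limit of this approach: Latin1 text that genuinely contains a pair such as "Ã©" is byte-identical to a UTF-8 "é" and will be kept as UTF-8.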

Usage:

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

http://dl.dropbox.com/u/186012/PHP/forceUTF8.zip

I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
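The garbled inputs above look like text that was UTF-8 encoded more than once. As a rough sketch of how such layers can be peeled off (this is not the code of Encoding::fixUTF8(); the name undo_double_utf8() and the round limit are assumptions, and PHP 7 or later with mbstring is assumed), one can decode back to ISO-8859-1 repeatedly for as long as the result is still valid UTF-8:

```php
<?php
// Hypothetical sketch, NOT the forceUTF8 implementation.
// Undoes repeated "Latin1 -> UTF-8" encoding applied to a whole string.
function undo_double_utf8(string $s, int $maxRounds = 5): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        // Only valid UTF-8 made purely of code points <= U+00FF can be
        // the result of encoding Latin1 bytes as UTF-8.
        if (!mb_check_encoding($s, 'UTF-8') || preg_match('/[^\x{00}-\x{FF}]/u', $s)) {
            break;
        }
        // Map each code point back to a single byte.
        $candidate = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
        // Stop when nothing changes (plain ASCII) or when the bytes no
        // longer form valid UTF-8 (the previous round was the real text).
        if ($candidate === $s || !mb_check_encoding($candidate, 'UTF-8')) {
            break;
        }
        $s = $candidate;
    }
    return $s;
}

// "Fédération" with é encoded twice (0xC3 0x83 0xC2 0xA9).
echo undo_double_utf8("F\xC3\x83\xC2\xA9d\xC3\x83\xC2\xA9ration"), "\n";
```

This sketch works on the whole string at once, so it will not repair an input where only some characters are garbled, such as the fourth example above. It can also misfire on legitimate text that happens to look double-encoded.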

Solution 3

Another way, possibly faster and more reliable:

echo (strlen($str)!==strlen(utf8_decode($str)))
  ? $str                //is multibyte, leave as is
  : utf8_encode($str);  //encode

It compares the length of the original string with the length of the utf8_decode()d string. A string that contains multibyte characters has a different strlen() than its single-byte-encoded equivalent.

For example:

strlen('Télécom')

should return 7 in Latin1 and 9 in UTF8
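The byte counts can be checked with a small sketch. The strings are written with byte escapes (an addition for illustration) so the result does not depend on how the script file itself is saved; mb_strlen() requires mbstring.

```php
<?php
// "Télécom" in ISO-8859-1: each é is the single byte 0xE9.
$latin1 = "T\xE9l\xE9com";
// "Télécom" in UTF-8: each é is the two bytes 0xC3 0xA9.
$utf8 = "T\xC3\xA9l\xC3\xA9com";

echo strlen($latin1), "\n";           // 7 (bytes)
echo strlen($utf8), "\n";             // 9 (bytes)
echo mb_strlen($utf8, 'UTF-8'), "\n"; // 7 (characters)
```

strlen() counts bytes, not characters, which is why the two encodings of the same word differ by one byte per accented letter.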

Solution 4

I made these two little functions that work well for UTF-8 and ISO-8859-1 detection and conversion:

function detect_encoding($string)
{
    //http://w3.org/International/questions/qa-forms-utf-8.html
    if (preg_match('%^(?: [\x09\x0A\x0D\x20-\x7E] | [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} | \xED[\x80-\x9F][\x80-\xBF] | \xF0[\x90-\xBF][\x80-\xBF]{2} | [\xF1-\xF3][\x80-\xBF]{3} | \xF4[\x80-\x8F][\x80-\xBF]{2} )*$%xs', $string))
        return 'UTF-8';

    //If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list.
    //if you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
    return mb_detect_encoding($string, array('UTF-8', 'ASCII', 'ISO-8859-1', 'JIS', 'EUC-JP', 'SJIS'));
}

function convert_encoding($string, $to_encoding, $from_encoding = '')
{
    if ($from_encoding == '')
        $from_encoding = detect_encoding($string);

    if ($from_encoding == $to_encoding)
        return $string;

    return mb_convert_encoding($string, $to_encoding, $from_encoding);
}

If your database contains strings in 2 different charsets, then instead of plaguing all your application code with charset detection and conversion, I would write a "one shot" script that reads all the records in your tables and updates their strings to the correct format (I would pick UTF-8 if I were you). This way your code will be cleaner and simpler to maintain.

Just loop over the records in every table of your database and convert the strings like this:

//if the 3rd param is not specified the "from encoding" is detected automatically
$newString = convert_encoding($oldString, 'UTF-8');
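As a rough sketch of what such a one-shot script could look like, the function below works out which rows need rewriting. It is self-contained, so it uses a simple validity check in place of convert_encoding(). The function name plan_title_updates() and the id column are assumptions, while the title column comes from the question's entities table. PHP 7 or later with mbstring is assumed.

```php
<?php
// Hypothetical helper: returns [id => fixed title] for rows whose title
// is not yet valid UTF-8. Assumes non-UTF-8 titles are ISO-8859-1.
function plan_title_updates(iterable $rows): array
{
    $updates = [];
    foreach ($rows as $row) {
        $old = $row['title'];
        $new = mb_check_encoding($old, 'UTF-8')
            ? $old
            : mb_convert_encoding($old, 'UTF-8', 'ISO-8859-1');
        // Only rows that actually change need an UPDATE.
        if ($new !== $old) {
            $updates[$row['id']] = $new;
        }
    }
    return $updates;
}

// In-memory stand-in for rows fetched from the database.
$rows = [
    ['id' => 1, 'title' => "France T\xC3\xA9l\xC3\xA9com"], // already UTF-8
    ['id' => 2, 'title' => "Cond\xE9 Nast Publications"],   // ISO-8859-1
];

$updates = plan_title_updates($rows);

foreach ($updates as $id => $title) {
    // With a real connection, this is where a prepared statement such as
    // "UPDATE entities SET title = :title WHERE id = :id" would run.
    echo "row $id needs updating\n";
}
```

Separating the planning step from the write step makes it easy to review the list of affected rows, and to take a backup, before anything is changed.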

Lizard

I am a PHP Web Developer

Updated on July 09, 2022

Comments

Lizard almost 2 years

I have the following character encoding issue: somehow I have managed to save data with different character encodings into my database (UTF8). The code and output below show 2 sample strings and how they are displayed. One of them needs to be converted to UTF8 and the other already is UTF8.

How should I go about checking whether I need to encode a string or not? I need each string to be output correctly, so how do I check if it is already UTF8 or whether it needs to be converted?

I am using PHP 5.2 with MySQL MyISAM tables:

CREATE TABLE IF NOT EXISTS `entities` (
  ....
  `title` varchar(255) NOT NULL
  ....
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
echo 'UTF8 Encode : ', utf8_encode($text)."<br />";
echo 'UTF8 Decode : ', utf8_decode($text)."<br />";
echo 'TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//TRANSLIT", $text)."<br />";
echo 'IGNORE TRANSLIT : ', iconv("ISO-8859-1", "UTF-8//IGNORE//TRANSLIT", $text)."<br />";
echo 'IGNORE   : ', iconv("ISO-8859-1", "UTF-8//IGNORE", $text)."<br />";
echo 'Plain    : ', iconv("ISO-8859-1", "UTF-8", $text)."<br />";
?>

Output 1:

Original : France Télécom
UTF8 Encode : France TÃ©lÃ©com
UTF8 Decode : France T�l�com
TRANSLIT : France TÃ©lÃ©com
IGNORE TRANSLIT : France TÃ©lÃ©com
IGNORE : France TÃ©lÃ©com
Plain : France TÃ©lÃ©com

Output 2:

Original : Cond� Nast Publications
UTF8 Encode : Condé Nast Publications
UTF8 Decode : Cond?ast Publications
TRANSLIT : Condé Nast Publications
IGNORE TRANSLIT : Condé Nast Publications
IGNORE : Condé Nast Publications
Plain : Condé Nast Publications

Thanks for your time on this one. Character encoding and I don't get on very well!

UPDATE:

echo strlen($string)."|".strlen(utf8_encode($string))."|";
echo (strlen($string)!==strlen(utf8_encode($string))) ? $string : utf8_encode($string);
echo "<br />";
echo strlen($string)."|".strlen(utf8_decode($string))."|";
echo (strlen($string)!==strlen(utf8_decode($string))) ? $string : utf8_decode($string);
echo "<br />";

23|24|Cond� Nast Publications
23|21|Cond� Nast Publications

16|20|France Télécom
16|14|France Télécom

Pekka over 13 years

From the look of it, the first string is already UTF-8, and the second one is ISO-8859-1. But what is your question?
Lizard over 13 years

I need each string to be output correctly, so how do I check if it is already utf8 or whether it needs to be converted?
Richard Knop over 13 years

Not sure but have a look here - dev.mysql.com/doc/refman/5.0/en/… - with a good combination of mysql functions you could do what you want just with a single update query.
Dr.Molle over 13 years

I also think that fixing the DB once is better than re-encoding the string on every request.

Richard Knop over 13 years

According to my experience mb_detect_encoding() is not reliable at all. I tried to use it in the past but it returns completely wrong encodings for so many strings.
Pekka over 13 years

@Richard it should work with such a narrow set of possible encodings (UTF-8 should be relatively easy to tell apart from ISO)... We'll see how it works out
Richard Knop over 13 years

Yeah this seems to be the best option. Still he should backup his database before doing anything :)
Pekka over 13 years

This method should be possible to apply in the database directly, too, by converting the character set on the fly and comparing the byte length (I think mySQL has a function for that) ... Just as an idea to fix the database more quickly
Dr.Molle over 13 years

In my experience the order of encoding_list matters. "UTF-8,ISO-8859-1" will give other results than "ISO-8859-1,UTF-8"
Pekka over 13 years

@Lizard I think you implemented it wrongly. You need to output a utf8_decode to see whether it worked out (you're outputting an encoded version twice)
Pekka over 13 years

Doesn't address the OP's issue. He has a data set with two mixed encodings, and it's unknown to him which row is which.
Davis Peixoto over 13 years

I see... I gave an overall answer. It is not a good fit for this case and doesn't really address @Lizard's issue. @Pekka's and @Dr.Molle's answers are on the right track. What is needed is a function to detect and convert as needed.
CIRCLE over 10 years

I was going crazy with this, thanks so much @Pekka 웃 for this solution
danielpopa over 7 years

Thanks! Can you create a GitHub repository for this? I would gladly make some upgrades.
Mário Rodrigues over 6 years

For me, this was the only solution. Had some issues with SQL Server Database. Thank you @Pekka웃 for sharing!