Fixing broken UTF-8 encoding

Solution 1

I've had to 'fix' a number of broken UTF-8 situations in the past. Unfortunately it's never easy, and it's often impossible.

Unless you can determine exactly how it was broken, and it was always broken in that exact same way, then it's going to be hard to 'undo' the damage.

If you want to try to undo the damage, your best bet would be to start writing some sample code, where you attempt numerous variations on calls to mb_convert_encoding() to see if you can find a combination of 'from' and 'to' that fixes your data. In the end, it's often best to not even bother worrying about fixing the old data because of the pain levels involved, but instead to just fix things going forward.
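The trial-and-error approach can be sketched roughly as follows. This is only an illustration: tryRepairs() and the list of candidate encodings are made up for the example, and it assumes the damage was a single round of "bytes read as encoding X, then saved as UTF-8".

```php
<?php
// Hypothetical helper: for each candidate encoding, try to reverse one round of
// "UTF-8 bytes were read as that encoding and re-saved as UTF-8".
function tryRepairs(string $broken, array $candidates): array
{
    $results = [];
    foreach ($candidates as $wrongEncoding) {
        // Map each character back to the single byte it came from.
        $attempt = mb_convert_encoding($broken, $wrongEncoding, 'UTF-8');
        // Keep only attempts that changed something and are still valid UTF-8.
        if ($attempt !== $broken && mb_check_encoding($attempt, 'UTF-8')) {
            $results[$wrongEncoding] = $attempt;
        }
    }
    return $results;
}

// "été" after one round of double encoding, written as explicit bytes.
$broken = "\xC3\x83\xC2\xA9t\xC3\x83\xC2\xA9";
foreach (tryRepairs($broken, ['ISO-8859-1', 'Windows-1252', 'ISO-8859-15']) as $enc => $fixed) {
    echo $enc, ' => ', $fixed, PHP_EOL;
}
```

Several candidates can produce the same plausible result, so the output still needs to be checked by eye against text you know was correct.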

However, before doing this, you need to make sure that you fix everything that is causing this issue in the first place. You've already mentioned that your DB table collation and editors are set properly. But there are more places where you need to check to make sure that everything is properly UTF-8:

  • Make sure that you are serving your HTML as UTF-8:
    • header("Content-Type: text/html; charset=utf-8");
  • Change your PHP default charset to utf-8:
    • ini_set("default_charset", 'utf-8');
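  • If your database doesn't ALWAYS talk in UTF-8, you may need to tell it on a per-connection basis to make sure it's in UTF-8 mode. In the MySQL command-line client you do that by issuing the command below; from PHP, the equivalent is running SET NAMES utf8 or calling mysqli_set_charset():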
    • charset utf8
  • You may need to tell your web server to always try to talk in UTF-8. In Apache the directive is:
    • AddDefaultCharset UTF-8
  • Finally, you need to ALWAYS make sure that you are using PHP functions that are properly UTF-8 compliant. This means always using the mb_* 'multibyte aware' string functions. It also means that when calling functions such as htmlspecialchars(), you pass the appropriate 'utf-8' charset parameter so that it doesn't encode the text incorrectly.
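The PHP side of the checklist above might look something like this. It is a minimal sketch, not a complete bootstrap: the sample string and variable names are invented for the example, and the database and Apache settings are left out because they live outside the script.

```php
<?php
// Declare UTF-8 to the browser and make it PHP's default charset.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');

// "Fédération" written as explicit UTF-8 bytes, so the example does not
// depend on how this file itself was saved.
$name = "F\xC3\xA9d\xC3\xA9ration";

// Byte-oriented and multibyte-aware functions disagree on non-ASCII text.
$byteCount = strlen($name);     // 12: each "é" takes two bytes
$charCount = mb_strlen($name);  // 10: counts characters

// Escape for HTML with the charset stated explicitly (third argument).
$escaped = htmlspecialchars('<b>' . $name . '</b>', ENT_QUOTES, 'UTF-8');

echo $byteCount, ' bytes, ', $charCount, ' characters: ', $escaped, PHP_EOL;
```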

If you slip up on any one step in the whole process, the encoding can be mangled and problems arise. Once you get in the 'groove' of doing UTF-8 though, this all becomes second nature. (PHP 6 was supposed to be fully Unicode compliant from the get-go, which would have made much of this easier, but it was never released.)

Solution 2

If you have double-encoded UTF-8 characters (various smart quotes, dashes, apostrophe ’, quotation mark “, etc.), in MySQL you can dump the data, then read it back in to fix the broken encoding.

Like this:

# -p with no value prompts for the password; a value separated by a space
# would be read as the database name instead.
mysqldump -h DB_HOST -u DB_USER -p --opt --quote-names \
    --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql

mysql -h DB_HOST -u DB_USER -p \
    --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

This was a 100% fix for my double encoded UTF-8.

Source: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

Solution 3

If you call utf8_encode() on a string that is already UTF-8, it comes out garbled, and it gets worse each time it is encoded again.
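To see the effect, here is a small sketch. latin1ToUtf8() is a stand-in written for this example: it does the same ISO-8859-1 to UTF-8 conversion as utf8_encode(), but through mb_convert_encoding(), since utf8_encode() is deprecated as of PHP 8.2.

```php
<?php
// Stand-in for utf8_encode(): treat the input as ISO-8859-1, convert to UTF-8.
function latin1ToUtf8(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$text = "\xC3\xA9"; // "é", already valid UTF-8

// Encode the already-encoded string again and again, showing how it grows.
for ($pass = 0; $pass <= 2; $pass++) {
    echo $pass, ' extra pass(es): ', strlen($text), ' bytes, hex ', bin2hex($text), PHP_EOL;
    $text = latin1ToUtf8($text);
}
```

On the sample "é" this reports 2, then 4, then 8 bytes, because every non-ASCII byte is expanded into two bytes on each pass.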

I made a function toUTF8() that converts strings into UTF-8.

You don't need to specify what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or a mix of these three.

I used this myself on a feed with mixed encodings in the same string.

Usage:

$utf8_string = Encoding::toUTF8($mixed_string);

$latin1_string = Encoding::toLatin1($mixed_string);

My other function fixUTF8() fixes garbled UTF8 strings if they were encoded into UTF8 multiple times.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

Download:

https://github.com/neitanod/forceutf8

Solution 4

I had a problem with an XML file that had a broken encoding: it said it was UTF-8, but it contained characters that were not UTF-8. After several rounds of trial and error with mb_convert_encoding(), I managed to fix it with:

mb_convert_encoding($text, 'Windows-1252', 'UTF-8')
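Applied blindly, that call also damages text that was fine to begin with, so it may be worth guarding it. The sketch below is one possible guard, not part of the original fix: undoDoubleEncoding() is a hypothetical name, and it assumes exactly one round of Windows-1252 double encoding.

```php
<?php
// Hypothetical helper: undo one round of double encoding, but only when the
// result is valid UTF-8 and the conversion can be reversed without loss.
function undoDoubleEncoding(string $text): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        return $text; // not UTF-8 at all, so this trick does not apply
    }
    $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');
    if (!mb_check_encoding($candidate, 'UTF-8')) {
        return $text; // the input was ordinary, correctly encoded UTF-8
    }
    // Round-trip check: rejects cases where characters were replaced by "?".
    if (mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') !== $text) {
        return $text;
    }
    return $candidate;
}

echo undoDoubleEncoding("\xC3\x83\xC2\xA9"), PHP_EOL; // double-encoded "é"
echo undoDoubleEncoding("\xC3\xA9"), PHP_EOL;         // correct "é", left alone
```

A string that was genuinely meant to read "Ã©" cannot be told apart from a double-encoded "é", so a guard like this is better suited to a one-off migration than to live output.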

Solution 5

As Dan pointed out: you need to convert them to binary and then convert/correct the encoding.

E.g., for utf8 stored as latin1 the following SQL will fix it:

UPDATE table
   SET field = CONVERT( CAST(field AS BINARY) USING utf8)
 WHERE $broken_field_condition

Author: user2480201

Updated on July 05, 2022

Comments

  • user2480201
    user2480201 almost 2 years

    I am in the process of fixing some bad UTF-8 encoding. I am currently using PHP 5 and MySQL.

    In my database I have a few instances of bad encodings that print like: î

    • The database collation is utf8_general_ci
    • PHP is using a proper UTF-8 header
    • Notepad++ is set to use UTF-8 without BOM
    • database management is handled in phpMyAdmin
    • not all cases of accented characters are broken

    I need some sort of function that will help me map the instances of î, í, ü and others like it to their proper accented UTF-8 characters.

  • Paul Weber
    Paul Weber over 14 years
    Thank you very much! Because there are also many correctly encoded strings in the DB, which makes the problem worse, I chose to str_replace the strings I know are corrupt with their correct characters. It works great. I have already implemented most of your tips regarding PHP and server setup, but it is a great summary, so I would choose this as the answer, because my solution is not really beautiful.
  • MtnViewMark
    MtnViewMark about 14 years
    One important note on this advice: Do NOT add 'utf-8' as the second argument to the function htmlspecialchars(). Without the argument, that function does the correct thing with UTF-8 strings, since it ignores all bytes with the high bit set and passes them. This will preserve them and "does the right thing". With 'utf-8', htmlspecialchars() interprets the UTF-8 string - but doesn't handle characters outside the BMP (those with code points U+10000 and above, encoded in four bytes). It incorrectly encodes those that happen to match the specials mod 65536.. The behavior is both slower and wrong.
  • user2480201
    user2480201 almost 14 years
    interesting; i'll remember this if i have the issue again. thanks
  • Eli
    Eli about 13 years
    Makes sense. I guess it's really double-encoded, it's just that the field is marked latin1 even though it really contains UTF8, so when you request the field as UTF8 it encodes it again.
  • Energiequant
    Energiequant about 13 years
    Seems to have successfully converted a Typo3 database for me. Thanks for posting this; it's much cleaner than any other conversion method. :)
  • Frost
    Frost over 12 years
    I wish I could give you more upvotes, you really really deserve them.
  • Diego Pino
    Diego Pino over 12 years
    Man, you made my day, it worked for me. Now I'd like to understand the real reason why the dump I'm working with has these wrong characters (maybe it was correctly encoded in utf-8 but the dump process printed the output as latin1)
  • Sebastián Grignoli
    Sebastián Grignoli over 12 years
    Please see my answer below. I addressed all these problems in a single pure-PHP function: fixUTF8(). You don't need to change your server configuration, and you don't even need to have the multibyte functions installed. The function is smart enough to fix any character independently, even if the encoding is mixed inside the same string (no matter how many times it was converted or if it's in UTF-8 already).
  • Sebastián Grignoli
    Sebastián Grignoli over 12 years
    Take a look at my answer. The function Encoding::fixUTF8(). It fixes all UTF8 characters (there are millions of them), and can handle strings encoded multiple times, not only twice.
  • Kristopher Ives
    Kristopher Ives over 12 years
    Seems to do the trick. I don't use it for normal output, but I do enjoy using your class for data migration help.
  • Sebastián Grignoli
    Sebastián Grignoli over 12 years
    Thanks. It's magical, isn't it? I think this little piece of code is one of the most satisfying things I've produced, in terms of problems solved with it. :-)
  • Prine
    Prine about 12 years
    Yep, also worked for me! Thanks to you sharing it here and thanks to the owner of the blog :)
  • Sebastián Grignoli
    Sebastián Grignoli about 12 years
    I recommend using it for migrations, as Kristopher said, but not in a production environment. There are cases where you would want the "garbled string" to stay garbled, like in this answer.
  • Nick Johnson
    Nick Johnson over 11 years
    I have struggled with third party systems that have mixed encoding. I tested your class out, and it works well. I just ran it on fields in our database that stored outside input with mixed encoding, and it cleaned everything up. Now I am implementing it at our insert junctions. PDO doesn't identify mixed encoding by the way, thus your solution rocks!
  • Yves Van Broekhoven
    Yves Van Broekhoven over 11 years
    Ran into the problem when transferring a Wordpress DB from staging to local environment by exporting it with Sequel Pro.
  • sieppl
    sieppl over 11 years
    Great library, thank you very much! It helped me to fix thousands of broken file names that occurred by copying the files from Linux to Windows via FTP and back.
  • andig
    andig about 11 years
    +1 excellent- fixUTF8 even takes care of some weird encoding errors I've seen.
  • Walter81
    Walter81 about 11 years
    Kudos! I've been struggling with this for a long time too, until I accidentally found this! Many thanks.
  • Desmond Hume
    Desmond Hume almost 10 years
    "FÃÂédÃÂération Camerounaise de Football" doesn't seem to work, others do
  • Titan
    Titan about 9 years
    This worked for me after days of banging my head over the issue (everything was UTF-8 end to end but in RSS it wasn't!) Thank you!
  • Jose Nobile
    Jose Nobile over 8 years
    PHP 6 was skipped; PHP 7 will have a stable release in one month.
  • Raja Khoury
    Raja Khoury over 8 years
    Thank you Sebastian. This is really helpful
  • user828591
    user828591 over 8 years
    Works perfectly! I also had to fix an old TYPO3 database and this just did the trick!
  • Simon B.
    Simon B. about 8 years
    All comments so far on this answer - including mine!! - are completely useless and just add noise. Aargh!
  • SuN
    SuN about 8 years
    WHERE LENGTH( field ) != CHAR_LENGTH( field ) ;)
  • fnkr
    fnkr almost 8 years
    Thanks! This is working for me: ssh user@host 'mysqldump --skip-set-charset --default-character-set=latin1 dbname' | mysql --default-character-set=utf8 dbname
  • JustBaron
    JustBaron almost 8 years
    Yep, super script! Had an issue with incorrect database encoding during migration. This solved it.
  • Avatar
    Avatar over 7 years
    My problem was: Database fields saved as latin1_swedish_ci, output by PHP as utf-8 showing Umlaute ü as ü and ö as ö. This helped to fix this.
  • Revious
    Revious almost 7 years
    @Jayrox: There is a better answer with a tool from GitHub: stackoverflow.com/a/3521340/196210
  • Revious
    Revious almost 7 years
    @SebastiánGrignoli: it would be very nice but it doesn't work on my code... Encoding::fixUTF8("luminositÃ?") doesn't solve the issue. Any suggestion?
  • Revious
    Revious almost 7 years
    @SebastiánGrignoli: have a look to this page. This solves every problem i18nqa.com/debug/utf8-debug.html
  • StefanJM
    StefanJM about 5 years
    almost a week of trying to figure out what was going on, and here's a solution that fixes it in a minute:)
  • Rick James
    Rick James almost 4 years
    Those look like repeatedly treating a utf8 string as if it were latin1 and converting it to utf8. See "double encoding" in stackoverflow.com/questions/38363566/…
  • IMSoP
    IMSoP about 3 years
    What? That doesn't even begin to make sense!
  • Buffalo
    Buffalo over 2 years
    Doesn't work for some characters, but works good enough. Thanks!
  • Mark
    Mark about 2 years
    This has been such a life saver, I cannot express how great it has been. I haven't been able to find anything else like it!! I've been using it to do real-time translation of strings (not a one-time fix for bad UTF-8). Do you think it's a major resource hog to use it that way?