The dreaded MySQL import encoding issue - revisited

11,312

Solution 1

This is how I ended up solving my problem:

First mysqldump -uusername -ppassword --default-character-set=latin1 database -r dump.sql

Then run this script:

$search = array('/latin1/');
$replace = array('utf8');
foreach (range(128, 255) as $dec) {
    $search[] = "/\x".dechex($dec)."/";
    $replace[] = "&#$dec;";
}

$input = fopen('dump.sql', 'r');
$output = fopen('result.sql', 'w');

while (!feof($input)) {
    $line = fgets($input);
    $line = preg_replace($search, $replace, $line);
    fwrite($output, $line);
}

fclose($input);
fclose($output);

The script finds all the hex characters above 127 and encoded them into their HTML entities.

Then mysql -uusername -ppassword database < result.sql

Solution 2

A common problem with older WordPress databases and even newer ones is that the database tables get set as latin-1 but the contents are actually encoded as UTF-8. If you try to export as UTF-8 MySQL will attempt to convert the (supposedly) Latin-1 data to UTF-8 resulting in double encoded characters since the data was already UTF-8.

The solution is to export the tables as latin-1. Since MySQL thinks they are already latin-1 it will do a straight export.

Change the character set from ‘latin1′ to ‘utf8′. Since the dumped data was not converted during the export process, it’s actually UTF-8 encoded data.

Create your new table as UTF-8 If your CREATE TABLE command is in your SQL dump file, change the character set from ‘latin1′ to ‘utf8′.

Import your data normally. Since you’ve got UTF-8 encoded data in your dump file, the declared character set in the dump file is now UTF-8, and the table you’re importing into is UTF-8, everything will go smoothly

Solution 3

I was able to resolve this issue by modifying my wp-config.php as follows:

/** Database Charset to use in creating database tables. */
define('DB_CHARSET', 'utf8');

/** The Database Collate type. Don't change this if in doubt. */
define( 'DB_COLLATE', 'utf8_general_ci' );
Share:
11,312
Matt
Author by

Matt

Updated on July 03, 2022

Comments

  • Matt
    Matt almost 2 years

    I'm having the standard MySQL import encoding issue, but I can't seem to solve it.

    My client has had a WordPress installation running for some time. I've dumped the database to a file, and imported it locally. The resulting pages have a splattering of � characters throughout.

    I've inspected the database properties on both sides: production: show create database wordpress;

    CREATE DATABASE `wordpress` /*!40100 DEFAULT CHARACTER SET latin1 */
    

    local: show create database wordpress;

    CREATE DATABASE `wordpress` /*!40100 DEFAULT CHARACTER SET latin1 */
    

    production: show create table wp_posts;

    CREATE TABLE `wp_posts` (
      `ID` bigint(20) unsigned NOT NULL auto_increment,
      ...
      KEY `post_date_gmt` (`post_date_gmt`)
    ) ENGINE=MyISAM AUTO_INCREMENT=7932 DEFAULT CHARSET=utf8
    

    local: show create table wp_posts;

    CREATE TABLE `wp_posts` (
      `ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
      ...
      KEY `post_date_gmt` (`post_date_gmt`)
    ) ENGINE=MyISAM AUTO_INCREMENT=7918 DEFAULT CHARSET=utf8
    

    I've spent hours reading forums on how to squash the �, but I can't get anything to work. 99% of the answers say to match the character set between the databases. What I think should work if the following:

    mysqldump --opt --compress --default-character-set=latin1 -uusername -ppassword wordpress | ssh [email protected] mysql --default-character-set=latin1 -uusername -ppassword wordpress
    

    I've done it using the utf8 char-set as well. Still with the �'s.

    I've tried modifying the SQL dump directly, putting with utf8 or latin1 in the "SET names UTF8" line. Still with the �'s.

    Strange Symptoms

    I'd expect these � characters to appear in place of special characters in the content, like ñ or ö, but I've seen it where there would normally be just a space. I've also seen it in place of apostrophes (but not all apostrophes), double quotes, and trademark symbols.

    The � marks are pretty rare. They appear on average three to four times per page.

    I don't see any �'s when viewing the database through Sequel Pro (locally or live). I don't see any �'s in the SQL when viewing through Textmate.

    What am I missing?

    EDIT

    More info:

    I've tried to determine what the live database thinks the encoding is. I ran show table status, and it seems that the Collations are a mix of utf8_general_ci,utf8_binandlatin1_swedish_ci`. What are they different? Does it matter?

    I also ran: show variables like "character_set_database" and got latin1;

  • Matt
    Matt about 13 years
    That sounds exactly like what is happening.
  • Matt
    Matt about 13 years
    I tried the process you outlined. The export: mysqldump --default-character-set=latin1 -u username -ppassword wordpress > dump-20110512.sql. The import: mysql -uusername -ppassword wordpress < dump-20110512.utf8-1.sql. Now in any of the fields that contained a �, they are truncated at the first �. The import seemed to have ran without error. When I check out the SQL file, the INSERT statement that I'm using as a reference seems to be complete. I believe the mischevieous characters are still there. I see the text <92> where apostrophes should be.
  • Matt
    Matt about 13 years
    I changed every instance of latin1 to utf8 in the SQL file. Including some at the end of CREATE TABLE statements, where it had: ENGINE=MyISAM AUTO_INCREMENT=635 DEFAULT CHARSET=latin1;
  • Cobby
    Cobby almost 11 years
    mysql_* functions are deprecated and you should use PDO.
  • Joshua Soileau
    Joshua Soileau almost 11 years
    Thank you so much, this solved my problem! I've been working on this damn thing for two days. I was trying to get a copy of a database that was in turkish, and it had 'special' accented characters in it. When I imported it, it kept crashing wherever the special characters where. I exported in latin1 character set, ran this script above, and imported it just fine after that. I think I actually imported in latin1 charset as well, but it worked! Thanks again man.
  • Julio Vedovatto
    Julio Vedovatto almost 7 years
    Spent 3 hours looking for solution. Dumped DB and restored several times with no sucess. These 2 constants saved my day!