Converting mysql tables from latin1 to utf8

12,689

Solution 1

The reason why the strings 'Drü' and 'Dru' evaluate as the same is that in the utf8_general_ci collation, they count as "the same". The purpose of a collation for a character set is to provide a set of rules as to when strings are the same, when one sorts before the other, and so on.

If you want a different set of comparison rules, you need to choose a different collation. You can see the available collations for the utf8 character set by issuing SHOW COLLATION LIKE 'utf8%'. There are a bunch of collations intended for text that is mostly in a specific language; there is also the utf8_bin collation which compares all strings as binary strings (i.e. compares them as sequences of 0s and 1s).

Solution 2

UTF8_GENERAL_CI is accent insensitive.

Use UTF8_BIN or a language-specific collation.

Share:
12,689
Brent Baisley
Author by

Brent Baisley

Tech Lead for Enterprise Accounts group for stock photography site shutterstock.com. Engineering Lead for bigstockphoto.com, footage.shutterstock.com, Shutterstock enterprise accounts, and new business. Creator of Basecoat PHP framework. https://github.com/shutterstock/basecoat Created and manage Someecards Facebook application https://apps.facebook.com/someecards/ for someecards.com. Created and manage My Countdowns Facebook application. https://apps.facebook.com/countdown-timer/ Experienced PHP/Javascript/CSS/HTML developer, MySQL DBA, and Linux admin. Over 5 years of following Agile/Scrum development methodology.

Updated on June 08, 2022

Comments

  • Brent Baisley
    Brent Baisley almost 2 years

    I'm trying to convert some mysql tables from latin1 to utf8. I'm using the following command, which seems to mostly work.

    ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
    

    However, on one table I get an error about a duplicate key entry. This is caused by a unique index on a "name" field. It seems when converting to utf8, any "special" characters are indexed as their straight english equivalent. For example, there is already a record with a name field value of "Dru". When converting to utf8, a record with "Drü" is considered a duplicate. The same with "Patrick" and "Påtrìçk".

    Here is how to reproduce the issue:

    CREATE TABLE `example` (   `name` char(20) CHARACTER SET latin1 NOT NULL,
      PRIMARY KEY (`name`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
    
    INSERT INTO example (name) VALUES ('Drü'),('Dru'),('Patrick'),('Påtrìçk');
    
    ALTER TABLE example convert to character set utf8 collate utf8_general_ci;
    ERROR 1062 (23000): Duplicate entry 'Dru' for key 1