How to detect UTF-8 characters in a Latin1 encoded column - MySQL


Solution 1

Character encoding, like time zones, is a constant source of problems.

What you can do is look for any "high-ASCII" characters, as these are either latin1 accented characters or symbols, or the first byte of a UTF-8 multi-byte sequence. Telling the difference isn't going to be easy unless you cheat a bit.

To figure out what encoding is correct, you just SELECT two different versions and compare visually. Here's an example:

SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1, 
       CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8 
FROM users 
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']')

This is made unusually complicated because the MySQL regexp engine seems to ignore escapes like \x80, which makes it necessary to use the UNHEX() function instead.

This produces results like this:

latin1                utf8
----------------------------------------
Björn                Björn
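The same two-way comparison can be reproduced outside MySQL. This is a minimal Python sketch (not part of the original answer) showing why the two columns differ: the column holds one sequence of bytes, and the only question is which charset you decode them with.

```python
# The same bytes, decoded two ways - mirroring the latin1/utf8 columns above.
name = "Björn"
raw = name.encode("utf-8")       # the bytes a latin1 column would hold if UTF-8 was written into it

print(raw.decode("latin1"))      # -> BjÃ¶rn  (mojibake: each UTF-8 byte read as a latin1 character)
print(raw.decode("utf-8"))       # -> Björn   (correct interpretation)
```

If the utf8 column of the comparison looks right and the latin1 column looks garbled, the stored bytes are UTF-8, which is exactly the case the query is designed to surface.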

Solution 2

Since your question is not completely clear, let's assume some scenarios:

  1. Hitherto wrong connection: You've been connecting to your database incorrectly using the latin1 encoding, but have stored UTF-8 data in the database (the encoding of the column is irrelevant in this case). This is the case I described here. It's easy to fix: dump the database contents to a file through a latin1 connection. This translates the incorrectly stored data into correctly stored UTF-8, matching the way it has worked so far (read the linked article for the gory details). You can then reimport the data into the database through a correctly set utf8 connection, and it will be stored as it should be.
  2. Hitherto wrong column encoding: UTF-8 data was inserted into a latin1 column through a utf8 connection. In that case the data is gone; any non-latin1 character will have been replaced by a ?.
  3. Hitherto everything fine, henceforth added support for UTF-8: You have Latin-1 data correctly stored in a latin1 column, inserted through a latin1 connection, but want to expand that to also allow UTF-8 data. In that case just change the column encoding to utf8. MySQL will convert the existing data for you. Then just make sure your database connection is set to utf8 when you insert UTF-8 data.
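For scenario 1, the repair amounts to reversing a double encoding. A minimal Python sketch of that round trip (an illustration, not the actual dump/reimport procedure):

```python
# Scenario 1 in miniature: UTF-8 bytes were stored through a latin1
# connection, so a utf8-aware read shows mojibake. Re-encoding the mojibake
# as latin1 recovers the original bytes, which then decode cleanly as UTF-8.
garbled = "BjÃ¶rn"                              # what you see when reading the bad data as utf8
fixed = garbled.encode("latin1").decode("utf-8")
print(fixed)                                    # -> Björn
```

The dump-through-latin1 / reimport-through-utf8 procedure performs this same byte-preserving round trip at the database level.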

Solution 3

There is a script on GitHub to help with this sort of thing.

Author: dinie

Updated on February 27, 2020

Comments

  • dinie
    dinie about 4 years

    I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.

    At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.

    Specifically, I want to check whether I have UTF-8 characters in the Latin1 columns. What would be the best way to do this? If only a few rows are affected, then I can just fix this manually.

    Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?

    Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters? e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name); Is this enough?

    At the moment I have switched my MySQL client encoding to UTF-8.

  • tripleee
    tripleee about 12 years
    If multiple clients have been adding data, and some of them assumed they should submit utf8, you would get an unholy mix which basically needs to be sorted out manually. That doesn't mean you cannot automate parts of the process, and in fact the majority of cases can probably be decided without human intervention.
  • Gromski
    Gromski about 12 years
    True, but then you're really completely foobar'd. Before even attempting to answer that scenario, the OP would need to give a lot more information about the actual problem at hand.
  • dinie
    dinie about 12 years
    Apologies for a late response and vague initial question. Awarded this answer because it more or less helped me in detecting characters where the likely intention was a UTF-8 character. Upvoted deceze's answer because it covered situations I have elsewhere in the database.
  • Kasapo
    Kasapo over 11 years
    Awesome -- this little nugget helped me fix an issue where utf8 encoded data was inserted to a utf8 table but interpreted as latin1 because I entered it over the mysql CLI... funny though because since the system was set to UTF8 it looked fine when inputting and selecting (just not when being decoded and rendered on the associated website).
  • tadman
    tadman over 11 years
    Sometimes if you read and write data from two connections with precisely the same misconfiguration, it magically works. Two wrongs, on occasion, do make a right.
  • Synchro
    Synchro about 11 years
    This script worked very nicely for me, and I contributed some improvements to make it much faster and more flexible. I've also got a branch which converts to MySQL's 'proper' utf8mb4 charset.
  • Synchro
    Synchro about 11 years
    This answer contains a fairly long list of the likely bad combinations.
  • Karthik T
    Karthik T over 9 years
    This script worked.. still don't understand how it worked though.. need to go through it sometime.. Moved from latin1 to utf8 almost painlessly; had to add mysql_set_charset("utf8"); for PHP to use it right after that.
  • James
    James about 7 years
    For case 1, the command that worked for me was mysqldump --default-character-set=latin1 -u user -p database. I then had to go into the dump file and change SET NAMES latin1 to utf8. Then reimport the dump file and all fixed.
  • dolmen
    dolmen over 6 years
    The first byte of a UTF-8 encoded codepoint above the ASCII range is in the range 0xC2-0xF4 (U+0080 starts with byte 0xC2; U+10FFFF starts with 0xF4). So the range in this answer could be more restrictive to reduce false positives.
  • Admin
    Admin almost 6 years
    The OP asked how to detect UTF-8 characters in Latin1 columns. AFAICT, the mysql-convert-latin1-to-utf8 script, at present, does not actually help with that. Instead, it has a user-modifiable $collationMap array specifying a set of key-value pairs of collations. For each column whose collation matches one of the keys, the script will blindly assume its contents are encoded with the charset corresponding to the value's collation. The script changes the column's collation (and, implicitly, character set) to the latter, while preserving the binary value of the contents.
  • Patrick James McDougle
    Patrick James McDougle almost 6 years
    Well, detection is kinda impossible. Someone jammed a square peg into a round hole and asked why it's not square when it comes back out. You can kinda tell because the data that comes out doesn't quite look right, but you have to look at it to know (or identify common character sequences that, when interpreted as UTF-8, are more likely than the sequence of latin characters).
  • Jack
    Jack over 5 years
    This also gets false positive results: e.g. the latin1 "é" letter (first UTF-8 byte 195, latin1 code 233). Is there a workaround for this?
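The Option 2 idea from the question (comparing LENGTH(), which counts bytes, against CHAR_LENGTH(), which counts characters) can be sketched in Python. Note a caveat the SQL version inherits: in a column MySQL believes to be latin1, every character is one byte, so the two counts only diverge after converting the value to a multi-byte charset. The helper below is a hypothetical illustration, not MySQL code:

```python
# Rough analogue of: SELECT ... WHERE LENGTH(name) != CHAR_LENGTH(name);
# len(s.encode()) counts bytes (like LENGTH on a utf8 value),
# len(s) counts characters (like CHAR_LENGTH).
def has_multibyte(value: str) -> bool:
    return len(value.encode("utf-8")) != len(value)

print(has_multibyte("Bjorn"))   # False: pure ASCII, one byte per character
print(has_multibyte("Björn"))   # True: 'ö' takes two bytes in UTF-8
```

So the check flags any non-ASCII content, including legitimate latin1 accents; it narrows the rows to inspect but cannot by itself distinguish mis-stored UTF-8 from correct latin1.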