Trouble with UTF-8 characters; what I see is not what I stored
Solution 1
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET
troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4
and COLLATION utf8mb4_unicode_520_ci
. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4
is a superset of utf8
in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4
, not utf8
.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
- Have your editor, etc. set to UTF-8.
- HTML forms should start like
<form accept-charset="UTF-8">
. - Have your bytes encoded as UTF-8.
- Establish UTF-8 as the encoding being used in the client.
- Have the column/table declared
CHARACTER SET utf8mb4
(Check withSHOW CREATE TABLE
.) -
<meta charset=UTF-8>
at the beginning of HTML - Stored Routines acquire the current charset/collation. They may need rebuilding.
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT
cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
- For a blank space (in any language):
20
- For English:
4x
,5x
,6x
, or7x
- For most of Western Europe, accented letters should be
Cxyy
- Cyrillic, Hebrew, and Farsi/Arabic:
Dxyy
- Most of Asia:
Exyyzz
- Emoji and some of Chinese:
F0yyzzww
- More details
Specific causes and fixes of the problems seen
Truncated text (Se
for Señor
):
- The bytes to be stored are not encoded as utf8mb4. Fix this.
- Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or
for Señor
);
one of these cases exists:
Case 1 (original bytes were not UTF-8):
- The bytes to be stored are not encoded as utf8. Fix this.
- The connection (or
SET NAMES
) for theINSERT
and theSELECT
was not utf8/utf8mb4. Fix this. - Also, check that the column in the database is
CHARACTER SET utf8
(or utf8mb4).
Case 2 (original bytes were UTF-8):
- The connection (or
SET NAMES
) for theSELECT
was not utf8/utf8mb4. Fix this. - Also, check that the column in the database is
CHARACTER SET utf8
(or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>
.
Question Marks (regular ones, not black diamonds) (Se?or
for Señor
):
- The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
- The column in the database is not
CHARACTER SET utf8
(or utf8mb4). Fix this. (UseSHOW CREATE TABLE
.) - Also, check that the connection during reading is UTF-8.
Mojibake (Señor
for Señor
):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
- The bytes to be stored need to be UTF-8-encoded. Fix this.
- The connection when
INSERTing
andSELECTing
text needs to specify utf8 or utf8mb4. Fix this. - The column needs to be declared
CHARACTER SET utf8
(or utf8mb4). Fix this. - HTML should start with
<meta charset=UTF-8>
.
If the data looks correct, but won't sort correctly, then either you have picked the wrong collation, or there is no collation that suits your need, or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX ..
described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were Señor
.
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
Solution 2
I had similar issues with two of my projects, after a server migration. After searching and trying a lot of solutions, I came across with this one:
mysqli_set_charset($con,"utf8");
After adding this line to my configuration file, everything works fine!
I found this solution for MySQLi—PHP mysqli set_charset() Function—when I was looking to solve an insert from an HTML query.
Solution 3
I was also searching for the same issue. It took me nearly one month to find the appropriate solution.
First of all, you will have to update you database will all the recent CHARACTER and COLLATION to utf8mb4 or at least which support UTF-8 data.
For Java:
while making a JDBC connection, add this to the connection URL useUnicode=yes&characterEncoding=UTF-8 as parameters and it will work.
For Python:
Before querying into the database, try enforcing this over the cursor
*
cursor.execute('SET NAMES utf8mb4')
cursor.execute("SET CHARACTER SET utf8mb4")
cursor.execute("SET character_set_connection=utf8mb4")
*
If it does not work, happy hunting for the right solution.
Solution 4
-
Set your code IDE language to UTF-8
-
Add <meta charset="utf-8"> to your webpage header where you collect data form.
-
Check your MySQL table definition looks like this:
CREATE TABLE your_table ( ... ) ENGINE=InnoDB DEFAULT CHARSET=utf8
-
If you are using PDO, make sure
$options = array(PDO::MYSQL_ATTR_INIT_COMMAND=>'SET NAMES utf8'); $dbL = new PDO($pdo, $user, $pass, $options);
If you already got a large database with above problem, you can try SIDU to export with correct charset, and import back with UTF-8.
Rick James
Using MySQL for over a decade, starting before version 4.0. Enjoy designing 'huge' databases. Hang out in forums.mysql.com (a lot more traffic there than here). Been programming since before the invention of Unix or SQL. My documents and tips: http://mysql.rjweb.org/
Updated on January 14, 2020Comments
-
Rick James over 4 years
I tried to use UTF-8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
????
instead of Asian characters. Even for European text, I gotSe?or
forSeñor
.- Strange gibberish (Mojibake?) such as
Señor
or新浪新闻
for新浪新闻
. - Black diamonds, such as Se�or.
- Finally, I got into a situation where the data was lost, or at least truncated:
Se
forSeñor
. - Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
-
Rick James almost 8 yearsThe
DEFAULT CHARSET
for a table is just that, a default. It can, and sometimes should, be overridden on the column definition. -
Rick James almost 8 yearsPDO is better done with the charset option:
$db = new PDO('dblib:host=host;dbname=db;charset=UTF8', $user, $pwd);
(This is listed in the link to my 'charcoll' document.) -
SIDU almost 8 yearsYou are 20K more expert than me :) Yes you can set charset for a column. Try not to over-use it. End up with more management time. Similarly you can grant access to a certain column of a MySQL table. However you do not have to use it unless you do not have better alternative.
-
Rick James almost 8 yearsI advocate any hex string or truely fixed length string should, in most cases, be ascii. This is to override the 'new' default of utf8, which takes 3 times the space for
CHAR(n)
. (This has bitten some people who did not notice the system-wide default changing to utf8.) -
Rick James over 7 yearsNotepad's
ANSI
is probably closest to MySQL'slatin1
. The 0x93 in that link is“
and probably came from some place like Word. You could either convert to utf8 (hexE2809C
) or tell MySQL that the data islatin1
and hope that you don't trip up somewhere else. -
pramodpxi over 6 yearsIf i alter the database to utf8, should the database be restarted to take the changes? And is there any chances of data loss.
-
spencer7593 over 6 years@ppmakeitcount: no, the
ALTER DATABASE
statement does not require a restart of MySQL to take affect. However, changing the default characterset for a database does not affect any tables currently in the database; it only has an effect on new tables e.g.CREATE TABLE
which do not specify a default characterset for the table; that's when the database default characterset comes into play. (Similarly, changing the default characterset of the table does not affect columns already in the table; it only has an effect on columns added to the table, when no column characterset is specified. -
david_adler over 6 yearsIf the client, database and tables are in
utf8mb4
I seem to be able to store emojis fine. Some blogs suggest also settingcollation-server
andcharacter-set-server
in mysqld. Do I really need to changemysqld
what difference does server setting make? -
Rick James over 6 years@david_adler - There are multiple ways to get the effect of those settings. The best is to use the clients connection parameters. Second best by executing
SET NAMES utf8mb4
right after connecting. After all, this is declaring the encoding in the client. -
Rick James almost 6 yearsWith MySQL 8.0 (now released), the default is
utf8mb4
andutf8mb4_0900_ai_ci
. Most users should use them without considering the other charsets and collations. -
Rick James almost 6 years@ppmakeitcount - There is no need to restart the database. However, be sure to use the appropriate
ALTER
, as discussed here: mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases -
Rick James almost 6 years
-
Rick James over 5 yearsAnother note: If a
FUNCTION
orSTORED PROCEDURE
is involved, you may have not been using the desired charset when you created it.DROP
it,SET NAMES
; re-CREATE
it. -
Rick James about 5 yearsYes, that is one of several things that can cause character set problems. Note: that syntax is valid for PHP, not other app languages, and only if using
mysqli
, notPDO
. -
Rick James over 4 yearsIn reference to Stored Routines, see stackoverflow.com/a/57796046/1766831
-
Rick James over 4 years1 month? That was fast. It took me more than a year to formulate this Q&A. Java looks right.
SETs
are not the 'right' way for Python; see mysql.rjweb.org/doc.php/charcoll#python Many other languages are discussed elsewhere in that blog. -
Brendon Whateley about 4 yearsYou probably want to change your answer to UTF8mb4, MySQL only allocates 3-bytes per character in UTF8, so if users add any valid 4-byte utf8 characters, then you will get database errors and lose data. Also, you really need to set one of the newer UTF8mb4 collations like utf8mb4_0900_xx_yy where the xx sets accent handling and yy sets capitalization handling. TL;DR: charset=utfmb4 and collation=utf8mb4_0900_ai_ci
-
Rick James almost 4 yearsdbfiddle.uk was recently fixed; see Answers and Comments in dba.stackexchange.com/questions/269014/…
-
Ashish Bhatt almost 4 years@RickJames But this issue exist with Mysql-Python below 1.2.4, so the
SET
statements are basically a work around. -
bballdave025 over 2 yearsThanks for answering your own question, @RickJames, i.e. giving advice that you believe will be helpful. It's a very useful reference that I've put in my quick-reference section. I'm going to archive it as well, to make sure I can always access it. Nice work!
-
Rick James over 2 years@bballdave025 - Thanks. It took me a long time -- first to discover all the different cases, then to figure out what caused each one, then more time to figure out how to explain them succinctly.
-
Peter Mortensen about 2 yearsRe "you can try SIDU": That is presumably your product or project(?). I suggest consulting some meta questions regarding disclosure. It shouldn't be necessary to open a new question; it is probably already covered.
-
Peter Mortensen about 2 yearsHow is the content near "cursor.execute" supposed to be formatted? Separate lines for each one? Or something else? Is "*" literate or intended as italics formatting?