How to replace broken German Umlauts with MySQL REPLACE function?

mysql encoding utf-8 replace

10,487

Solution 1

To answer my own question, the linked possible duplicate pointed me into some direction but was not the right solution for me.

Instead, I did the following steps (using HeidiSQL):

Right-clicked the table in the tree.
Selected "Edit"
Changed the drop-down value of "Default collation" to "latin_german1_ci". Did not tick the "Convert data" checkbox.
Clicked the "Save" button at the bottom.
Changed the drop-down value of "Default collation" to "utf8_general_ci". Did not tick the "Convert data" checkbox.
Clicked the "Save" button at the bottom.

After that, I could successfully execute my statements like e.g.

UPDATE `mydb`.`mytable` SET `mycolumn` = REPLACE(`mycolumn`, 'Ã¼', 'ü');

Solution 2

Your 'Umlauts' are NOT broken! These are just encoded as utf-8 and so your 'u with two dots' results in a utf-8 2-byte sequence. There is nothing wrong about it. Your problem is that your viewer possibly displays in ANSI encoding (iso-8859-1) which results in a 1:1 relation between byte and character. However a viewer must "decode" utf-8 sequences to get the codepoint for the character, otherwise it will simply display the utf-8 start-byte as well as following-byte/s as the character it represents in ANSI. I bet your view is not configured to view utf-8 encoded text. Just configure it as appropriate and you're done.

Let your 'u with two dots' be utf-8 encoded, then every viewer worldwide can decode it's codepoint which is a unique unicode codepoint. Remember that if your 'u with two dots' is ANSI encoded, it will show possibly a different character when displayed with a different ANSI encoding, for example iso-8859-5.

Solution 3

IMHO, MySQL is correct in storing ü as Ã¼ using UTF-8 collation utf8_general_ci or better utf_unicode_ci. (Discussion on why to prefer utf8_unicode_ci)

Now, it matters how you decode your UTF-encoded SQL string before you print it on the frontend like HTML, Android or iOS.

HTML

In case of HTML, set utf-8 charset in <head> section of HTML page, German character ü will show correctly.

<head>
    <meta charset="UTF-8"/>
    ...
</head>

If doesn't, then in case of PHP wrap your variable in utf_decode() function before printing it.

Note: If you are not using PHP with MySQL, lookup for utf-8 decode function in your language of choice.

Android

In case of Android use

Html.fromHtml(String).toString();

iOS

In case of iOS use

(NSString *)stringByDecodingHTMLEntities;

10,487

Author by

Uwe Keim

German developer. Some of my apps: SharePoint Systemhaus Göppingen (zwischen Stuttgart und Ulm) Eigene Homepage erstellen Test Management Software Windows 10 Ereignisanzeige Very proud father of Felix (2012) and Ina (2014). Loves running, climbing and Indian food. Code Project member #234.

Updated on June 04, 2022

Comments

Uwe Keim almost 2 years
(I accidentially asked this question first on DBA but found out in the FAQ to ask SQL questions here on SO instead)

I'm trying to "repair" a MySQL database (UTF-8 collation) that was imported from an encoding-broken SQL text file.

I.e. the German umlauts are broken inside the database. What should read e.g. "ü" is displayed as "Ã¼".

Since I'm aware of the REPLACE function, I first tried the obvious:
```
SELECT * FROM `mydb`.`mytable` WHERE `mycolumn` LIKE '%Ã¼%';
```
This gave me the expected results.

Next I tried to replace them:
```
UPDATE `mydb`.`mytable` SET `mycolumn` = REPLACE(`mycolumn`, 'Ã¼', 'ü');
```
To my surprise, this resulted in zero rows affected.

So even knowing a lot about encoding, I still cannot figure out how to solve this (or whether it is possible at all).

My question:

How to use the REPLACE function to replace broken German umlauts?
Uwe Keim over 10 years

I'm the guy linking to Joel's Unicode article all the time, so I'm pretty sure that I know how this encoding stuff works ;-)
brighty over 10 years

Thanks for the link to Joel's Unicode article, it is really well explained, especially when it comes to the utf-8 encoding part. I'd appreciate if the article would become completed by explaining how codepoints bigger than 0xFFFF (characters beyond the Basic Multilingual Plane BMP) fit into wchar-arrays e. g. how this is done in UTF-16 (the lovely surrogates).