PHP: Replace umlauts with closest 7-bit ASCII equivalent in a UTF-8 string

php utf-8 diacritics strtr

45,357

Solution 1

$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
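
The comments on this page report that the result of //TRANSLIT depends on the active locale and that invalid input triggers an "illegal characters" notice. A minimal sketch that takes both into account could look like the following. The helper name to_ascii_iconv() and the list of locale names are assumptions, not part of the answer, and the exact output still depends on the iconv implementation on your system.

```php
<?php
// Hypothetical helper (the name is not from the answer).
// Returns the converted string, or false if iconv() fails.
function to_ascii_iconv(string $input)
{
    // Read the current LC_CTYPE setting without changing it, so it can be restored.
    $previous = setlocale(LC_CTYPE, '0');

    // Only LC_CTYPE is changed, not LC_ALL, so number formatting is left alone.
    // Which of these locale names exist depends on the operating system.
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'en_US.utf8', 'C.UTF-8');

    // //IGNORE asks iconv to skip characters it cannot convert.
    $output = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);

    // Restore the previous locale if it was readable.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return $output;
}

echo to_ascii_iconv('lärm'), "\n";
```

Restoring the locale keeps the side effect local, which matters because setlocale() changes process-wide state.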


Solution 2

A little trick that doesn't require setting locales or having huge translation tables:

function Unaccent($string)
{
    if (strpos($string = htmlentities($string, ENT_QUOTES, 'UTF-8'), '&') !== false)
    {
        $string = html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1', $string), ENT_QUOTES, 'UTF-8');
    }

    return $string;
}

The only requirement for it to work properly is to save your files in UTF-8 (as you should already).
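
To show what the entity trick does in practice, here is a small usage sketch. It repeats the function under the hypothetical name unaccent_entities() so that the snippet runs on its own; the logic is meant to be the same as in Unaccent() above.

```php
<?php
// Copy of the answer's function under a hypothetical name,
// so this snippet can run on its own.
function unaccent_entities(string $string): string
{
    // Turn accented characters into named entities, e.g. "ä" becomes "&auml;".
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');

    if (strpos($string, '&') !== false) {
        // Keep only the leading letter(s) of entities such as &auml; or &AElig;,
        // then decode every remaining entity back to its character.
        $string = html_entity_decode(
            preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1', $string),
            ENT_QUOTES,
            'UTF-8'
        );
    }

    return $string;
}

echo unaccent_entities('lärm'), "\n";   // larm
echo unaccent_entities('andré'), "\n";  // andre
echo unaccent_entities('Æther'), "\n";  // AEther
```

Note that ligatures come out as two letters, because the entity name itself carries both (AElig gives AE). Characters that have no matching named entity are returned unchanged.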

Solution 3

You can also try this:

$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);

This requires the intl extension (http://php.net/manual/en/book.intl.php) to be available.
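
If you do not want the result lowercased, a shorter rule string may be enough. The following is a sketch under that assumption. The helper name to_ascii_intl() is hypothetical, and it returns null when the intl extension is not loaded so that callers can fall back to another approach.

```php
<?php
// Hypothetical helper (the name is not from the answer).
// Returns null if intl is unavailable or the transliteration fails.
function to_ascii_intl(string $input): ?string
{
    if (!function_exists('transliterator_transliterate')) {
        return null; // intl extension not loaded
    }

    // Any-Latin converts other scripts to Latin first,
    // Latin-ASCII then reduces Latin letters to ASCII.
    $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $input);

    return $result === false ? null : $result;
}

var_dump(to_ascii_intl('Fóø Bår'));
```

Unlike the rule string above, this keeps the original case because it omits the Lower() step.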

Solution 4

If you are using WordPress, you can use the built-in function remove_accents( $string )

https://codex.wordpress.org/Function_Reference/remove_accents

However, I noticed a bug: it doesn't work on a string with a single character.

Solution 5

Okay, found an obvious solution myself, but it's not the best concerning performance...

echo strtr(utf8_decode($input), 
           utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
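
A variation that avoids utf8_decode() altogether is to pass strtr() an array. In that form it replaces whole substrings, so multi-byte UTF-8 characters can be used as keys directly. This also sidesteps two issues with the snippet above: utf8_decode() targets ISO-8859-1, so characters outside that set (Š, Œ, Ž, š, œ, ž, Ÿ) become "?", and the function is deprecated as of PHP 8.2. The sketch below uses a hypothetical helper name, to_ascii_map(), and a deliberately abbreviated map that you would need to extend for your own data.

```php
<?php
// Hypothetical helper (the name is not from the answer).
function to_ascii_map(string $input): string
{
    // Abbreviated, illustrative map: extend it with the characters your data contains.
    // The source file must be saved as UTF-8 for the keys to match.
    static $map = [
        'Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U', 'ß' => 's',
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u',
        'é' => 'e', 'è' => 'e', 'ñ' => 'n', 'ç' => 'c',
        'Š' => 'S', 'š' => 's', 'Ž' => 'Z', 'ž' => 'z',
        'Œ' => 'O', 'œ' => 'o', 'Æ' => 'A', 'æ' => 'a',
    ];

    // With an array argument, strtr() replaces each key with its value.
    return strtr($input, $map);
}

echo to_ascii_map('lärm'), "\n";   // larm
echo to_ascii_map('andré'), "\n";  // andre
```

Characters missing from the map pass through unchanged, which is the whitelisting drawback raised in the comments.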



Author by

Nishan

Professional Web Developer since 2001, amateur developer since 198x. Eating and breathing JavaScript and PHP in my day-to-day life, but have seen a lot in my 30+ years of code-juggling. Adobe Certified Expert - Adobe Analytics Developer

Updated on June 02, 2020

Comments

Nishan about 4 years
What I want to do is to remove all accents and umlauts from a string, turning "lärm" into "larm" or "andré" into "andre". What I tried to do was to utf8_decode the string and then use strtr on it, but since my source file is saved as UTF-8 file, I can't enter the ISO-8859-15 characters for all umlauts - the editor inserts the UTF-8 characters.

Obviously a solution for this would be to have an include that's an ISO-8859-15 file, but there must be a better way than to have another required include?
```
echo strtr(utf8_decode($input), 
           'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ',
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
```
UPDATE: Maybe I was a bit inaccurate about what I'm trying to do: I do not actually want to remove the umlauts, but to replace them with their closest "one character ASCII" equivalent.
Nishan over 15 years

I had to add "setlocale(LC_ALL, 'en_US');" (sadly no locales for Germany seem to be available on my machine :( ), but then it works. Great! :)
spikey about 12 years

Why does this solution return "o for ö on my machine, while the examples in the PHP reference return oe?
Zebooka almost 12 years

This does not work for Cyrillic characters. They are converted to ? question marks instead.
laurent over 11 years

It's not the best in terms of performance and it also produces incorrect result. Letters like Œ, Æ, etc. should decompose to two letters, not to one.
Matt about 11 years

This bombs with a value of false and gives me a notice that illegal characters were encountered...
Michał Leon over 10 years

To spikey's comment: if you set your locale to de_*.UTF8 (de_DE.UTF8, de_CH.UTF8, etc.), then umlauts will be converted to *e (ü->ue). Set it to en_US.UTF8 to get the desired effect.
edditor almost 10 years

I have the same problem as spikey; the setlocale stuff doesn't help either.
Piskvor left the building over 9 years

You have missed žščřďťňů, and that's just the ones I see on my keyboard. Whitelisting known characters is not the best solution.
PeerBr over 9 years

setlocale() depends on your operating system, is not thread-safe and wreaks havoc if you do it wrong (such as treating commas as periods in conversions). Either be careful (using LC_CTYPE instead of LC_ALL in this case) or stay away from it unless you know exactly what you're doing.
Nishan over 8 years

@this.lau_ As mentioned in the question: I'm looking for the closest "one character ASCII", so no - two letter decomposition would not be correct for my use case. One letter is correct for what I'm looking to do.
vinczemarton over 7 years

Works great for Hungarian.
Jose Manuel Abarca Rodríguez over 5 years

Use "ascii//translit//ignore" to prevent "illegal characters encountered" error.
Constantin Groß almost 3 years

If iconv() with ASCII//TRANSLIT doesn't work for you with German umlauts (ä/ö/ü => ae/oe/ue), despite calling setlocale() with a German UTF-8 locale, this answer to another question was the solution for me: use transliterator_transliterate() with de-ASCII supplied via the transliterator build string.
Vladan over 2 years

Despite not actually being an exact answer, I appreciate this answer as I'm using WordPress. So thanks! ;)