PHP: Replace umlauts with closest 7-bit ASCII equivalent in a UTF-8 string


Solution 1

$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
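
As the comments further down point out, the result of this call depends on the active locale and on the iconv implementation, and untranslatable input can trigger an "illegal character" notice. The following is a minimal sketch that combines those suggestions: it switches only LC_CTYPE (not LC_ALL), appends //IGNORE, and restores the previous locale afterwards. The function name to_ascii_iconv and the default locale en_US.UTF-8 are assumptions for illustration; the locale must be installed on your system for it to have any effect.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Transliterates a UTF-8 string to ASCII via iconv, changing only LC_CTYPE.
function to_ascii_iconv(string $input, string $locale = 'en_US.UTF-8')
{
    // Passing '0' queries the current LC_CTYPE setting without changing it.
    $previous = setlocale(LC_CTYPE, '0');

    // Returns false if the locale is not installed; iconv then runs
    // with whatever locale was active before.
    setlocale(LC_CTYPE, $locale);

    // //IGNORE asks iconv to skip characters it cannot represent.
    // The @ suppresses the notice; failure is reported via the return value.
    $result = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);

    // Restore the locale that was active before the call.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return $result; // string on success, false on failure
}
```

Calling to_ascii_iconv('lärm') should give "larm" on many glibc systems with an English locale, but the exact output (for example "a versus a for ä) varies between platforms, so test it on the machine you deploy to.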


Solution 2

A little trick that doesn't require setting locales or having huge translation tables:

function Unaccent($string)
{
    if (strpos($string = htmlentities($string, ENT_QUOTES, 'UTF-8'), '&') !== false)
    {
        $string = html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1', $string), ENT_QUOTES, 'UTF-8');
    }

    return $string;
}

The only requirement for it to work properly is to save your files in UTF-8 (as you should already).
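
To see why this works, it can help to run the three steps separately. The sketch below uses the same pattern as Unaccent() above on a sample string; the variable names are only for illustration.

```php
<?php
$input = 'lärm andré';

// Same pattern as in Unaccent(): capture the base letter(s) of a named entity.
$pattern = '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i';

// Step 1: accented letters become named HTML entities.
$entities = htmlentities($input, ENT_QUOTES, 'UTF-8');
// $entities is now 'l&auml;rm andr&eacute;'

// Step 2: keep only the captured base letter(s), dropping the accent name.
$stripped = preg_replace($pattern, '$1', $entities);

// Step 3: decode any entities that were not accent entities (quotes, &amp;, ...).
$result = html_entity_decode($stripped, ENT_QUOTES, 'UTF-8');

echo $result, PHP_EOL; // larm andre
```

Because the pattern captures up to two letters, ligature entities come out as two characters: æ (&aelig;) becomes "ae" and ß (&szlig;) becomes "sz". If you need strictly one ASCII character per input character, those cases need separate handling.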

Solution 3

You can also try this:

$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);

This requires the intl extension (http://php.net/manual/en/book.intl.php) to be available.
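
Note that the rule chain above also lowercases the text via Lower(). If you want to keep the case, a shorter variant is possible with the procedural transliterator_transliterate() function and the compound ID 'Any-Latin; Latin-ASCII'. This is a sketch, not the answer author's code; the wrapper name to_ascii_intl is made up for illustration.

```php
<?php
// Hypothetical wrapper, not part of the original answer.
// Converts text to ASCII with ICU transliteration, preserving case.
function to_ascii_intl(string $input): string
{
    // transliterator_transliterate() only exists when ext-intl is loaded.
    if (!function_exists('transliterator_transliterate')) {
        throw new RuntimeException('The intl extension is required.');
    }

    // Any-Latin converts other scripts to Latin; Latin-ASCII then
    // replaces accented Latin letters with ASCII equivalents.
    $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $input);

    if ($result === false) {
        throw new RuntimeException('Transliteration failed.');
    }

    return $result;
}
```

Usage would be echo to_ascii_intl('Fóø Bår');. Since Any-Latin runs first, this approach should also cope with non-Latin input such as Cyrillic, which the iconv solution reportedly turns into question marks.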

Solution 4

If you are using WordPress, you can use the built-in function remove_accents( $string )

https://codex.wordpress.org/Function_Reference/remove_accents

However, I noticed a bug: it doesn't work on a string with a single character.

Solution 5

Okay, found an obvious solution myself, but it's not the best concerning performance...

echo strtr(utf8_decode($input), 
           utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
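
A variation on the same idea avoids utf8_decode() altogether (it is deprecated as of PHP 8.2): when strtr() is given an array, it replaces whole byte sequences, so multi-byte UTF-8 characters can be used directly as keys. The sketch below is an assumption-laden example, not the original author's code; the function name replace_accents and the deliberately short map are illustrative and would need extending for real use.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Replaces accented characters using an explicit UTF-8 lookup table.
function replace_accents(string $input): string
{
    // Illustrative subset only; add every character you need to support.
    static $map = [
        'Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U',
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u',
        'é' => 'e', 'è' => 'e', 'ß' => 's',
        'Š' => 'S', 'š' => 's', 'Ž' => 'Z', 'ž' => 'z',
    ];

    // With an array argument, strtr() matches whole keys (byte sequences),
    // so the source file only has to be saved as UTF-8.
    return strtr($input, $map);
}

echo replace_accents('lärm andré'), PHP_EOL; // larm andre
```

This keeps the strict one-character-per-character mapping asked for in the question, but, like any whitelist, it silently passes through every character that is missing from the map.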
Author by

Nishan

Professional Web Developer since 2001, amateur developer since 198x. Eating and breathing JavaScript and PHP in my day-to-day life, but have seen a lot in my 30+ years of code-juggling. Adobe Certified Expert - Adobe Analytics Developer

Updated on June 02, 2020

Comments

  • Nishan
    Nishan about 4 years

    What I want to do is to remove all accents and umlauts from a string, turning "lärm" into "larm" or "andré" into "andre". What I tried to do was to utf8_decode the string and then use strtr on it, but since my source file is saved as a UTF-8 file, I can't enter the ISO-8859-15 characters for all umlauts: the editor inserts the UTF-8 characters.

    Obviously a solution for this would be to have an include that's an ISO-8859-15 file, but there must be a better way than to have another required include?

    echo strtr(utf8_decode($input), 
               'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ',
               'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
    

    UPDATE: Maybe I was a bit inaccurate about what I'm trying to do: I do not actually want to remove the umlauts, but to replace them with their closest "one character ASCII" equivalent.

  • Nishan
    Nishan over 15 years
    I had to add "setlocale(LC_ALL, 'en_US');" (sadly no locales for Germany seem to be available on my machine :( ), but then it works. Great! :)
  • spikey
    spikey about 12 years
    Why does this solution return "o for ö on my machine, while the examples in the PHP reference return oe?
  • Zebooka
    Zebooka almost 12 years
    This does not work for Cyrillic characters. They are converted to ? question marks instead.
  • laurent
    laurent over 11 years
    It's not the best in terms of performance and it also produces incorrect results. Letters like Œ, Æ, etc. should decompose to two letters, not to one.
  • Matt
    Matt about 11 years
    This bombs with a value of false and gives me a notice that illegal characters were encountered...
  • Michał Leon
    Michał Leon over 10 years
    To spikey's comment: if you set your locale to de_*.UTF8 (de_DE.UTF8, de_CH.UTF8, etc.), then umlauts will be converted to *e (ü->ue). Set it to en_US.UTF8 to get the desired effect.
  • edditor
    edditor almost 10 years
    I have the same problem as spikey; the setlocale stuff doesn't help either.
  • Piskvor left the building
    Piskvor left the building over 9 years
    You have missed žščřďťňů, and that's just the ones I see on my keyboard. Whitelisting known characters is not the best solution.
  • PeerBr
    PeerBr over 9 years
    setlocale() depends on your operating system, is not thread-safe and wreaks havoc if you do it wrong (such as treating commas as periods in conversions). Either be careful (using LC_CTYPE instead of LC_ALL in this case) or stay away from it unless you know exactly what you're doing.
  • Nishan
    Nishan over 8 years
    @this.lau_ As mentioned in the question: I'm looking for the closest "one character ASCII", so no - two letter decomposition would not be correct for my use case. One letter is correct for what I'm looking to do.
  • vinczemarton
    vinczemarton over 7 years
    Works great for Hungarian.
  • Jose Manuel Abarca Rodríguez
    Jose Manuel Abarca Rodríguez over 5 years
    Use "ascii//translit//ignore" to prevent the "illegal characters encountered" error.
  • Constantin Groß
    Constantin Groß almost 3 years
    If iconv() with ASCII//TRANSLIT doesn't work for you with German umlauts (ä/ö/ü => ae/oe/ue), despite setting setlocale() to a German UTF-8 locale, this answer to another question was the solution for me: using transliterator_transliterate() with de-ASCII supplied via the transliterator build string.
  • Vladan
    Vladan over 2 years
    Despite not actually being an exact answer, I appreciate this answer as I'm using WordPress. So thanks! ;)