php iconv translit for removing accents: not working as expected?


Solution 1

I use this standard function to turn arbitrary text into a valid URL slug. The key step seems to be the line after the "// remove unwanted characters" comment: it strips the stray quote and accent characters that iconv may leave behind.

It is taken from the Symfony framework documentation: http://www.symfony-project.org/jobeet/1_4/Doctrine/en/08 which in turn took it from http://php.vrana.cz/vytvoreni-pratelskeho-url.php (but I don't speak Czech ;-)

function slugify($text)
{
  // replace non letter or digits by -
  $text = preg_replace('#[^\\pL\d]+#u', '-', $text);

  // trim
  $text = trim($text, '-');

  // transliterate
  if (function_exists('iconv'))
  {
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
  }

  // lowercase
  $text = strtolower($text);

  // remove unwanted characters
  $text = preg_replace('#[^-\w]+#', '', $text);

  if ($text === '') // empty() would also reject the valid slug "0"
  {
    return 'n-a';
  }

  return $text;
}

echo slugify('é'); // --> "e"
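
About the doubled backslash in the first pattern: inside a single-quoted PHP string, \\ is an escape for one literal backslash, while \p and \d are left as written, so both spellings hand the same pattern to PCRE. A minimal check (the variable names are only illustrative):

```php
<?php
// In single-quoted strings "\\" collapses to one backslash;
// "\p" and "\d" are not escape sequences and are kept as written.
$doubled = '#[^\\pL\d]+#u';
$single  = '#[^\pL\d]+#u';

var_dump($doubled === $single); // bool(true)

// Both therefore replace runs of non-letters/non-digits identically.
echo preg_replace($doubled, '-', 'Crème brûlée 2022!'), "\n"; // Crème-brûlée-2022-
```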

Solution 2

Following @tchrist's suggestion, use the PHP intl extension:

http://fr2.php.net/manual/en/book.intl.php

preg_replace('/\pM*/u','',normalizer_normalize( $mystring, Normalizer::FORM_D));

eéèêëiîïoöôuùûüaâäÅ Ἥ ŐǟǠ ǺƶƈƉųŪŧȬƀ␢ĦŁȽŦ ƀǖ becomes

eeeeeiiiooouuuuaaaA Η OaA AƶƈƉuUŧOƀ␢ĦŁȽŦ ƀu
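
The one-liner above could be wrapped in a small helper, roughly as sketched here. The function name strip_accents is hypothetical, and the sketch assumes the intl extension is loaded (it provides the Normalizer class) and that the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper; requires the intl extension for Normalizer.
function strip_accents(string $text): string
{
    // Decompose: "é" becomes "e" followed by U+0301 COMBINING ACUTE ACCENT.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        // Normalization fails on invalid UTF-8; return the input unchanged.
        return $text;
    }

    // Drop every combining mark (Unicode general category M).
    return preg_replace('/\pM+/u', '', $decomposed) ?? $text;
}

echo strip_accents('eéèêë'), "\n"; // eeeee
echo strip_accents('Đ'), "\n";     // Đ (no decomposition, so it is left as is)
```

Using \pM+ instead of \pM* avoids matching the empty string at every position; the result is the same.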


As tchrist emphasises, not all Unicode characters have a decomposition.

Extracts from the Unicode code charts:

U0080.pdf

00CF Ï LATIN CAPITAL LETTER I WITH DIAERESIS

≡ 0049 I 0308 ¨

NB: the symbol « ≡ » indicates that a decomposition is available.

00D0 Ð LATIN CAPITAL LETTER ETH

→ 00F0 ð latin small letter eth

→ 0110 Đ latin capital letter d with stroke

→ 0189 Ɖ latin capital letter african d

No decomposition is available here, which IMHO is strange (we could consider the ASCII letter D an acceptable equivalent).

U0100.pdf

0110 Đ LATIN CAPITAL LETTER D WITH STROKE

→ 00D0 Ð latin capital letter eth

→ 0111 đ latin small letter d with stroke

→ 0189 Ɖ latin capital letter african d

Even stranger: this one is named LATIN CAPITAL LETTER D (WITH STROKE), yet it has no decomposition to D. Perhaps a cooler solution would be to look up the Unicode name of each character, compare it with the names of the ASCII characters, and replace accordingly. Anyone? ;-]

cf http://unicode.org/Public/UNIDATA/UnicodeData.txt
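
A rough sketch of that name-based idea, assuming the intl extension is available (IntlChar::charName returns the Unicode character name). The function names and the regular expression are illustrative assumptions, not an established recipe: only names of the form "LATIN CAPITAL/SMALL LETTER X ..." with a single base letter X are mapped, and everything else is left untouched.

```php
<?php
// Hypothetical helpers; require the intl extension for IntlChar.
function base_letter_by_name(string $char): string
{
    // e.g. "LATIN CAPITAL LETTER D WITH STROKE"
    $name = IntlChar::charName($char);
    if (!is_string($name)
        || !preg_match('/^LATIN (CAPITAL|SMALL) LETTER ([A-Z])(?: |$)/', $name, $m)) {
        return $char; // no single base letter in the name: keep the character
    }

    return $m[1] === 'SMALL' ? strtolower($m[2]) : $m[2];
}

function ascii_by_name(string $text): string
{
    $out = '';
    // Split into UTF-8 characters; preg_split returns false on invalid input.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [] as $char) {
        // Single-byte characters are already ASCII.
        $out .= strlen($char) === 1 ? $char : base_letter_by_name($char);
    }

    return $out;
}

echo ascii_by_name('Đorđe'), "\n"; // Dorde
echo ascii_by_name('Ð'), "\n";     // Ð ("LETTER ETH" has no single base letter)
```

This loses information for some names (anything after the base letter is ignored), so it is probably best used only as a fallback after normalization.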

Solution 3

This happened to me with plain iconv on the command line, without PHP. The trick was to set the LANG environment variable to en_US.UTF-8 (it was hu_HU.UTF-8 before, in my case). After that, it worked as expected.
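
Inside PHP, the closest equivalent is probably to set the LC_CTYPE locale category before calling iconv. This is a sketch under assumptions: the en_US.UTF-8 locale has to be installed on the system, and the exact //TRANSLIT output depends on the platform's iconv implementation, so no expected output is shown.

```php
<?php
// Assumption: an en_US UTF-8 locale is installed. setlocale() tries each
// name in turn and returns false if none of them is available.
$locale = setlocale(LC_CTYPE, 'en_US.UTF-8', 'en_US.utf8');
if ($locale === false) {
    error_log('Locale not available; //TRANSLIT may produce "?" or stray marks');
}

// The result is platform dependent (glibc, libiconv and musl differ),
// and iconv() returns false if the conversion fails.
$ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', 'è');
var_dump($locale, $ascii);
```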

Author by

dynamic


Updated on July 09, 2022

Comments

  • dynamic
    dynamic almost 2 years

    Consider this simple code:

    echo iconv('UTF-8', 'ASCII//TRANSLIT', 'è');
    

    it prints

     `e
    

    instead of just

     e
    

    Do you know what I am doing wrong?


    Nothing changed after adding setlocale:

    setlocale(LC_COLLATE, 'en_US.utf8');
    echo iconv('UTF-8', 'ASCII//TRANSLIT', 'è');
    
    • Michał Leon
      Michał Leon over 10 years
      Ignore tchris, this is THE way to do it, I use it in practice. The only error you made is that the locale "subclass" is setlocale(LC_CTYPE, 'en_US.UTF-8'); -> LC_TYPE, not _COLLATE. Tschüss.
    • Scott
      Scott over 8 years
      I'm having this same problem - it is certainly not LC_TYPE... that generates an error (for me at least). I've tried LC_ALL (which is what everyone else says) - with no effect. I'm putting in the string CŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ and getting CSOEZsoez"YyenuA'A^A~A"AAAECE'E^E"EI'I^I"ID~NO'O^O~O"OOU'U^U"U'Yssa'a^a~a"aaaece'e^e"ei'i^i"id~no'o^o~o"oou'u^u"u'y"y
  • dynamic
    dynamic about 13 years
    Same result as before with setlocale (see first post).
  • dynamic
    dynamic about 13 years
    I know I could do a preg_replace like that after the transliteration by iconv... I only wanted to know whether the behaviour described in my first post is standard, or whether iconv can transliterate "better".
  • dynamic
    dynamic about 12 years
    Sorry, but why are there two backslashes in the preg_replace? Shouldn't it be just [^\pL\d]?
  • NullPointer
    NullPointer almost 11 years
    What about plƒtre francin string where f does not get converted?
  • dearsina
    dearsina almost 5 years
    This is the only one that worked for me, on vanilla PHP7.2.