Comparing UTF-8 Strings

Solution 1

IMPORTANT

This answer is meant for situations where it's not possible to install or run the 'intl' extension. It only sorts strings by replacing accented characters with their non-accented equivalents. To sort accented characters according to a specific locale, a Collator is the better approach; see the other answer to this question for more information.

Sorting by non-accented characters in PHP 5.2

You can try converting both strings to ASCII using iconv() with the //TRANSLIT option, which replaces accented characters with their closest ASCII equivalents:

$str1 = iconv('utf-8', 'ascii//TRANSLIT', $str1);

Then do the comparison on the converted strings.

See the documentation here:

http://www.php.net/manual/en/function.iconv.php

[Updated in response to @Esailija's remark] I overlooked the problem that //TRANSLIT can translate accented characters in unexpected ways; for example, É can come out as 'E, with a leading quote. This problem is mentioned in this question: php iconv translit for removing accents: not working as excepted?

To make the iconv() approach work, I've added a code sample below that uses preg_replace() to strip everything except word characters and whitespace from the converted string.

<?php

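// The result of //TRANSLIT may depend on the current locale, so set it first
setlocale(LC_ALL, 'fr_FR');

$names = array(
   'Zoey and another (word) ',
   'Émilie and another word',
   'Amber',
);

$converted = array();

foreach ($names as $name) {
    // Transliterate to ASCII, then strip everything except word characters and whitespace
    $converted[] = preg_replace('#[^\w\s]+#', '', iconv('UTF-8', 'ASCII//TRANSLIT', $name));
}

// Note: this sorts the converted copies, not the original strings
sort($converted);

echo '<pre>';
print_r($converted);
echo '</pre>';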

// Array
// (
//     [0] => Amber
//     [1] => Emilie and another word
//     [2] => Zoey and another word 
// )
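
Since the question is about comparing two strings rather than sorting a list, the same conversion can also be wrapped in a comparison helper. This is only a minimal sketch: asciiSortKey() and compareUnaccented() are hypothetical names, the type declarations assume PHP 7 or later (on PHP 5.2 they would have to be dropped), and the fr_FR.UTF-8 locale is assumed to be installed. It also falls back to the unconverted string when iconv() returns false, which it does when the conversion fails.

```php
<?php

// Hypothetical helper: builds an ASCII-only sort key for a UTF-8 string.
function asciiSortKey(string $value): string
{
    // iconv() returns false on failure; @ hides the accompanying notice
    // so that we can fall back to the original string instead.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $value);
    if ($ascii === false) {
        $ascii = $value;
    }

    // Strip everything except word characters and whitespace,
    // e.g. a stray quote introduced by //TRANSLIT.
    return (string) preg_replace('#[^\w\s]+#', '', $ascii);
}

// Hypothetical helper: compares two UTF-8 strings by their ASCII sort keys.
// Returns < 0, 0 or > 0, like strcmp().
function compareUnaccented(string $a, string $b): int
{
    return strcmp(asciiSortKey($a), asciiSortKey($b));
}

// Assumption: this locale exists on the server; setlocale() returns false otherwise.
setlocale(LC_ALL, 'fr_FR.UTF-8');

if (compareUnaccented('Émilie', 'Zoey') < 0) {
    echo "Émilie sorts before Zoey\n";
}
```

What 'Émilie' is transliterated to depends on the iconv implementation and the locale, so it is worth checking the output of asciiSortKey() on the target server before relying on it.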

Solution 2

There is no native way to do this, but the intl extension (available from PECL) provides a Collator class: http://php.net/manual/de/class.collator.php

$c = new Collator('fr_FR');
if ($c->compare('Émily', 'Zoey') < 0) { echo 'Émily < Zoey'; }
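
To sort a whole list rather than compare a single pair, the same Collator can be used. A minimal sketch, assuming the intl extension is loaded and PHP 7 or later for the type declarations; sortNamesForLocale() is a hypothetical helper name, not part of the extension:

```php
<?php

// Hypothetical helper: returns a copy of $names sorted by the rules of $locale.
// Requires the intl extension, which provides the Collator class.
function sortNamesForLocale(array $names, string $locale): array
{
    $collator = new Collator($locale);

    // Collator::sort() sorts the array in place and reindexes it;
    // $names is a local copy, so the caller's array is not changed.
    $collator->sort($names);

    return $names;
}

// Only run the demo when the intl extension is actually available.
if (class_exists('Collator')) {
    print_r(sortNamesForLocale(array('Zoey', 'Émilie', 'Amber'), 'fr_FR'));
}
```

With 'fr_FR' this should put Émilie between Amber and Zoey, and the strings themselves are left unmodified.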

Solution 3

I recommend using the usort() function with a comparison callback, so that the values themselves are not modified but are still compared correctly.

Example:

<?php

setLocale(LC_ALL, 'fr_FR');

$names = [
   'Zoey and another (word) ',
   'Émilie and another word',
   'Amber'
];

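// Compares transliterated copies, so the original values stay untouched
function compare(string $a, string $b) {
    // Transliterate to ASCII, then strip everything except word characters and whitespace
    $a = preg_replace('#[^\w\s]+#', '', iconv('utf-8', 'ascii//TRANSLIT', $a));
    $b = preg_replace('#[^\w\s]+#', '', iconv('utf-8', 'ascii//TRANSLIT', $b));

    return strcmp($a, $b);
}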

usort($names, 'compare');

echo '<pre>';
print_r($names);
echo '</pre>';

with result:

Array
(
    [0] => Amber
    [1] => Émilie and another word
    [2] => Zoey and another (word) 
)
Author by

poudigne

Updated on June 30, 2021

Comments

  • poudigne almost 3 years

    I'm trying to compare two strings, let's say Émilie and Zoey. 'E' comes before 'Z', but É has a higher character code than Z, so a normal if ($str1 > $str2) won't work.

    I tried if (strcmp($str1, $str2) > 0), which still doesn't work. So I'm looking for a native way to compare strings with UTF-8 characters.

  • poudigne over 11 years
    Seems like a good solution, but the client's server is on PHP 5.2. Well, if there's no native way, I'll roll back to the massive string replace solution :(
  • thaJeztah over 11 years
    Yup, iconv() is native. You may need to call setlocale() first; you'll find some examples in the comments below the documentation page.
  • Fabian Schmengler over 11 years
    Then it's better to convert them instead of replacing characters manually; the iconv solution seems appropriate.
  • Esailija over 11 years
    @PLAudet -1 This is misleading because of the example. The result string is 'Emilie, so yes, it will appear before Z but it will also appear before A. Please use a collator from php.net/manual/en/intl.requirements.php
  • thaJeztah over 11 years
    @Esailija I stand corrected, you're right about the quote in front of Emilie, hadn't realized that //TRANSLIT did this. I agree that the 'Collator' approach is the official way to do it, but OP stated that he didn't have the option to use that. I added a fix to my answer by preg_replacing non-word characters from the string
  • Esailija over 11 years
    @thaJeztah only because he thinks it requires php 5.3, however it can be used with 5.2 which is why I linked the requirements page so he can read it more carefully this time ;P
  • thaJeztah over 11 years
    @Esailija I've added your information to the answer fab provided so that the OP is able to make a decision which approach to take. My edit is currently waiting for review.
  • nickohrn almost 10 years
    Thanks for this. Worked for me!
  • Nux over 2 years
    This is built-in as of PHP 5.3. So you can use e.g. $collator = collator_create('pl_PL'); and then use a compare function return collator_compare($collator, $a, $b);. php.net/manual/en/collator.compare.php