Solving UTF-8 & French accents incompatibility

10,147

Solution 1

mysql_set_charset('utf8', $db_handle) tells the database that the data you're going to send it will be encoded in UTF-8. If the result is messed up, that means you did not in fact send UTF-8 encoded text. Double check the encoding of what you're sending.
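One way to double check what you are actually sending is to look at the raw bytes. The following is a minimal sketch, assuming PHP 7+ with the mbstring extension; `describe_bytes` is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: show the raw bytes of a string as hex and
// report whether those bytes form valid UTF-8.
function describe_bytes(string $s): string {
    $verdict = mb_check_encoding($s, 'UTF-8') ? 'valid UTF-8' : 'NOT valid UTF-8';
    return bin2hex($s) . ' (' . $verdict . ')';
}

$utf8   = "\xC3\xA9"; // "é" encoded as UTF-8 (two bytes)
$latin1 = "\xE9";     // "é" encoded as ISO-8859-1 / CP1252 (one byte)

echo describe_bytes($utf8), "\n";   // c3a9 (valid UTF-8)
echo describe_bytes($latin1), "\n"; // e9 (NOT valid UTF-8)
```

If the accented string from your script prints as a single `e9` byte per "é", the file is most likely saved in a single-byte encoding rather than UTF-8.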

I thought UTF8 does support characters like the french accents, but obviously it doesn't!

It does just fine.


See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text and Handling Unicode Front To Back In A Web App.

Solution 2

Is the PHP source file saved as UTF-8? This depends on the encoding your editor uses. If so, the bytes in the string literal should already be correct. That seems to be the case, since the Arabic text is stored correctly too.

Use prepared statements for the SQL. This has several advantages: security (protection against SQL injection), escaping of quotes and other special characters, and possibly the encoding of the SQL string.
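A prepared statement for the insert in the question could look roughly like this. It is only a sketch: it assumes PDO instead of the `mysql_*` functions used in the question (that extension was removed in PHP 7), and `insert_post` is a hypothetical helper name. The table and column names are taken from the question.

```php
<?php
// Hypothetical helper: insert one post through a prepared statement,
// so the text is sent as a bound parameter instead of being spliced
// into the SQL string.
function insert_post(PDO $pdo, int $id, string $subject, string $text): void {
    $stmt = $pdo->prepare(
        'INSERT INTO my_table (post_id, post_subject, post_text) VALUES (?, ?, ?)'
    );
    $stmt->execute([$id, $subject, $text]);
}

// Usage, assuming the credentials from the question. The charset in
// the DSN plays the role of mysql_set_charset('utf8', ...).
// $pdo = new PDO(
//     'mysql:host=localhost;dbname=my_db;charset=utf8',
//     'username',
//     'password',
//     [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
// );
// insert_post($pdo, 1, 'subject 1', 'je viens de télécharger et installer le logiciel');
```

Note that this does not change the encoding of the data: the string passed in still has to be valid UTF-8 when the connection charset is UTF-8.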

Unlikely to help, but try:

$s   = utf8_encode("je viens de télécharger et installer le logiciel");

I can foresee another problem, though: utf8_encode expects an ISO-8859-1 string, which is feasible for French but not for Arabic. If this works, the PHP file is not actually saved as UTF-8.
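To see why this only helps when the input really is ISO-8859-1, here is a small sketch, assuming PHP 7+ with mbstring. It uses `mb_convert_encoding` to perform the same ISO-8859-1 to UTF-8 conversion that `utf8_encode` does (`utf8_encode` itself is deprecated as of PHP 8.2); `latin1_to_utf8` is a hypothetical name.

```php
<?php
// Hypothetical helper: reinterpret the input bytes as ISO-8859-1 and
// re-encode them as UTF-8, which is what utf8_encode() does.
function latin1_to_utf8(string $s): string {
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "\xE9";     // "é" as ISO-8859-1
$utf8   = "\xC3\xA9"; // "é" already as UTF-8

// Correct input: one Latin-1 byte becomes the two-byte UTF-8 form.
echo bin2hex(latin1_to_utf8($latin1)), "\n"; // c3a9

// Wrong input: already-UTF-8 bytes get encoded a second time,
// which displays as "Ã©" instead of "é".
echo bin2hex(latin1_to_utf8($utf8)), "\n";   // c383c2a9
```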

(I find Java to be more consistent w.r.t. Unicode, so I am not entirely sure for PHP.)

Solution 3

The issue of knowing the encoding, and converting if necessary, can be addressed with something like this, which makes sure that the text is CP1252. Reverse the conversion to make sure it is UTF-8 instead.

function conv_text($value) {
    $result = mb_detect_encoding($value." ","UTF-8,CP1252") == "UTF-8" ? iconv("UTF-8", "CP1252", $value ) : $value;
    return $result;
}
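The reversed direction might look like the sketch below. It assumes PHP 7+ with mbstring, and it assumes that any input that is not valid UTF-8 is CP1252, which is a guess rather than real detection. `to_utf8` is a hypothetical name.

```php
<?php
// Hypothetical helper: return the value as UTF-8.
// Assumption: anything that is not valid UTF-8 is CP1252 (Windows-1252).
function to_utf8(string $value): string {
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
}

echo bin2hex(to_utf8("\xE9")), "\n";     // c3a9 (converted from CP1252)
echo bin2hex(to_utf8("\xC3\xA9")), "\n"; // c3a9 (left as is)
```

This only distinguishes two encodings. Text in any other single-byte encoding would be silently converted as if it were CP1252.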
Author by

TheDude

Updated on June 04, 2022

Comments

  • TheDude
    TheDude almost 2 years

    I have a PHP script which saves user content into a mysql database (PHP 5.4, mysql 5.5.31)

    All string-related fields in my database have utf8_unicode_ci as collation.

    My (simplified) code looks like this:

    $db_handle = mysql_connect('localhost', 'username', 'password');
    mysql_select_db('my_db');
    
    mysql_set_charset('utf8', $db_handle);
    
    // ------ INSERT: First example -------
    $s   = "je viens de télécharger et installer le logiciel";
    $sql = "INSERT INTO my_table (post_id, post_subject, post_text) VALUES (1, 'subject 1', '$s')";
    mysql_query($sql, $db_handle);
    
    // ------ INSERT: Second example -------
    $s   = "EPrints and العربية";
    $sql = "INSERT INTO my_table (post_id, post_subject, post_text) VALUES (2, 'subject 2', '$s')";
    mysql_query($sql, $db_handle);
    // ------------- 
    
    mysql_close($db_handle);
    

    The problem is that the first insert (Latin text with the é accents) fails unless I comment out this line:

    mysql_set_charset('utf8', $db_handle);
    

    But the second query (a mix of Latin & Arabic content) will fail unless I call mysql_set_charset('utf8', $db_handle);

    I've been struggling with this for 2 days now. I thought UTF8 does support characters like the French accents, but obviously it doesn't!

    How can I fix this?

  • TheDude
    TheDude almost 11 years
    Thanks, the next question is, obviously, how to check the encoding of the input text. I googled a lot and tried several proposed solutions; it seems that this one does it
  • Gromski
    Gromski almost 11 years
    @TheDude No, that method is a bad fix which sometimes works for people who don't understand what they're doing. Don't be one of these people. Read the article I linked above to understand encodings. It depends on where the string comes from. If it's simply hardcoded in a file, the encoding of the file as saved in the text editor determines its encoding.
  • TheDude
    TheDude almost 11 years
    Thanks, you're saying that I must detect encoding and then call iconv. But the question again is: how can I reliably do that? I ended up with this code, it seems to work, but (1) I'm not sure how reliable it is and (2) it doesn't support all/common charsets. Care to comment if this is enough for text coming from all different parts of the planet?
  • Gromski
    Gromski almost 11 years
    You cannot "detect" encodings reliably, by definition. You need to know what something is encoded in. If you have control over it, make sure it's in a known encoding. If you don't, rely on metadata that comes with the data (like HTTP headers). If you have neither, you're screwed and must resort to guessing.
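As an illustration of relying on metadata instead of guessing, the sketch below reads the declared charset from a Content-Type header value and converts to UTF-8 only when one is declared. It assumes PHP 7.1+ with mbstring; both function names are hypothetical, and the header values are made-up examples.

```php
<?php
// Hypothetical helper: extract the charset parameter from a
// Content-Type header value, or return null if none is declared.
function charset_from_content_type(string $header): ?string {
    if (preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $header, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: convert a body to UTF-8 using the declared
// charset. With no declaration there is nothing to rely on, so the
// body is returned unchanged rather than guessed at.
function body_to_utf8(string $body, string $contentType): string {
    $charset = charset_from_content_type($contentType);
    if ($charset === null) {
        return $body;
    }
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

echo bin2hex(body_to_utf8("\xE9", 'text/plain; charset=ISO-8859-1')), "\n"; // c3a9
```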