PHP: UTF-8 character encoding


Solution 1

Your page is being served as UTF-8 so I'd point my finger at the database.

Make sure the connection is set to UTF-8 before running any SELECTs or INSERTs. In MySQL:

SET NAMES "utf8"
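As an alternative to issuing SET NAMES by hand, PDO's MySQL driver accepts a charset parameter in the DSN (PHP 5.3.6 and later). The sketch below only builds such a DSN; the helper name utf8Dsn and the host and database names are placeholders, not part of the original answer. It uses utf8mb4 because MySQL's older "utf8" charset stores at most three bytes per character.

```php
<?php
// Hypothetical helper (not from the original answer): build a MySQL DSN
// that requests a UTF-8 connection at connect time.
function utf8Dsn(string $host, string $dbname): string
{
    // utf8mb4 covers all of Unicode; MySQL's legacy "utf8" is limited to 3-byte characters.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// Usage (needs a reachable MySQL server, so it is left commented out):
// $dbh = new PDO(utf8Dsn('localhost', 'feeds'), $username, $password);
// $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
```

With the charset in the DSN, the connection is already UTF-8 before the first query runs, so a separate SET NAMES call should not be needed.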

Solution 2

Just a quick note about CURLOPT_ENCODING: it sets the Accept-Encoding request header, which controls content (compression) encoding and has nothing to do with character encoding. Supported accept encodings are "identity", "deflate", and "gzip", so 'UTF-8' is not a meaningful value for it.
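A minimal sketch of a feed fetch with these options set sensibly, assuming the cURL extension is loaded. The helper names feedCurlOptions and fetchFeed are made up for illustration. Note that the question's snippet sets CURLOPT_RETURNTRANSFER to 0, in which case curl_exec() prints the response and returns true rather than the body, so this sketch turns it on.

```php
<?php
// Hypothetical helper: cURL options for fetching a feed body as a string.
function feedCurlOptions(): array
{
    return [
        CURLOPT_RETURNTRANSFER => true, // curl_exec() returns the body instead of printing it
        CURLOPT_HEADER         => false, // keep response headers out of the returned string
        CURLOPT_ENCODING       => '',    // empty string: advertise every content encoding this cURL build supports
    ];
}

// Hypothetical helper: fetch a URL and return the raw response bytes.
function fetchFeed(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, feedCurlOptions());
    $data  = curl_exec($ch);
    $error = curl_error($ch);
    if ($data === false) {
        throw new RuntimeException("cURL request failed: $error");
    }
    return $data; // bytes in whatever character encoding the server used
}

// Usage (needs network access, so it is left commented out):
// $doc = new SimpleXMLElement(fetchFeed($url), LIBXML_NOCDATA);
```

Nothing here changes the character encoding of the response; that still has to be handled when the XML is parsed or displayed.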

Solution 3

It may have to do with the XML prologue, which looks like this for that particular feed you linked to:

<?xml version="1.0" encoding="ISO-8859-1" ?>

As far as I know, libxml, on which SimpleXML is based, looks for this kind of thing. I'm not sure about XML files, but I'm sure that with HTML strings it looks for META elements that specify the charset.

Try stripping the XML prologue (I solved a similar problem once by stripping the HTML META tags) and don't forget to utf8_encode() the data before feeding it to SimpleXMLElement.
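One way to sketch this idea: rather than stripping the prologue, read the encoding it declares, convert the bytes to UTF-8, and rewrite the declaration to match. This swaps in mb_convert_encoding() for utf8_encode(), since utf8_encode() only handles ISO-8859-1 input and is deprecated as of PHP 8.2. The function name feedToUtf8 is hypothetical, and the sketch assumes the mbstring extension is available and that the declared encoding is one mbstring recognises.

```php
<?php
// Hypothetical helper (not from the original answer): re-encode a feed to UTF-8
// using the encoding named in its XML declaration, then update the declaration.
function feedToUtf8(string $xml): string
{
    // Find encoding="..." inside the leading XML declaration, if there is one.
    if (!preg_match('/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([^"\']+)["\']/i', $xml, $m)) {
        return $xml; // no declared encoding: XML defaults to UTF-8, leave it alone
    }
    $declared = strtoupper($m[1]);
    if ($declared === 'UTF-8') {
        return $xml;
    }
    // Convert the bytes. On PHP 8 an encoding name unknown to mbstring throws a ValueError.
    $utf8 = mb_convert_encoding($xml, 'UTF-8', $declared);
    // Rewrite the declaration so the parser does not convert the data a second time.
    $fixed = preg_replace(
        '/(<\?xml[^>]*\bencoding\s*=\s*["\'])[^"\']+(["\'])/i',
        '${1}UTF-8${2}',
        $utf8,
        1
    );
    return $fixed ?? $utf8;
}

// Usage: $doc = new SimpleXMLElement(feedToUtf8($data), LIBXML_NOCDATA);
```

This only helps when the declared encoding matches the actual bytes; a feed that declares ISO-8859-1 but really contains Windows-1252 or UTF-8 data would still come out garbled.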

Author: Daniel Clark

Updated on January 04, 2020

Question

  • Daniel Clark

    I am scraping a list of RSS feeds using cURL, and then reading and parsing the RSS data with SimpleXML. The sorted data is then inserted into a MySQL database.

    However, as you can see at http://dansays.co.uk/research/MNA/rss.php, I am having several issues with characters not displaying correctly.

    Examples:

    âGuitar Hero: Van Halenâ Trailer And Tracklist Available
    
    NV 10/10/09 – Salt Lake City, UT 10/11/09 – Denver, CO 10/13/09 –
    

    I have tried using htmlentities and htmlspecialchars on the data before inserting it into the database, but that doesn't seem to resolve the issue.

    How can I resolve this issue?

    Thanks for any advice.

    Updated

    I've tried what Greg suggested, and the issue is still here...

    Here is the code I used to do SET NAMES in PDO:

    $dbh = new PDO($dbstring, $username, $password);
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $dbh->query('SET NAMES "utf8"');
    

    I did a bit of echo'ing of the SimpleXML data before it is sorted and inserted into the database, and I now believe the problem has something to do with cURL...

    Here is what I have for cURL:

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8');
    $data = curl_exec($ch);
    curl_close($ch);
    $doc = new SimpleXmlElement($data, LIBXML_NOCDATA);
    

    Issue Resolved

    I had to set the content charset in the RSS/HTML page to "UTF-8" to resolve this issue. I guess this isn't a real fix as the char problems are still there in the raw data. Looking forward to proper support for it in PHP6!
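For reference, a sketch of what declaring the charset on the output side might look like in PHP. Both helper names are invented for illustration. It assumes the strings coming out of SimpleXML are already UTF-8, which is how SimpleXML returns text.

```php
<?php
// Hypothetical helper: declare the page charset. Call it before any output is sent.
function sendUtf8Header(): void
{
    header('Content-Type: text/html; charset=UTF-8');
}

// Hypothetical helper: escape a UTF-8 string for HTML output.
// Naming the charset explicitly keeps multibyte characters such as curly quotes intact.
function e(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Usage:
// sendUtf8Header();
// echo '<h2>' . e($title) . '</h2>';
```

Escaping with htmlspecialchars() only protects the markup; it is the declared charset that tells the browser how to decode the bytes.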