Parse XML with special characters (UTF-8)

It looks like SimpleXML is creating a UTF-8 string, which is then rendered as ISO-8859-1 (Latin-1) or something close to it, such as CP-1252.

When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.

Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, e.g. ISO-8859-1 (Latin-1).

This is easily done with iconv():

    $xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
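A slightly fuller sketch of the same conversion, assuming the XML from the question, a page served as ISO-8859-1, and the SimpleXML and iconv extensions being available. The names `$xml`, `$name` and `$latin1` are illustrative only. The `//TRANSLIT` suffix is left out here because its behaviour varies between iconv implementations.

```php
<?php
// Sample document, with ì written as its UTF-8 bytes (0xC3 0xAC) so the
// example does not depend on how this source file is saved.
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<alldata><data name=\"Forset\xC3\xAC\" /></alldata>";

$doc = simplexml_load_string($xml);

// SimpleXML returns attribute values as UTF-8 strings.
$name = (string) $doc->data['name'];

// Convert for output on an ISO-8859-1 page; iconv() returns false on failure.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $name);

echo bin2hex($name), "\n";   // hex dump of the UTF-8 bytes
echo bin2hex($latin1), "\n"; // hex dump of the ISO-8859-1 bytes
```

Comparing the two hex dumps shows the two-byte sequence for ì collapsing into a single byte after conversion.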

Saving to database
Your database column is not using a UTF-8 character set, so you should use iconv() to convert the string to the charset that your database uses.

Assuming your database character set is the same as the encoding that you render in, you will not have to do anything when reading from the database.
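As a rough sketch of that step, the conversion can be wrapped in a small helper that is called just before the value is bound to an INSERT or UPDATE. `toDbCharset()` and its default of ISO-8859-1 are assumptions for illustration; substitute the charset your column actually declares. What happens with characters that cannot be represented depends on the iconv implementation, so the failure branch below may not trigger on every platform.

```php
<?php
// Hypothetical helper: convert a UTF-8 string to the column's charset.
// $dbCharset is an assumption; use whatever your column actually declares.
function toDbCharset(string $utf8, string $dbCharset = 'ISO-8859-1'): string
{
    // @ silences the notice iconv() can raise for unconvertible input;
    // the false return value is checked instead.
    $converted = @iconv('UTF-8', $dbCharset, $utf8);
    if ($converted === false) {
        throw new RuntimeException("Cannot represent value in $dbCharset");
    }
    return $converted;
}

// ì (UTF-8 bytes 0xC3 0xAC) fits in ISO-8859-1 as the single byte 0xEC.
$value = toDbCharset("Forset\xC3\xAC");
```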

Explanation
In UTF-8, a 0xC2 lead byte introduces the first half of the "Latin-1 Supplement" block (U+0080 to U+00BF), which includes currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space. Accented letters sit in the second half (U+00C0 to U+00FF) and use a 0xC3 lead byte; ì, for example, is encoded as 0xC3 0xAC.

However, in ISO-8859-1 the byte 0xC2 represents Â and 0xC3 represents Ã. So when your UTF-8 string is misinterpreted as ISO-8859-1 or CP-1252, you get Â or Ã followed by some other nonsense character.
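The effect can be reproduced deliberately. The sketch below takes the UTF-8 bytes of ì (0xC3 0xAC) and of a non-breaking space (0xC2 0xA0), then converts them as if each byte were one ISO-8859-1 character, which is the same mistake the browser or database makes. The variable names are illustrative.

```php
<?php
// UTF-8 encodings, written as raw bytes.
$iGrave = "\xC3\xAC"; // ì (U+00EC)
$nbsp   = "\xC2\xA0"; // non-breaking space (U+00A0)

// Simulate the mistake: treat each UTF-8 byte as one ISO-8859-1 character
// and re-encode the result as UTF-8.
$mangledI    = iconv('ISO-8859-1', 'UTF-8', $iGrave);
$mangledNbsp = iconv('ISO-8859-1', 'UTF-8', $nbsp);

echo bin2hex($mangledI), "\n";    // c383c2ac: Ã followed by ¬
echo bin2hex($mangledNbsp), "\n"; // c382c2a0: Â followed by a non-breaking space
```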

Author by

Stomped

Updated on October 28, 2022

Comments

  • Stomped
    Stomped over 1 year

    I'm starting out with some XML that looks like this (simplified):

    <?xml version="1.0" encoding="UTF-8"?>
    <alldata>
       <data name="Forsetì" />
    </alldata>
    

    But after I've parsed it with simplexml_load_string, the special character (the ì) becomes: Ã¬ which is obviously pretty mangled.

    Is there a way to prevent this from happening?

    I know for a fact the XML is fine: when saved as .txt and viewed in the browser, the characters are fine. When I use simplexml_load_string on the XML and then save values as a text file, or to the database, it's mangled.

  • Alan Moore
    Alan Moore about 14 years
    The only characters that are required to be replaced with entities in XML are the basic five markup characters: ampersand, apostrophe, quotation mark, and the angle brackets. Others may need to be replaced if the document's encoding doesn't support them, but that's not an issue with UTF-8.
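To illustrate the point about the five markup characters, here is a minimal sketch using PHP's `htmlspecialchars()` with the `ENT_XML1 | ENT_QUOTES` flags. The sample string is made up for the example; the ì is written as its UTF-8 bytes and passes through unchanged.

```php
<?php
// Escape only the XML markup characters; ì (bytes 0xC3 0xAC) is left untouched.
$raw = "Forset\xC3\xAC & <b> \"x\" 'y'";
$escaped = htmlspecialchars($raw, ENT_XML1 | ENT_QUOTES, 'UTF-8');

echo $escaped, "\n"; // Forsetì &amp; &lt;b&gt; &quot;x&quot; &apos;y&apos;
```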