Another PHP XML parsing error: "Input is not proper UTF-8, indicate encoding!"

14,794

When ç is stored as the single byte 0xE7, your encoding is Windows-1252 (or maybe ISO-8859-1), not UTF-8.
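The parser error quoted in the question reports the bytes 0xE7 0x61 0x69 0x73, which spell "çais" in Windows-1252 and ISO-8859-1. A minimal sketch of validating and converting such input before it reaches the XML parser follows. It assumes the mbstring extension is loaded and that anything failing UTF-8 validation is Windows-1252; the helper name `ed_ensure_utf8` is made up for illustration.

```php
<?php
// Hypothetical helper (not from the original post): return $value as UTF-8.
// Assumption: input that is not valid UTF-8 is Windows-1252.
function ed_ensure_utf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
}

// The bytes from the parser error: 0xE7 0x61 0x69 0x73.
$raw = "\xE7\x61\x69\x73";
$fixed = ed_ensure_utf8($raw);

echo bin2hex($raw), "\n";   // e7616973
echo bin2hex($fixed), "\n"; // c3a7616973 (ç is now the two bytes C3 A7)
```

Plain ASCII passes through unchanged, so the helper can be applied to every field before the XML is built.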

Author: TechFanDan

Updated on June 04, 2022

Comments

  • TechFanDan
    TechFanDan almost 2 years

    Error:

    Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 3: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xE7 0x61 0x69 0x73

    XML from database (output from view source in FF):

    <?xml version="1.0" encoding="UTF-8" ?><audit><audit_detail>
        <fieldname>role_fra</fieldname>
        <old_value>Role en fran&#xe7;ais</old_value>
        <new_value>Role &#xe7; en fran&#xe7;ais</new_value>
    </audit_detail></audit></xml>
    

    If I understand correctly, the error is caused by the first ç in the old_value tag. To be precise, do the reported bytes (0xE7 0x61 0x69 0x73) correspond to "çais"?

    Here's how I load the XML:

    $xmlData = simplexml_load_string($ed['updates'][$i]['audit_data']);
    

    Then I loop through it using this:

    foreach ($xmlData->audit_detail as $a){
    //code here
    }
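When `simplexml_load_string()` fails it returns `false`, and the loop above then has nothing to iterate over. A possible way to capture the libxml messages instead of relying on PHP warnings is sketched here. The wrapper name `ed_load_audit_xml` is hypothetical, and the exact message text depends on the installed libxml2 version.

```php
<?php
// Hypothetical wrapper: parse XML and collect libxml errors instead of PHP warnings.
function ed_load_audit_xml(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // buffer errors internally
    $doc = simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting
    return [$doc, $messages];
}

// A lone 0xE7 byte is not valid UTF-8, so parsing is expected to fail.
[$doc, $errors] = ed_load_audit_xml("<audit><old_value>fran\xE7ais</old_value></audit>");
if ($doc === false) {
    foreach ($errors as $message) {
        echo $message, "\n";
    }
}
```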
    

    The field in the database has the data type text and uses the utf8_general_ci collation.

    My function to create the audit_detail stubs:

    function ed_audit_node($field, $new, $old){
    
    
        $old = htmlentities($old, ENT_QUOTES, "UTF-8");
        $new = htmlentities($new, ENT_QUOTES, "UTF-8");
    
        $out = <<<EOF
            <audit_detail>
                <fieldname>{$field}</fieldname>
                <old_value>{$old}</old_value>
                <new_value>{$new}</new_value>
            </audit_detail>
    EOF;
        return $out;
    }
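One thing worth checking in the function above: `htmlentities()` emits HTML named entities such as `&ccedil;`, while XML predefines only `&amp;`, `&lt;`, `&gt;`, `&quot;` and `&apos;`. A sketch of an XML-oriented variant is shown here, assuming the inputs are already valid UTF-8. The function name `ed_audit_node_xml` is hypothetical and the markup is flattened onto one line for brevity.

```php
<?php
// Hypothetical variant of ed_audit_node(): escape for XML rather than HTML.
// Assumption: $field, $new and $old are valid UTF-8 (htmlspecialchars()
// returns an empty string for invalid input unless told otherwise).
function ed_audit_node_xml(string $field, string $new, string $old): string
{
    $flags = ENT_QUOTES | ENT_XML1; // escape only & < > " ' using XML rules
    $field = htmlspecialchars($field, $flags, 'UTF-8');
    $old = htmlspecialchars($old, $flags, 'UTF-8');
    $new = htmlspecialchars($new, $flags, 'UTF-8');

    return "<audit_detail>"
        . "<fieldname>{$field}</fieldname>"
        . "<old_value>{$old}</old_value>"
        . "<new_value>{$new}</new_value>"
        . "</audit_detail>";
}

$node = ed_audit_node_xml('role_fra', "Role \xC3\xA7 & <b>", "Role en fran\xC3\xA7ais");
$xml = simplexml_load_string('<?xml version="1.0" encoding="UTF-8"?><audit>' . $node . '</audit>');
echo $xml->audit_detail->old_value, "\n";
```

The non-ASCII characters stay as raw UTF-8 bytes, which matches the `encoding="UTF-8"` declaration.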
    

    The insert in the database is done like this:

    function ed_audit_insert($ed, $xml){
        global $visitor;
    
        $sql = <<<EOF
        INSERT INTO ed.audit
        (employee_id, audit_date, audit_action, audit_data, user_id) 
        VALUES (
            {$ed['emp']['employee_id']}, 
            now(), 
            '{$ed['audit_action']}', 
            '{$xml}', 
            {$visitor['user_id']}
        );      
    EOF;
        $req = mysql_query($sql,$ed['db']) or die(db_query_error($sql,mysql_error(),__FUNCTION__));
    
    }
    

    The weirdest part is that the following works (without the xml declaration though) in a simple PHP file:

    $testxml = <<<EOF
    <audit><audit_detail>
            <fieldname>role_fra</fieldname>
            <old_value>Role en fran&#xe7;ais</old_value>
            <new_value>Role &#xe7; en fran&#xe7;ais</new_value>
        </audit_detail></audit>
    EOF;
    

    $xmlData = simplexml_load_string($testxml);

    Can someone help shed some light on this?

    Edit #1 - I'm now using DOM to build the XML document and have gotten rid of the error. Function here:

    $dom = new DomDocument();
    $root = $dom->appendChild($dom->createElement('audit'));
    $xmlCount = 0;
    
    if($role_fra != $curr['role']['role_fra']){
       $root->appendChild(ed_audit_node($dom, 'role_fra', $role_fra, $curr['role']['role_fra'])); 
       $xmlCount++;
    }
    
    ...
    
    function ed_audit_node($dom, $field, $new, $old){
    
        //create audit_detail node
        $ad = $dom->createElement('audit_detail');
    
        $fn = $dom->createElement('fieldname');
        $fn->appendChild($dom->createTextNode($field));
        $ad->appendChild($fn);
    
        $ov = $dom->createElement('old_value');
        $ov->appendChild($dom->createTextNode($old));
        $ad->appendChild($ov);
    
        $nv = $dom->createElement('new_value');
        $nv->appendChild($dom->createTextNode($new));
        $ad->appendChild($nv);
    
        //append to document
        return $ad;
    }
    
    if($xmlCount != 0){
        ed_audit_insert($ed,$dom->saveXML());   
    }
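To check that the DOM-built document survives a round trip, a small self-contained sketch along these lines may help. It passes an explicit version and encoding to the `DOMDocument` constructor, which the code above does not do, and the sample text is invented for the test.

```php
<?php
// Build a tiny audit document with an explicit UTF-8 declaration.
$dom = new DOMDocument('1.0', 'UTF-8');
$root = $dom->appendChild($dom->createElement('audit'));
$detail = $root->appendChild($dom->createElement('audit_detail'));
$value = $detail->appendChild($dom->createElement('new_value'));

// createTextNode() expects UTF-8 input and escapes markup characters itself.
$value->appendChild($dom->createTextNode("Role \xC3\xA7 & fran\xC3\xA7ais"));

$xml = $dom->saveXML();
echo $xml;

// Parse it back and confirm the text is unchanged.
$parsed = simplexml_load_string($xml);
echo $parsed->audit_detail->new_value, "\n";
```

No manual entity escaping is needed with this approach, because the DOM serializer handles `&` and `<` when the document is saved.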
    

    However, I think I now have a display problem: the text "Roééleç sé en franêais" (new_value) is not displayed correctly.

    [Image: display problem]

    In my HTML document, I have the following declaration for content-type (unfortunately, I don't hold the keys to make changes here):

    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    ...
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
    

    I've tried iconv() to convert to ISO-8859-1; however, most of the special characters are removed during the conversion. All that remains is "Ro" with this command:

    iconv('UTF-8','ISO-8859-1',$node->new_value);
    

    [Image: iconv output]
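Since the page charset cannot be changed, one option is to sidestep byte-level conversion and emit HTML entities, which are plain ASCII and render the same under any declared charset. The sketch below assumes mbstring is available and that input failing UTF-8 validation is already ISO-8859-1, which could also explain a conversion that stops at the first accented character. The helper name `ed_html_for_latin1_page` is hypothetical.

```php
<?php
// Hypothetical helper: make a string safe to print on a page declared as ISO-8859-1.
function ed_html_for_latin1_page(string $value): string
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        // Assumption: non-UTF-8 input is already ISO-8859-1.
        $value = mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
    }
    // Named entities are pure ASCII, so the page charset no longer matters.
    return htmlentities($value, ENT_QUOTES, 'UTF-8');
}

// "Roééleç sé en franêais" written as UTF-8 byte escapes.
$text = "Ro\xC3\xA9\xC3\xA9le\xC3\xA7 s\xC3\xA9 en fran\xC3\xAAais";
echo ed_html_for_latin1_page($text), "\n";
// Ro&eacute;&eacute;le&ccedil; s&eacute; en fran&ecirc;ais
```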

    The field in the db is: utf8_general_ci. However, the connection charset would be whatever is the default.
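If the connection charset is left at its default, the server may convert text between the column charset and the connection charset in both directions. A sketch of setting it explicitly is given below, assuming the mysqli extension rather than the `mysql_*` functions used above. The function name and parameter list are hypothetical; the table and column names are taken from the INSERT shown earlier.

```php
<?php
// Hypothetical replacement for ed_audit_insert(); assumes a connected mysqli handle.
function ed_audit_insert_utf8(mysqli $db, int $employeeId, string $action, string $xml, int $userId): bool
{
    // Make the connection charset explicit instead of relying on the default.
    $db->set_charset('utf8mb4');

    // Placeholders avoid quoting problems when the XML contains apostrophes.
    $stmt = $db->prepare(
        'INSERT INTO ed.audit (employee_id, audit_date, audit_action, audit_data, user_id)
         VALUES (?, NOW(), ?, ?, ?)'
    );
    if ($stmt === false) {
        return false;
    }

    $stmt->bind_param('issi', $employeeId, $action, $xml, $userId);
    $ok = $stmt->execute();
    $stmt->close();
    return $ok;
}
```

Calling it requires a live database connection, so it is shown here only as a function definition.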

    Not quite sure where to go from here...

    Edit #2 - I tried utf8_decode() to see if that would help, but it didn't.

    utf8_decode($a->new_value);
    

    [Image: utf8_decode output]

    I also noticed that my field in the db does contain UTF-8, which is good.
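For reference, `utf8_decode()` is deprecated as of PHP 8.2, and `mb_convert_encoding()` is a common substitute. The sketch below uses an invented sample document and assumes mbstring is available. It casts the SimpleXML node to a string first, then prints the bytes before and after conversion so the result can be checked without relying on how a browser renders it.

```php
<?php
// Sample document with "français" in UTF-8 (ç = C3 A7).
$xml = simplexml_load_string(
    "<audit><audit_detail><new_value>fran\xC3\xA7ais</new_value></audit_detail></audit>"
);

foreach ($xml->audit_detail as $a) {
    $utf8 = (string) $a->new_value; // SimpleXML returns text as UTF-8
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    echo bin2hex($utf8), ' -> ', bin2hex($latin1), "\n";
    // 6672616ec3a7616973 -> 6672616ee7616973
}
```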