How to load XML when PHP can't indicate the right encoding?

Solution 1

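You have to convert the document to UTF-8. If the source is ISO-8859-1, the easiest way is utf8_encode(). Note that it converts from ISO-8859-1 only and is deprecated as of PHP 8.2; mb_convert_encoding($content, 'UTF-8', 'ISO-8859-1') is the equivalent replacement.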

DOMDocument example:

$doc = new DOMDocument();
$content = utf8_encode(file_get_contents($url));
$doc->loadXML($content);

SimpleXML example:

$xmlInput = simplexml_load_string(utf8_encode(file_get_contents($url_or_file)));

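If you don't know the current encoding, run mb_detect_encoding() on the raw content and declare the result to the parser, for example:

// Detect on the raw bytes; calling utf8_encode() first would always yield UTF-8.
$content = file_get_contents($url_or_file);
// Candidate list and strict mode; UTF-8 is tried before ISO-8859-1.
$encoding = mb_detect_encoding($content, ['UTF-8', 'ISO-8859-1'], true);
$doc = new DOMDocument();
$res = $doc->loadXML("<?xml version='1.0' encoding='$encoding'?>" . $content);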

Notes:

  • If the encoding cannot be detected (mb_detect_encoding() returns false), you may try to force the encoding via utf8_encode().
  • If you're loading HTML via $doc->loadHTML() instead, you can still prepend an XML declaration.

If you know the encoding, use iconv() to convert it:

$xml = iconv('ISO-8859-1', 'UTF-8', $xmlInput);
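
When the encoding is uncertain, one option is to check whether the bytes are already valid UTF-8 and convert only if they are not. This is a minimal sketch, not a definitive implementation: the helper name toUtf8 is hypothetical, the ISO-8859-1 fallback is an assumption you would need to verify for the actual feed, and the mbstring and dom extensions are assumed to be available.

```php
<?php
// Hypothetical helper: return $raw as UTF-8, converting only when needed.
// $fallback is an assumption about the source encoding, not something detected.
function toUtf8(string $raw, string $fallback = 'ISO-8859-1'): string
{
    // Valid UTF-8 (including plain ASCII) is passed through untouched.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Otherwise reinterpret the bytes as $fallback and convert them.
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

// 0xA3 is the pound sign in ISO-8859-1; on its own it is invalid UTF-8.
$xml = toUtf8("<ROOT><NODE>\xA38,8</NODE></ROOT>");

$doc = new DOMDocument();
$doc->loadXML($xml);
echo $doc->documentElement->textContent, "\n";
```

Checking validity first avoids double-encoding input that was already UTF-8, which a blind utf8_encode() call would do.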

Solution 2

You could pre-process the document and add an XML declaration that specifies the encoding it is delivered in. You will have to work out what that encoding is yourself, of course. DOMDocument should then parse it.

Example XML declaration:

<?xml version="1.0" encoding="UTF-8" ?>
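
The pre-processing step could be sketched as follows. The helper name withXmlDeclaration is hypothetical, the ISO-8859-1 value is only an assumed encoding that you must confirm for your source, and the check for an existing declaration is deliberately simple.

```php
<?php
// Hypothetical helper: prepend an XML declaration when the document has none.
// $encoding is whatever the source is actually delivered in (an assumption here).
function withXmlDeclaration(string $raw, string $encoding): string
{
    // Leave the document alone if it already starts with a declaration.
    if (preg_match('/^<\?xml\b/', $raw) === 1) {
        return $raw;
    }
    return '<?xml version="1.0" encoding="' . $encoding . '"?>' . "\n" . $raw;
}

// The bytes stay in ISO-8859-1; the parser converts them using the declaration.
$raw = "<ROOT><NODE>\xA38,8</NODE></ROOT>";

$doc = new DOMDocument();
$doc->loadXML(withXmlDeclaration($raw, 'ISO-8859-1'));
echo $doc->documentElement->textContent, "\n"; // DOM returns strings as UTF-8
```

Unlike converting the bytes yourself, this leaves the content untouched and lets the parser do the conversion, so it only works for encodings the underlying libxml build supports.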

Author by

Admin

Updated on April 17, 2022

Comments

  • Admin
    Admin about 2 years

    I'm trying to load an XML source from a remote location, so I have no control over the formatting. Unfortunately, the XML file I'm trying to load has no encoding declaration:

    <ROOT xmlns:sql="urn:schemas-microsoft-com:xml-sql"> <NODE> </NODE> </ROOT>
    

    When trying something like:

    $doc = new DOMDocument( );
    $doc->load(URI);
    

    I get:

    Input is not proper UTF-8, indicate encoding ! Bytes: 0xA3 0x38 0x2C 0x38
    

    I've looked at ways to suppress this, but had no luck. How should I load this so that I can use it with DOMDocument?

  • Rushyo
    Rushyo over 14 years
    Obviously it wasn't UTF-8, or this wouldn't have been a problem. I refer to the crucial word 'example'. FYI, those byte codes do not automatically imply ISO-8859-1 either.