PHP DOMDocument failing to handle utf-8 characters (☆)

Solution 1

DOMDocument::loadHTML() expects an HTML string.

HTML uses the ISO-8859-1 encoding (ISO Latin Alphabet No. 1) as its default per its specs. That has been the case for a long time, see 6.1. The HTML Document Character Set. In practice this is closer to the default support for Windows-1252 in common web browsers.

I go back that far because PHP's DOMDocument is based on libxml, which brings the HTMLparser that is designed for HTML 4.0.

I'd say it's safe to assume then that you can load an ISO-8859-1 encoded string.

Your string is UTF-8 encoded. Turn all characters higher than 127 (0x7F) into HTML entities and you're fine. If you don't want to do that on your own, that is what mb_convert_encoding with the HTML-ENTITIES target encoding does:

  • Those characters that have named entities will get the named entity, e.g. € -> &euro;
  • The others get their numeric (decimal) entity, e.g. ☆ -> &#9734;

The following is a code example that makes the process a bit more visible by using a callback function:

$html = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function($match) {
    list($utf8) = $match;
    $entity = mb_convert_encoding($utf8, 'HTML-ENTITIES', 'UTF-8');
    printf("%s -> %s\n", $utf8, $entity);
    return $entity;
}, $html);

For your string, this outputs:

☆ -> &#9734;
☆ -> &#9734;
☆ -> &#9734;

Anyway, that's just for looking deeper into your string. You want to have it converted into an encoding loadHTML can deal with. That can be done by converting everything outside of US-ASCII into HTML entities:

$us_ascii = mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8');
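On PHP 8.2 and later, passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated. A possible replacement, sketched here under the assumption that numeric entities are good enough (named entities such as &euro; are not produced), is mb_encode_numericentity. The helper name toAsciiEntities is made up for this illustration:

```php
<?php
// Sketch: turn every non-ASCII character of a UTF-8 string into a decimal
// numeric entity, so that loadHTML() only ever sees US-ASCII.
// toAsciiEntities() is a hypothetical helper name, not part of PHP.
function toAsciiEntities(string $utf8): string
{
    // Map layout: [first code point, last code point, offset, bit mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

$ascii = toAsciiEntities('<h1>☆ Hello ☆ World ☆</h1>');
echo $ascii, "\n"; // <h1>&#9734; Hello &#9734; World &#9734;</h1>
```

Markup characters such as < and > are below 0x80 and stay untouched, so the result can be handed to loadHTML() directly.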

Take care that your input is actually UTF-8 encoded. If you have mixed encodings (that can happen with some inputs), note that mb_convert_encoding can only handle one encoding per string. I already outlined above how to do more specific string replacements with the help of regular expressions, so I leave out further details for now.
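One way to check that assumption before converting is mb_check_encoding. This is only a sketch: the helper name isValidUtf8 and the sample byte strings are invented for illustration, and a string can still be valid UTF-8 without being the text you expected.

```php
<?php
// Sketch: test whether a byte string is well-formed UTF-8 before it is
// converted. isValidUtf8() is a hypothetical helper name.
function isValidUtf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

$good = "\xE2\x98\x86 Hello"; // U+2606 (the star) encoded as UTF-8
$bad  = "\xE9t\xE9";          // "été" encoded as ISO-8859-1, not UTF-8

var_dump(isValidUtf8($good)); // bool(true)
var_dump(isValidUtf8($bad));  // bool(false)
```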

The other alternative is to hint the encoding. This can be done in your case by modifying the document and adding a

<meta http-equiv="content-type" content="text/html; charset=utf-8">

which is a Content-Type specifying a charset. That is also best practice for HTML strings that are not available via a webserver (e.g. saved on disk or inside a string like in your example). The webserver normally sets that as the response header.

If you don't mind the warnings about the misplaced element, you can just add it in front of the string:

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$html);

Per the HTML 2.0 specs, elements that can only appear in the <head> section of a document will be placed there automatically. This is what happens here, too. The output (pretty-printed):

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <meta charset="utf-8">
    <title>Test!</title>
  </head>
  <body>
    <h1>☆ Hello ☆ World ☆</h1>    
  </body>
</html>
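If the warnings are a concern, they can be collected internally with libxml_use_internal_errors instead of being emitted. The following is a minimal sketch of the meta-hint approach, assuming an HTML fragment as input; loadUtf8Html is a hypothetical helper name and not part of the DOM extension:

```php
<?php
// Sketch: hint the encoding with a prepended meta element and read the
// heading text back. loadUtf8Html() is a hypothetical helper name.
function loadUtf8Html(string $html): DOMDocument
{
    $hint = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($hint . $html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = loadUtf8Html('<h1>☆ Hello ☆ World ☆</h1>');
$h1  = $dom->getElementsByTagName('h1')->item(0);

echo $h1->textContent, "\n";
```

Note that the hint stays in the document as a meta element, so it may need to be removed again before serializing.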

Solution 2

There's a faster fix for that: after loading your HTML document into DOMDocument, you just set (or, better said, reset) the original encoding. Here's some sample code:

$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

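// Remove the processing instruction that only served as an encoding hint.
// Iterate over a copy, because childNodes is a live list.
foreach (iterator_to_array($dom->childNodes) as $item) {
    if ($item->nodeType === XML_PI_NODE) {
        $dom->removeChild($item);
    }
}
$dom->encoding = 'UTF-8'; // reset original encoding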

Solution 3

<?php
  header("Content-type: text/html; charset=utf-8");
  $html = <<<HTML
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Test!</title>
</head>
<body>
    <h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
HTML;

  $html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
  $dom = new DOMDocument("1.0", "utf-8");
  $dom->loadHTML($html);

  header("Content-Type: text/html; charset=utf-8");
  echo($dom->saveHTML());

Output:

<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Test!</title></head><body>
    <h1>&#9734; Hello &#9734; World &#9734;</h1>
</body></html>

Author: Greg

Updated on March 31, 2021

Comments

  • Greg
    Greg about 3 years

    The webserver is serving responses with utf-8 encoding, all files are saved with utf-8 encoding, and everything I know of setting has been set to utf-8 encoding.

    Here's a quick program, to test if the output works:

    <?php
    $html = <<<HTML
    <!doctype html>
    <html>
    <head>
        <meta charset="utf-8">
        <title>Test!</title>
    </head>
    <body>
        <h1>☆ Hello ☆ World ☆</h1>
    </body>
    </html>
    HTML;
    
    $dom = new DOMDocument("1.0", "utf-8");
    $dom->loadHTML($html);
    
    header("Content-Type: text/html; charset=utf-8");
    echo($dom->saveHTML());
    

    The output of the program is:

    <!DOCTYPE html>
    <html><head><meta charset="utf-8"><title>Test!</title></head><body>
        <h1>&acirc;&#152;&#134; Hello &acirc;&#152;&#134; World &acirc;&#152;&#134;</h1>
    </body></html>
    

    Which renders as:

    â˜† Hello â˜† World â˜†


    What could I be doing wrong? How much more specific do I have to be to tell the DOMDocument to handle utf-8 properly?

  • hakre
    hakre almost 12 years
    @powtac: This variant actually does not need that header line. All characters not part of US-ASCII are entities here. Any browser on earth will always display this properly unless you specify a (wrong) encoding not sharing US-ASCII. But just noting, it's not wrong either.
  • Aliweb
    Aliweb over 11 years
    @hakre : that was perfect ! you solved my serious problem and now I have no headaches!!
  • Nate
    Nate over 9 years
    +1 Great answer, but which method do you recommend -- using mb_convert_encoding() or prepending the meta tag in loadHTML()?
  • hakre
    hakre over 9 years
    @Nate: I would say it depends. I normally do not recommend mb_convert_encoding() but for this case I somehow do. However, that's a detail of personal preference. And it still depends on whether you want to do the conversion in its own step or you just want to smash that into DOMDocument::loadHTML(), which leaks the meta element into the document. I don't know, for example, what will happen if that element already existed. I have never tested that to a safe point, but it normally "just works" (tm). The different ways in the answer are more for explanation.
  • Moshe Shaham
    Moshe Shaham over 9 years
    For anyone using the alternative method, I suggest checking DeZeA's answer below. It worked better since it did not remove classes from the html tag.
  • Moshe Shaham
    Moshe Shaham over 9 years
    This worked better than hakre's version of adding the meta tag because adding the meta removed classes from the html tag
  • DeZeA
    DeZeA over 7 years
    Hmm, might be. I had the code in a txt with a bunch of useful snippets. I don't claim that's some original stuff, even though that's some pretty standard use of the DOMDocument class.