How to force XPath to use UTF8?
Solution 1
If it is a fully fledged, valid XHTML document, you shouldn't use loadHTML() but load()/loadXML().
Given the example XHTML document:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>xhtml test</title>
</head>
<body>
<h1>A Table</h1>
<table>
<tr><th>A</th><th>O</th><th>U</th></tr>
<tr><td>Ä</td><td>Ö</td><td>Ü</td></tr>
<tr><td>ä</td><td>ö</td><td>ü</td></tr>
</table>
</body>
</html>
the script
<?php
$raw2 = 'test.html';
$dom = new DOMDocument();
$dom->load($raw2); // XML parser, honours the encoding in the XML declaration
$xpath = new DOMXPath($dom);
// XHTML elements are namespaced, so the query needs a registered prefix
var_dump($xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml'));
$query = '//h:td/text()';
$nodes = $xpath->query($query);
foreach ($nodes as $node) {
    foo($node->wholeText);
}

// Print every byte of $s as two hex digits
function foo($s) {
    for ($i = 0; $i < strlen($s); $i++) {
        printf('%02X ', ord($s[$i]));
    }
    echo "\n";
}
prints
bool(true)
C3 84
C3 96
C3 9C
C3 A4
C3 B6
C3 BC
i.e. the output strings are UTF-8 encoded.
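Since the document in the question arrives as a POST body rather than as a file, the same approach can be sketched with loadXML() on a string. This is only a minimal sketch and not the answer's original code: the variable $xhtml and its inline sample markup are assumptions made for illustration, and the umlaut is written as a byte escape so the example does not depend on the encoding of the script file.

```php
<?php
// Hypothetical sample input; in the question this would come from the POST body.
$xhtml = '<?xml version="1.0" encoding="utf-8"?>'
    . '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
    . "<tr><td>\xC3\x84</td></tr>"   // "Ä" as UTF-8 bytes
    . '</table></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xhtml);               // XML parser, honours the declared encoding

$xpath = new DOMXPath($dom);
// XHTML elements live in a namespace, so the query needs a registered prefix.
$xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');

// Collect the hex bytes of every table cell's text.
$cells = [];
foreach ($xpath->query('//h:td/text()') as $node) {
    $cells[] = bin2hex($node->nodeValue);
}
print_r($cells);
```

If the hex dump shows c384 for the cell, the string came out of the DOM as UTF-8 and any remaining corruption happens later, at output time.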
Solution 2
I had the same problem and I couldn't use Tidy on my web server. I found this solution and it worked fine:
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
$dom = new DOMDocument();
$dom->loadHTML($html);
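Note that the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2. A commonly used alternative, which is not part of this answer, is to prepend a charset declaration so that the HTML parser knows the input is UTF-8. The sketch below assumes the input really is UTF-8; the sample fragment and the names $meta and $cells are made up for illustration.

```php
<?php
// Hypothetical fragment without any charset declaration, as in the question.
$html = "<table><tr><td>\xC3\x84</td></tr></table>"; // "Ä" as UTF-8 bytes

// Assumption: the input really is UTF-8. The prepended <meta> tells the parser so.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$dom = new DOMDocument();
$dom->loadHTML($meta . $html);

$xpath = new DOMXPath($dom);

// Collect the hex bytes of every table cell's text.
$cells = [];
foreach ($xpath->query('//td/text()') as $node) {
    $cells[] = bin2hex($node->nodeValue);
}
print_r($cells);
```

The prepended element ends up in the parsed document's head, so it may need to be removed again if the document is later serialized with saveHTML().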
Solution 3
A bit late in the game, but perhaps it helps someone...
The problem might be in the output, and not in the dom/xpath object itself.
If you output the nodeValue directly, you get corrupted characters, e.g.:
ìÂÂì ë¹Â디ì¤
ìì ë¹ë””ì¤ í°ì íì¤
You have to create your DOM object with the second parameter "utf-8", new \DOMDocument('1.0', 'utf-8'), but when you print a value from the DOM node list or element you still get broken characters:
echo $contentItem->item($index)->nodeValue;
You have to wrap it in utf8_decode():
echo utf8_decode($contentItem->item($index)->nodeValue);
//output: 者不終朝而會,愚者可浹旬而學
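To tell whether the DOM or the output stage is at fault, it can help to look at the raw bytes before converting anything. The sketch below uses a hypothetical helper, describeBytes(), which is not part of the answer, and assumes the mbstring extension is available. It also shows what utf8_decode() actually does: it converts UTF-8 to ISO-8859-1, which is the same conversion as the mb_convert_encoding() call below. Note that utf8_decode() is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper (not part of the original answer): report the raw bytes
// of a string and whether they form valid UTF-8.
function describeBytes(string $s): string
{
    $hex = strtoupper(implode(' ', str_split(bin2hex($s), 2)));
    $valid = mb_check_encoding($s, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
    return $hex . ' (' . $valid . ')';
}

$utf8 = "\xC3\x84"; // "Ä" encoded as UTF-8

// Same conversion that utf8_decode() performs: UTF-8 to ISO-8859-1.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo describeBytes($utf8), "\n";   // C3 84 (valid UTF-8)
echo describeBytes($latin1), "\n"; // C4 (not valid UTF-8)
```

If the node value is already valid UTF-8, declaring UTF-8 for the page that displays it is usually the less lossy fix, because ISO-8859-1 cannot represent characters outside Latin-1.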
Solution 4
I have not tried it, but the second parameter of DOMDocument::__construct
seems to be related to the encoding; maybe that will help you :-)
Otherwise, there is an encoding property in DOMDocument, which is writable.
Since the DOMXPath is constructed with the DOMDocument as its parameter, maybe it will work...
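For context, PHP's DOM extension hands strings back as UTF-8 regardless of the document's own encoding, so the encoding property mainly describes how the document is declared and serialized. The following is a small sketch of that behaviour; the sample document and the variable names $declared and $valueHex are assumptions made for illustration.

```php
<?php
// Hypothetical ISO-8859-1 document: "Ä" is the single byte C4 in the source.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><r>\xC4</r>";

$dom = new DOMDocument('1.0', 'utf-8'); // constructor value is replaced on load
$dom->loadXML($xml);

$declared = $dom->encoding;                            // taken from the XML declaration
$valueHex = bin2hex($dom->documentElement->nodeValue); // string as PHP sees it

echo $declared, "\n";
echo $valueHex, "\n";
```

The declared encoding stays ISO-8859-1, while the node value is returned as the two UTF-8 bytes c384.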
Comments
-
Gordon over 1 year
I have an XHTML document being passed to a PHP app via Greasemonkey AJAX. The PHP app uses UTF8. If I output the POST content straight back to a textarea in the AJAX receiving div, everything is still properly encoded in UTF8.
When I try to parse using XPath
$dom = new DOMDocument();
$dom->loadHTML($raw2);
$xpath = new DOMXPath($dom);
$query = '//td/text()';
$nodes = $xpath->query($query);
foreach ($nodes as $node) {
    var_dump($node->wholeText);
}
the dumped strings are not UTF-8. How do I force DOM/XPath to use UTF-8?
-
Gordon over 14 years
The page I'm parsing didn't have <?xml ?>. Used Tidy to add that and my problem is solved.
-
Gordon over 14 years
$dom->encoding = 'utf8' had no effect, nor did setting the encoding in __construct(). Possibly due to using loadHTML(), but I don't know.
-
leticia over 11 years
loadHTML() overrides the encoding set in the constructor.
-
Nabil Kadimi about 10 years
+1'd. My only suggestion is to move the second line to the top; it was confusing (at least for me).
-
James Huckabone about 10 years
I have been struggling on and off with this for over a year. Thank you so much for this. I've tried countless things that didn't work: included special classes, headers, metas, php.ini's, xml utf-8 hacks, and many more, and nothing worked for my particular issue except this.
-
VolkerK over 9 years
That is correct. I maintain the strong opinion (weakly held): if it claims to be XHTML, don't try to fix it; they wanted the x in front, they have to deliver. ;-)
-
Bhargav Rao over 7 years
Please don't add the same answer to multiple questions. Answer the best one and flag the rest as duplicates. See meta.stackexchange.com/questions/104227/…
-
user658182 over 6 years
This link is no longer valid. Can you update it or paste the solution from that page here?