PHP DOMDocument - get html source of BODY
Solution 1
In your case, you do not want to work with a full HTML document but with an HTML fragment (a portion of HTML code), which means DOMDocument is not quite what you need.
Instead, I would rather use something like HTMLPurifier (quoting):
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
And if you try your portion of code:
<div><p>Hello World
using the demo page of HTMLPurifier, you get this clean HTML as output:
<div><p>Hello World</p></div>
Much better, isn't it? ;-)
(Note that HTMLPurifier supports a wide range of options, and that taking a look at its documentation might not hurt.)
Solution 2
The quick solution to your problem is to use an XPath expression to grab the body.
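$dom = new DOMDocument();
// loadHTML wraps the fragment in <html><body> and closes the open tags
$dom->loadHTML('<div><p>Hello World');
$xpath = new DOMXPath($dom);
// Select the <body> element that the parser generated
$body = $xpath->query('/html/body');
// Serialize only that node (the output includes the <body> tags themselves)
echo $dom->saveXML($body->item(0));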
A word of warning here. Sometimes loadHTML will throw a warning when it encounters certain poorly formed HTML documents. If you're parsing that kind of HTML document, you'll need to find a better HTML parser [self link warning].
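If the warnings are the only obstacle, one possible variation on the snippet above is to let libxml collect parse errors internally and to serialize only the children of the body, so the wrapper tags are left out. This is a minimal sketch, not the original answerer's method: the function name normalizeFragment is hypothetical, and it assumes PHP 7 or later (DOMDocument::saveHTML has accepted a node argument since PHP 5.3.6).

```php
<?php
// Hypothetical helper (the name is an assumption, not part of DOMDocument).
function normalizeFragment(string $html): string
{
    // loadHTML rejects empty input, so return early for blank strings.
    if (trim($html) === '') {
        return '';
    }

    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser adds <html> and <body> around the fragment.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body>, leaving out the wrapper tags.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo normalizeFragment('<div><p>Hello World');
```

Silencing the errors does not make the parser any more tolerant; it only keeps the warnings out of the output, so badly broken input may still be repaired in unexpected ways.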
Solution 3
Faced with the same problem, I've created a wrapper around DOMDocument called SmartDOMDocument to overcome this and some other shortcomings (such as encoding problems).
You can find it here: http://beerpla.net/projects/smartdomdocument
Comments
leepowers almost 2 years
I'm using PHP's DOMDocument to parse and normalize user-submitted HTML, using the loadHTML method to parse the content and then getting a well-formed result via saveHTML:
$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$well_formed = $dom->saveHTML();
echo($well_formed);
This does a beautiful job of parsing the fragment and adding the appropriate closing tags. The problem is that I'm also getting a bunch of tags I don't want, such as <!DOCTYPE>, <html>, <head> and <body>. I understand that every well-formed HTML document needs these tags, but the HTML fragment I'm normalizing is going to be inserted into an existing valid document.