Matching everything between HTML <body> tags using PHP
Solution 1
You should not use regular expressions to parse HTML.
Your particular problem in this case is that the pattern needs the s (PCRE_DOTALL) modifier, so that the dot also matches newlines.
preg_match('/<body>(.*)<\/body>/s', $content, $matches);
But seriously, use an HTML parser instead. There are so many ways that the above regular expression can break.
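To make both points concrete, here is a small sketch of how the pattern behaves. The sample strings are made up for illustration; they are not the asker's actual data.

```php
<?php
// Hypothetical sample input: the body content spans several lines.
$content = "<body>\n<p>Hello</p>\n</body>";

// Without the s modifier, "." stops at newlines, so the pattern cannot match.
$withoutS = preg_match('/<body>(.*)<\/body>/', $content, $m1);

// With the s modifier, "." also matches newlines and the capture succeeds.
$withS = preg_match('/<body>(.*)<\/body>/s', $content, $m2);

// One way the regex breaks: an attribute on the tag defeats the literal "<body>".
$withAttr = preg_match('/<body>(.*)<\/body>/s', '<body class="x"><p>Hi</p></body>', $m3);

var_dump($withoutS, $withS, $withAttr);
```

Only the second call finds a match. The third fails even with the s modifier, which is the kind of breakage an HTML parser avoids.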
Solution 2
Don't try to process HTML with regular expressions! Use PHP's built-in parser instead:
$dom = new DOMDocument;
$dom->loadHTML($string);
$bodies = $dom->getElementsByTagName('body');
assert($bodies->length === 1);
$body = $bodies->item(0);
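// Remove every child of <body>. Looping on firstChild avoids skipping
// nodes, which happens when indexing into the live childNodes list.
while ($body->firstChild) {
    $body->removeChild($body->firstChild);
}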
$string = $dom->saveHTML();
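If the goal is to get the markup inside <body> as a string, as the question asks, one possible approach is to serialize each child node of the body with DOMDocument::saveHTML(). This is a minimal sketch, not the answerer's code; $html and $inner are hypothetical names, and the sample markup is adapted from the question.

```php
<?php
// Hypothetical input standing in for the variable that holds the page.
$html = '<html><body><p><span class="c-sc">dgdfgdf</span></p></body></html>';

$dom = new DOMDocument;
// Collect libxml parse warnings internally instead of emitting them,
// since real-world markup is rarely perfectly valid.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$body = $dom->getElementsByTagName('body')->item(0);

// Serialize each child of <body> and join the pieces into one string.
$inner = '';
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}

echo $inner;
```

Passing a node to saveHTML() serializes only that subtree, so the surrounding <html> and <body> wrappers are left out of $inner.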
Solution 3
If for some reason you don't have DOMDocument installed, try this:
Step 1. Download simple_html_dom
Step 2. Read the documentation about how to use its selectors
require_once("simple_html_dom.php");
$doc = new simple_html_dom();
$doc->load($someHtmlString);
// find() returns an array unless an index is given, so ask for the first match.
$body = $doc->find("body", 0)->innertext;
Elitmiar
Updated on September 16, 2022
Comments
-
Elitmiar over 1 year
I have a script that returns the following in a variable called $content
<body> <p><span class=\"c-sc\">dgdfgdf</span></p> </body>
However, I need to place everything between the body tags in an array called $matches.
I do the following to match the content between the body tags:
preg_match('/<body>(.*)<\/body>/',$content,$matches);
but the $matches array is empty. How can I get it to return everything inside the body tag?
-
Justin Johnson about 11 years
FYI: Back then it had a memory leak and would kill the entire request when dealing with large pages. Hopefully it's fixed by now.
-
Simon about 10 years
I know it's an old question and answer, but this is a much better answer than the accepted solution.