Matching everything between HTML <body> tags using PHP


Solution 1

You should not use regular expressions to parse HTML.

Your particular problem here is that the dot does not match newlines by default. Add the s (DOTALL) modifier so that it does.

preg_match('/<body>(.*)<\/body>/s', $content, $matches);
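To see the effect of the modifier, here is a small sketch that runs the same pattern with and without s. The sample markup is the one from the question; the variable names $withoutS, $withS, $m1 and $m2 are only illustrative.

```php
<?php
// Sample input from the question: the body content spans several lines.
$content = "<body>\n<p><span class=\"c-sc\">dgdfgdf</span></p>\n</body>";

// Without the s modifier the dot stops at "\n", so the pattern cannot span lines.
$withoutS = preg_match('/<body>(.*)<\/body>/', $content, $m1);

// With the s modifier (PCRE_DOTALL) the dot also matches newlines.
$withS = preg_match('/<body>(.*)<\/body>/s', $content, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)

// Group 1 still contains the surrounding newlines, so trim it before use.
echo trim($m2[1]), "\n";
```

Note that the captured group keeps the line breaks that sit right after <body> and right before </body>.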

But seriously, use an HTML parser instead. There are many ways the above regular expression can break: a <body> tag with attributes or an uppercase <BODY> tag, for example, will not match.
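As a rough illustration of how fragile the pattern is, the sketch below feeds it three hypothetical inputs (made up for this example, not taken from the question). Browsers accept all three, but only the plain one matches.

```php
<?php
$pattern = '/<body>(.*)<\/body>/s';

// Hypothetical inputs; each is markup a browser would accept.
$samples = [
    'attribute' => '<body class="home"><p>Hi</p></body>',
    'uppercase' => '<BODY><p>Hi</p></BODY>',
    'plain'     => '<body><p>Hi</p></body>',
];

$results = [];
foreach ($samples as $name => $html) {
    // preg_match() returns 1 on a match and 0 when nothing matched.
    $results[$name] = preg_match($pattern, $html);
    printf("%-10s %s\n", $name, $results[$name] ? 'matched' : 'no match');
}
```

The pattern could be patched for each of these cases, but every patch makes it longer and still leaves other cases uncovered, which is the argument for a parser.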

Solution 2

Don't try to process HTML with regular expressions! Use PHP's built-in parser instead:

$dom = new DOMDocument;
$dom->loadHTML($string);
$bodies = $dom->getElementsByTagName('body');
assert($bodies->length === 1);
$body = $bodies->item(0);
// Serialize each child of <body>; together they form its inner HTML.
$inner = '';
foreach ($body->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}
$string = $inner;
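For reuse, the DOMDocument approach can be wrapped in a small function. This is only a sketch under a few assumptions: the dom extension is enabled, the helper name bodyInnerHtml is made up for this example, and parser warnings about imperfect markup are silenced rather than handled.

```php
<?php
// Hypothetical helper: returns the markup between <body> and </body>.
function bodyInnerHtml(string $html): string
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $inner = '';
    foreach ($body->childNodes as $child) {
        // saveHTML($node) serializes just that node and its descendants.
        $inner .= $dom->saveHTML($child);
    }
    return $inner;
}

// Sample input from the question, stored in an array as the asker wanted.
$content = "<body>\n<p><span class=\"c-sc\">dgdfgdf</span></p>\n</body>";
$matches = [bodyInnerHtml($content)];
echo trim($matches[0]), "\n";
```

Unlike the regular expression, this should also cope with attributes on the body tag and with uppercase tag names, because the parser normalizes both.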

Solution 3

If for some reason you don't have DOMDocument installed, try this:

Step 1. Download simple_html_dom

Step 2. Read the documentation about how to use its selectors

require_once("simple_html_dom.php");
$doc = new simple_html_dom();
$doc->load($someHtmlString);
// Passing an index makes find() return a single element instead of an array.
$body = $doc->find("body", 0)->innertext;
Author by

Elitmiar

Updated on September 16, 2022

Comments

  • Elitmiar
    Elitmiar over 1 year

    I have a script that returns the following in a variable called $content

    <body>
    <p><span class=\"c-sc\">dgdfgdf</span></p>
    </body>
    

    However, I need to place everything between the body tags inside an array called $matches.

    I do the following to match the content between the body tags:

    preg_match('/<body>(.*)<\/body>/',$content,$matches);
    

    But the $matches array is empty. How can I get it to return everything inside the body tag?

  • Justin Johnson
    Justin Johnson about 11 years
    FYI: Back then it had a memory leak and would kill the entire request when dealing with large pages. Hopefully it's fixed by now.
  • Simon
    Simon about 10 years
    I know it's an old question and answer, but this is a much better answer than the accepted solution.