How to find all elements with PHP Simple HTML DOM Parser?
Solution 1
/**
 * Refine the input HTML (string) and keep only the specified tags
 *
 * @param string $string  Input HTML
 * @param array  $allowed Tag names to keep
 * @return bool|simple_html_dom
 */
function crl_parse_html($string, $allowed = array())
{
    // String --> DOM elements
    $string = str_get_html($string);
    // Visit every element in the document, one by one
    foreach ($string->find('*') as $child) {
        if (
            // Current inner text contains one or more elements
            preg_match('/<[^<]+?>/is', $child->innertext) and
            // Current element's tag is in the allowed array
            in_array($child->tag, $allowed)
        ) {
            // Replace the inner text with its filtered version
            $child->innertext = crl_parse_html($child->innertext, $allowed);
        } else if (
            // Current inner text contains one or more elements
            preg_match('/<[^<]+?>/is', $child->innertext) and
            // Current element's tag is NOT in the allowed array
            !in_array($child->tag, $allowed)
        ) {
            // Reduce the inner text to its inner elements (if any), dropping leading plain text
            $child->innertext = preg_replace('/(?<=^|>)[^><]+?(?=<|$)(<[^\/]+?>.+)/is', '$1', $child->innertext);
            // Replace the whole element with its filtered inner text
            $child->outertext = crl_parse_html($child->innertext, $allowed);
        } else if (
            (
                // Current inner text is plain text only
                preg_match('/(?<=^|>)[^><]+?(?=<|$)/is', $child->innertext) and
                // Current element's tag is NOT in the allowed array
                !in_array($child->tag, $allowed)
            ) or
            // Current plain text is empty
            trim($child->plaintext) == ''
        ) {
            // Remove the element entirely
            $child->outertext = '';
        }
    }
    return $string;
}
This is my own solution. I am posting it here in case someone needs it, and to close out this question.
Note: this function is recursive, so very large or deeply nested input can be a problem. Think carefully before deciding to use it.
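If the recursion is a concern, one possible alternative is a single pass over PHP's built-in DOM extension instead of Simple HTML DOM. The sketch below is not the function above rewritten; it swaps in DOMDocument, assumes the dom extension is enabled, and uses a hypothetical helper name, keep_allowed_tags. Unlike crl_parse_html, it does not strip empty elements, and it makes no attempt at character-encoding handling.

```php
<?php
// Hypothetical helper, not part of Simple HTML DOM: keep only elements whose
// tag is in $allowed. A disallowed element is removed together with its own
// text, and its element children are moved up to take its place.
function keep_allowed_tags(string $html, array $allowed): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings quietly
    // Assumption: wrap the fragment in a single root so it is easy to find again.
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    $root = $doc->documentElement;

    // Snapshot the live node list first, because the tree is edited below.
    $elements = [];
    foreach ($root->getElementsByTagName('*') as $element) {
        $elements[] = $element;
    }

    foreach ($elements as $element) {
        if (in_array(strtolower($element->tagName), $allowed, true)) {
            continue; // allowed tag: leave it where it is
        }
        $parent = $element->parentNode;
        // Snapshot the children too, since moving them changes childNodes.
        $children = [];
        foreach ($element->childNodes as $child) {
            $children[] = $child;
        }
        foreach ($children as $child) {
            if ($child instanceof DOMElement) {
                $parent->insertBefore($child, $element); // hoist nested element
            }
        }
        $parent->removeChild($element); // drops the tag and its direct text
    }

    // Serialize everything inside the wrapper, but not the wrapper itself.
    $result = '';
    foreach ($root->childNodes as $node) {
        $result .= $doc->saveHTML($node);
    }
    return $result;
}

$input = '<div> <span> <div>World!</div> <div> <span>Hello!</span> '
       . '<span> <div>Hello World!</div> </span> </div> </span> </div>';
echo keep_allowed_tags($input, array('div'));
```

Hoisted elements are themselves checked later in the same loop, so a disallowed tag nested inside another disallowed tag is still removed. Whitespace in the result may differ from the input.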
Solution 2
Your example appears to work fine. Try the following, which outputs the innertext of every element.
foreach ($html->find('*') as $test)
    echo $test->innertext;
For example:
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
Outputs
HelloWorld
Comments
-
Manhhailua almost 2 years
// Find all elements that have an id attribute
$ret = $html->find('*[id]');
This is an example of finding all elements that have an id attribute. Is there any way to find all elements? I tried the following, but it does not work:
// Find all elements
$ret = $html->find('*');
Additional:
I want to iterate over every element in $html, parents and children alike. Example:
<div> <span> <div>World!</div> <div> <span>Hello!</span> <span> <div>Hello World!</div> </span> </div> </span> </div>
Now I want to remove every <span> together with the plain text directly inside it, and keep all the <div> elements. Expected result:
<div> <div>World!</div> <div> <div>Hello World!</div> </div> </div>
-
Manhhailua over 10 years
What if $html is the following?
<div id="hello">Hello</div><div id="world">World<div>mama</div></div>
I mean I want to go through every element of $html, from parents to children.
-
Pez Cuckow over 10 years
That's not how accessing the DOM works; see my edit. Can you provide some HTML and your expected output? You'll need to access the DOM tree using methods such as
$html->children()
-
Manhhailua over 10 years
I've added some details to the main question; you can take a look at it.
-
Elias over 9 years
Explaining what your function does step by step could help future S.O. members.