PHP Simple HTML DOM Parser find string

28,472

Solution 1

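// $html is assumed to hold the page markup
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$d = new DOMDocument();
$d->loadHTML($html);                // loadHTML() copes with sloppy HTML; loadXML() does not
libxml_clear_errors();
$x = new DOMXPath($d);
// ids of all ancestors of the matching text node, in document order
$result = $x->evaluate("//text()[contains(.,'617.99')]/ancestor::*/@id");
$unique = null;
// start from the last id (the innermost ancestor for a single match) and walk outwards
for ($i = $result->length - 1; $i >= 0; $i--) {
    $item = $result->item($i);
    // note: addslashes() is not real XPath escaping; ids containing quotes need separate handling
    if ($x->query("//*[@id='".addslashes($item->value)."']")->length == 1) {
        echo 'Unique ID is '.$item->value."\n";
        $unique = $item->value;
        break;
    }
}
if (is_null($unique)) echo 'no unique ID found';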
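
The uniqueness check above builds an XPath string by hand. XPath 1.0 has no escape character inside string literals, so an id containing a quote cannot be made safe with addslashes(). The following is a minimal sketch of one common workaround, assuming two helper names invented here for illustration (xpath_literal and id_is_unique); neither is part of DOMXPath.

```php
<?php
// Hypothetical helper: turn any PHP string into a valid XPath 1.0 string literal.
function xpath_literal(string $value): string
{
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";          // no apostrophe: single quotes are safe
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';          // no double quote: double quotes are safe
    }
    // Both quote types present: split on apostrophes and glue the pieces
    // back together with concat(), quoting each apostrophe as "'".
    $parts = explode("'", $value);
    return "concat('" . implode("', \"'\", '", $parts) . "')";
}

// Hypothetical helper: true when exactly one element carries the given id.
function id_is_unique(DOMXPath $xpath, string $id): bool
{
    return $xpath->query('//*[@id=' . xpath_literal($id) . ']')->length === 1;
}

// Small demonstration document (made up for this example).
$doc = new DOMDocument();
$doc->loadXML('<r><a id="x\'y"/><b id="dup"/><c id="dup"/></r>');
$xp = new DOMXPath($doc);

var_dump(id_is_unique($xp, "x'y")); // bool(true)
var_dump(id_is_unique($xp, 'dup')); // bool(false)
```

In the loop above, the query could then be written as $x->query('//*[@id=' . xpath_literal($item->value) . ']').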

Solution 2

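include_once('simple_html_dom.php');

$html = file_get_html('http://www.google.com/');

// Check every element. Ancestors of a match also contain the string,
// so more than one id may be printed.
$eles = $html->find('*');
foreach ($eles as $e) {
    if (strpos($e->innertext, 'theString') !== false) {
        echo $e->id . "\n";
    }
}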

http://simplehtmldom.sourceforge.net/manual.htm

Solution 3

Just imagine that every tag has a "plaintext" attribute and use standard attribute selectors.

So, HTML:

<div id="div1">
  <span>London is the capital</span> of Great Britain
</div>
<div id="div2">
  <span>Washington is the capital</span> of the USA
</div>

can be pictured as:

<div id="div1" plaintext="London is the capital  of Great Britain">
  <span plaintext="London is the capital ">London is the capital</span> of Great Britain
</div>
<div id="div2" plaintext="Washington is the capital  of the USA">
  <span plaintext="Washington is the capital ">Washington is the capital</span> of the USA
</div>

The PHP that solves your task is then simply:

<?php
  $t = '
    <div id="div1">
      <span>London is the capital</span> of Great Britain
    </div>
    <div id="div2">
      <span>Washington is the capital</span> of the USA
    </div>';
  $html = str_get_html($t);
  $foo = $html->find('span[plaintext^=London]');
  echo "ID: " . $foo[0]->parent()->id; // div1
?>

(keep in mind that "plaintext" for <span> tags is right-padded with a space; this is the default behaviour of Simple HTML DOM, defined by the constant DEFAULT_SPAN_TEXT)

Solution 4

Got the answer. The entire example is a little long but it works. I also show the output.

The HTML for what we are going to look at:

<html>
<head>
<title>Simple HTML DOM - Find Text</title>
</head>
<body>
<h3>Simple HTML DOM - Find Text</h3>
<div id="first">
 <p>This is a paragraph inside of div 'first'.
   This paragraph does not have the text we are looking for.</p>
 <p>As a matter of fact this div does not have the text we are looking for</p>
</div>
<div id="second">
 <ul>
  <li>This is an unordered list.
  <li id="love1">We are looking for the following word love.
  <li>Does not contain the word.
 </ul>
 <p id="love2">This paragraph which is in div second contains the word love.</p>
</div>
<div id="third">
 <a id="love3" href="goes.nowhere.com">link to love site</a>
</div>
</body>
</html>

The PHP:

<?php
include_once('simple_html_dom.php');

function scraping_for_text($iUrl,$iText)
{
    echo "iUrl=".$iUrl."<br />";
    echo "iText=".$iText."<br />";

    // create HTML DOM
    $html = file_get_html($iUrl);

    // collect the text nodes that actually contain the search string
    $aMatches = array();
    foreach ($html->find('text') as $key=>$oText)
    {
      if (strpos($oText->plaintext,$iText) !== FALSE)
      {
         $aMatches[$key] = $oText;
      }
    }

    if (count($aMatches) > 0)
    {
       echo "<h4>Found ".$iText."</h4>";
    }
    else
    {
       echo "<h4>No ".$iText." found"."</h4>";
    }

    // report each match with the tag and id of its parent element
    foreach ($aMatches as $key=>$oLove)
    {
      echo $key.": text=".$oLove->plaintext."<br />"
           ."--- parent tag=".$oLove->parent()->tag."<br />"
           ."--- parent id=".$oLove->parent()->id."<br />";
    }

    // clean up memory
    $html->clear();
    unset($html);
}

// -------------------------------------------------------------
// test it!

// user_agent header...
ini_set('user_agent', 'My-Application/2.5');

scraping_for_text("test_text.htm","love");
?>

The output:

iUrl=test_text.htm
iText=love
Found love
18: text=We are looking for the following word love.
--- parent tag=li
--- parent id=love1
21: text=This paragraph which is in div second contains the word love.
--- parent tag=p
--- parent id=love2
25: text=link to love site
--- parent tag=a
--- parent id=love3

That's all they wrote!!!!
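
For comparison, the same text-to-parent lookup can be sketched with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. This swaps in a different library, so treat it as an alternative under stated assumptions rather than part of the answer above. The function name find_text_parents and the sample markup are made up for illustration. The 'path' entry uses DOMNode::getNodePath(), which may help when the parent element has no id at all.

```php
<?php
// Hypothetical helper: for every text node containing $needle, report the
// parent element's tag name, its id (empty string if none) and its node path.
function find_text_parents(string $markup, string $needle): array
{
    // Collect parser warnings internally so sloppy HTML does not spam the output.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $matches = [];
    // The substring test is done in PHP, so $needle never has to be
    // quoted inside the XPath expression.
    foreach ($xpath->query('//body//text()') as $text) {
        if (strpos($text->nodeValue, $needle) === false) {
            continue;
        }
        $parent = $text->parentNode;
        $matches[] = [
            'tag'  => $parent->nodeName,
            'id'   => $parent instanceof DOMElement ? $parent->getAttribute('id') : '',
            'path' => $parent->getNodePath(),
        ];
    }
    return $matches;
}

// Made-up sample: one match has an id, the other sits in a table cell without one.
$sample = '<html><body><div id="second"><p id="love2">contains the word love.</p>'
        . '<table><tr><td>no id here, but love</td></tr></table></div></body></html>';

foreach (find_text_parents($sample, 'love') as $m) {
    echo $m['tag'], ' id="', $m['id'], '" path=', $m['path'], "\n";
}
```

The exact path strings depend on how the underlying libxml parser builds the tree, so they are best treated as locators for the parsed document rather than for the raw source.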

Author: Charlie

Updated on July 17, 2022

Comments

  • Charlie
    Charlie almost 2 years

    I am using PHP simple DOM parser but it does not seem to have the functionality to search for text. I need to search for a string and find the parent id for it. Essentially the reverse of normal usage.

    Anyone know how?

  • drudge
    drudge about 13 years
    This is PHP's DOMDocument, not the SimpleHTMLDom Library as the OP stated he was using.
  • Wrikken
    Wrikken about 13 years
    Ack, missed that. Still can't get my head around people using that slow, slow thingamajig, but you're right, this isn't the answer the OP is looking for then.
  • Charlie
    Charlie about 13 years
    I tried this answer... but DOMDocument spat out lots of errors... It seems very picky about the HTML... but you are right, Simple HTML DOM is a real memory hog. Is there any way to get it to play better with poorly formatted HTML?
  • Wrikken
    Wrikken about 13 years
    Sure there is: before loading, set $d->recover = true;$d->strictErrorChecking = false;, and of course, use loadHTML() instead of loadXML() for HTML. If you still get too many errors that you cannot ignore (never display errors on production sites), you could set libxml_use_internal_errors(true); to handle them separately from other PHP errors.
  • karim79
    karim79 about 13 years
    $e->id is the Simple DOM way to get the ID attribute. Perhaps try changing $eles = $html->find('*'); to $eles = $html->find('p, div'); or something.
  • Charlie
    Charlie about 13 years
    is it not getAttribute('id') ... I can't get it to work regardless :S
  • Wrikken
    Wrikken about 13 years
    Ack, wrapper is not what we want :). My bad, my XPath is a bit rusty, try //text()[contains(.,'617.99')]/parent::*/@id, seems to work here.
  • Charlie
    Charlie about 13 years
    works a treat... except for the warnings... is there anyway to check if that id is unique?
  • Wrikken
    Wrikken about 13 years
    Warnings can be disabled by either prepending @ (@$d->loadHTML($html);), which is kinda evil, or using libxml_use_internal_errors(true);$d->loadHTML($html);libxml_clear_errors(); (preferred IMHO). An id should be unique, but we all know it's sometimes not. You can check with $x->query("//*[@id='theid']")->length == 1 (for priceIncTaxSpan3047 it is, but look at the 50 Table_01's, no wonder DOMDocument protests :)
  • Charlie
    Charlie about 13 years
    What I am looking to achieve is: if it is not unique, then it finds the parent's id too, and it keeps doing that until it finds a unique selector.... this XPath code is complex! Can you give me one last bit of guidance :)
  • Wrikken
    Wrikken about 13 years
    Well, just this once :P Edited my answer. I think it can be solved without a loop in 1 XPath query, but that gets a bit out of scope, and is probably best served with a separate question with the proper XPath tags, so you don't have to rely on rusty ol' me :P
  • Tom
    Tom over 11 years
    Great example. Would you know how to go from text, back to an element? I want to search by text and then find the nearest element. It's from an old table layout without any classes or IDs.
  • electroid
    electroid almost 9 years
    so far the best answer