DOMDocument in PHP


Solution 1

If you want to work with DOM you have to understand the concept. Everything in a DOM Document, including the DOMDocument, is a Node.

The DOMDocument is a hierarchical tree structure of nodes. It starts with a root node. That root node can have child nodes and all these child nodes can have child nodes on their own. Basically everything in a DOMDocument is a node type of some sort, be it elements, attributes or text content.

          HTML                               Legend: 
         /    \                              UPPERCASE = DOMElement
       HEAD  BODY                            lowercase = DOMAttr
      /          \                           "Quoted"  = DOMText
    TITLE        DIV - class - "header"
     |             \
"The Title"        H1
                    |
           "Welcome to Nodeville"

The diagram above shows a DOMDocument with some nodes. There is a root element (HTML) with two children (HEAD and BODY). The connecting lines are called axes. If you follow down the axis to the TITLE element, you will see that it has one DOMText leaf. This is important because it illustrates an often overlooked thing:

<title>The Title</title>

is not one, but two nodes. A DOMElement with a DOMText child. Likewise, this

<div class="header">

is really three nodes: the DOMElement with a DOMAttr holding a DOMText. Because all these inherit their properties and methods from DOMNode, it is essential to familiarize yourself with the DOMNode class.
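The node counts above can be checked directly. This is a minimal sketch, assuming markup that mirrors the diagram; the `$html` string and the variable names are made up for illustration.

```php
<?php
// Illustrative markup mirroring the diagram; not taken from any real page.
$html = '<html><head><title>The Title</title></head>'
      . '<body><div class="header"><h1>Welcome to Nodeville</h1></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// <title>The Title</title>: a DOMElement with one DOMText child.
$title     = $dom->getElementsByTagName('title')->item(0);
$titleText = $title->firstChild;

// <div class="header">: a DOMElement with a DOMAttr, which holds a DOMText.
$div       = $dom->getElementsByTagName('div')->item(0);
$classAttr = $div->getAttributeNode('class');
$classText = $classAttr->firstChild;

// Print the class of each node along the chain, then the text it ends in.
echo get_class($title), ' > ', get_class($titleText), ': ',
     $titleText->nodeValue, PHP_EOL;
echo get_class($div), ' > ', get_class($classAttr), ' > ', get_class($classText), ': ',
     $classText->nodeValue, PHP_EOL;
```

All of these classes extend DOMNode, which is why properties such as firstChild and nodeValue are available on each of them.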

In practice, this means the DIV you fetched is linked to all the other nodes in the document. You can go all the way up to the root element or down to the leaves at any time. It's all there. You just have to query or traverse the document for the information you want.
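To make the "up to the root, down to the leaves" idea concrete, here is a small sketch using the same made-up markup as the diagram. The helper name `collectTextLeaves` is hypothetical and not part of the DOM extension.

```php
<?php
// Illustrative markup mirroring the diagram; not taken from any real page.
$html = '<html><head><title>The Title</title></head>'
      . '<body><div class="header"><h1>Welcome to Nodeville</h1></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$div = $dom->getElementsByTagName('div')->item(0);

// Walk up: follow parentNode until we leave the element tree
// (the parent of the root element is the DOMDocument itself).
$path = [];
for ($node = $div; $node instanceof DOMElement; $node = $node->parentNode) {
    array_unshift($path, $node->nodeName);
}
echo implode('/', $path), PHP_EOL;

// Walk down: recursively gather the values of all DOMText leaves.
// collectTextLeaves() is a hypothetical helper written for this sketch.
function collectTextLeaves(DOMNode $node): array
{
    if ($node instanceof DOMText) {
        return [$node->nodeValue];
    }
    if (!$node->hasChildNodes()) {
        return [];
    }
    $leaves = [];
    foreach ($node->childNodes as $child) {
        $leaves = array_merge($leaves, collectTextLeaves($child));
    }
    return $leaves;
}

// Start from the root element, reached from the DIV via its owner document.
$leaves = collectTextLeaves($div->ownerDocument->documentElement);
print_r($leaves);
```

Starting from the DIV alone, the sketch reaches both the root element and the text under TITLE, which lives in a different branch of the tree.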

Whether you do that by iterating the childNodes of the DIV, by using getElementsByTagName() or with XPath is up to you. You just have to understand that you are not working with raw HTML, but with nodes representing that entire HTML document.
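The three approaches can be compared side by side. The sketch below looks up the H1 inside the DIV each way, again on made-up markup; the variable names are assumptions for illustration.

```php
<?php
// Illustrative markup mirroring the diagram; not taken from any real page.
$html = '<html><head><title>The Title</title></head>'
      . '<body><div class="header"><h1>Welcome to Nodeville</h1></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$div = $dom->getElementsByTagName('div')->item(0);

// 1. Iterate the direct children of the DIV.
$viaChildren = null;
foreach ($div->childNodes as $child) {
    if ($child instanceof DOMElement && $child->nodeName === 'h1') {
        $viaChildren = $child;
        break;
    }
}

// 2. Search all descendants of the DIV by tag name.
$viaTagName = $div->getElementsByTagName('h1')->item(0);

// 3. Run an XPath query relative to the DIV (second argument = context node).
$xpath    = new DOMXPath($dom);
$viaXPath = $xpath->query('.//h1', $div)->item(0);

// All three variables refer to the same node object in the tree.
var_dump($viaChildren->isSameNode($viaTagName));
var_dump($viaTagName->isSameNode($viaXPath));
echo $viaXPath->nodeValue, PHP_EOL;
```

Iterating childNodes only sees direct children, while getElementsByTagName() and the `.//h1` query search the whole subtree, so the choice mostly depends on how deep the wanted node may be.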

If you need help with extracting specific information from the document, you need to clarify what information you want to fetch from it. For instance, you could ask how to fetch all the links from the table and then we could answer something like:

$div = $dom->getElementById('showContent');
foreach ($div->getElementsByTagName('a') as $link) 
{
    echo $dom->saveXML($link);
}

But unless you are more specific, we can only guess which nodes might be relevant.
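One such guess, going by the output the question asks for, is that the rows of the outer table should be printed with their markup intact. The sketch below assumes a trimmed-down version of the question's HTML and uses saveHTML() with a node argument (available since PHP 5.3.6) in place of reading nodeValue.

```php
<?php
// Simplified stand-in for the markup in the question; not the real page.
$html = '<div id="showContent"><table>'
      . '<tr><td>Crap</td></tr>'
      . '<tr><td><a href="link">title</a></td></tr>'
      . '</table></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select only the rows that are direct children of the outer table.
$rows = $xpath->query('//div[@id="showContent"]/table/tr');

$serialized = [];
foreach ($rows as $row) {
    // nodeValue would return text only; saveHTML($node) keeps the tags.
    $serialized[] = $dom->saveHTML($row);
}

echo implode(PHP_EOL, $serialized), PHP_EOL;
```

The path `/table/tr` deliberately skips rows of nested tables. If the source markup wraps its rows in a tbody, the query would need to account for that, for example with `/table//tr`.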

If you need more examples and code snippets on how to work with DOM, browse through my previous answers to related questions.

By now, there should be a snippet for every basic to medium use case you might have with DOM.

Solution 2

To create a parser you can use htmlDOM.

It is a very simple, easy-to-use DOM parser written in PHP. With it you can easily fetch the contents of a div tag.

For example, find all div tags which have an id attribute with a value of text.

$ret = $html->find('div[id=text]');

Author: Saikios

Updated on July 09, 2022

Comments

  • Saikios
    Saikios almost 2 years

    I have just started reading documentation and examples about DOM, in order to crawl and parse the document.

    For example I have part of document shown below:

        <div id="showContent">
        <table>
        <tr>
            <td>
             Crap
            </td>
        </tr>
    <tr>
              <td width="172" valign="top"><a href="link"><img height="91" border="0" width="172" class="" src="img"></a></td>
              <td width="10">&nbsp;</td>
              <td valign="top"><table cellspacing="0" cellpadding="0" border="0">
                  <tbody><tr>
                    <td height="30"><a class="px11" href="link">title</a><a><br>
                        <span class="px10"></span>
                    </a></td>
                  </tr>
                  <tr>
                    <td><img height="1" width="580" src="crap"></td>
                  </tr>
                  <tr>
                    <td align="right">
                        <a href="link"><img height="16" border="0" width="65" src="/buy"></a>
                    </td>
                  </tr>
                  <tr>
                    <td valign="top" class="px10">
                        <p style="width: 500px;">description.</p>
                    </td>
                  </tr>
              </tbody></table></td>
            </tr>
        <tr>
            <td>
    Crap
            </td>
        </tr>
        <tr>
            <td>
             Crap
            </td>
        </tr>
        </table>
        </div>
    

    I'm trying to use the following code to get all the tr tags and analyze whether there is crap or information inside them:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    
    $xpath = new DOMXPath($dom);
    
    
    $tags = $xpath->query('.//div[@id="showContent"]');
    foreach ($tags as $tag) {
        $string="";
        $string=trim($tag->nodeValue);
        if(strlen($string)>3) {
            echo $string;
            echo '<br>';
        }
    }
    

    However, I'm getting just the stripped strings without the tags, for example:

    Crap
    
    Crap
    Title
    Description
    

    But I would like to get:

    <tr>
       <td>Crap</td>
    </tr>
    <tr>
       <a href="link">title</a>
    </tr>
    

    How can I keep the HTML nodes (tags)?

    • netcoder
      netcoder over 13 years
    • Gordon
      Gordon over 13 years
      Your XPath matches the div. To get the HTML you show, you'd have to use different XPath Query/Queries and then pass the results to echo $dom->saveXML($node). Please clarify what you are trying to get.
    • Gordon
      Gordon over 13 years
      @netcoder innerHTML is not required here at all.
    • Saikios
      Saikios over 13 years
      @netcoder thanks, for the link
    • Saikios
      Saikios over 13 years
      @Gordon, I'm trying to get the info from a page and display it in another one. The page lists a lot of information inside a table; some is relevant, some isn't. There are pictures, titles and descriptions, which I want, and styles and numbers that I don't care about. I want to get the HTML inside the div to analyze the relevance of the data. With my code I get all the strings, but I don't know whether each one came from a div inside a td, a raw td or something else (all the info is inside that big div).
  • Saikios
    Saikios over 13 years
    Thanks Gordon, I needed something like this to learn how DOM works, but I don't think I can use it to crawl the information I need, because the pages don't follow any standards and don't have classes, ids, or anything like that, just tables :( The info was useful anyway for learning how to use it =D
  • rdlowrey
    rdlowrey about 12 years
    +1 ... I've been looking for a map to Nodeville for the longest time!
  • metric152
    metric152 almost 10 years
    This did a much better job for me. I was working with a site that had really bad html. domdocument wasn't able to find the node I wanted. This library handles bad html far better.