Using XPath with PHP to parse HTML

34,980

Solution 1

My suggestion is to always use DOMDocument rather than SimpleXML for this. It has a much nicer interface, it makes tasks a lot more intuitive, and its HTML loader tolerates markup that is not well-formed XML, which SimpleXML does not.
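
To illustrate the difference, here is a minimal sketch that feeds the same markup to both parsers. The HTML string is a made-up sample (it is not taken from the forum page); its unclosed tags are acceptable HTML but not well-formed XML.

```php
<?php
// Hypothetical sample markup: the unclosed <br>, <td> and <tr> are fine in HTML
// but make the string invalid as XML.
$html = '<html><body><table><tr><td class="topicViews">42<br></table></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// SimpleXML uses the XML parser, so parsing fails and false is returned.
$xml = simplexml_load_string($html);
var_dump($xml === false);

// DOMDocument::loadHTML() uses libxml's HTML parser, which repairs the markup.
$dom = new DOMDocument();
$dom->loadHTML($html);
$cells = (new DOMXPath($dom))->query("//td[@class='topicViews']");
echo "Matching cells: ", $cells->length, "\n";

// Discard the collected warnings.
libxml_clear_errors();
```

On real pages the same thing tends to happen: a single unclosed tag is enough for SimpleXML to return false, while DOMDocument still builds a tree you can query.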

The following example loads the HTML into a DOMDocument object and queries the DOM using XPath. All you really need to do is find every td element whose class is topicViews; the loop then outputs the nodeValue of each node in the DOMNodeList returned by the query.

/* Use internal libxml errors -- turn on in production, off for debugging */
libxml_use_internal_errors(true);
/* Create a new DOMDocument object */
$dom = new DOMDocument;
/* Load the HTML */
$dom->loadHTMLFile("https://forums.eveonline.com");
/* Create a new DOMXPath object bound to the document */
$xpath = new DOMXPath($dom);
/* Query all <td> nodes whose class attribute is exactly "topicViews" */
$nodes = $xpath->query("//td[@class='topicViews']");
/* Set HTTP response header to plain text for debugging output */
header("Content-Type: text/plain");
/* Traverse the DOMNodeList object to output each DOMNode's nodeValue */
foreach ($nodes as $i => $node) {
    echo "Node($i): ", $node->nodeValue, "\n";
}
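
One caveat with the query above: @class='topicViews' only matches when the attribute is exactly that string. If a cell carried more than one class, a token-based test would be needed. The sketch below shows one common way to write it; the markup is a hypothetical sample, not the real forum HTML.

```php
<?php
// Hypothetical sample: one exact match, one cell with two classes,
// and one cell whose class merely starts with the same letters.
$html = '<table><tr>'
      . '<td class="topicViews">10</td>'
      . '<td class="topicViews hot">25</td>'
      . '<td class="topicViewsTotal">99</td>'
      . '</tr></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact attribute comparison: misses the cell with class="topicViews hot".
$exact = $xpath->query("//td[@class='topicViews']");

// Token comparison: pad the class list with spaces and look for " topicViews ".
$token = $xpath->query(
    "//td[contains(concat(' ', normalize-space(@class), ' '), ' topicViews ')]"
);

echo "Exact: ", $exact->length, ", token: ", $token->length, "\n";
foreach ($token as $node) {
    echo $node->nodeValue, "\n";
}
```

The padding with spaces is what keeps topicViewsTotal from matching, which a plain contains(@class, 'topicViews') would not do.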

Solution 2

A double slash ('//') makes XPath search all descendants at any depth instead of only direct children. So the XPath '//table' returns every table in the document. You can also use it deeper in the expression: 'html/body/div/div/form//table' returns all tables anywhere under 'html/body/div/div/form'.

This way you can make your code a bit more resilient to changes in the HTML source.
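
As a rough sketch of that difference, the snippet below runs three expressions against a small made-up document (the structure is an assumption for illustration, not the forum's real layout).

```php
<?php
// Hypothetical document: one table nested inside a form, one directly under <body>.
$html = '<html><body>'
      . '<div><form><div><div><table><tr><td>a</td></tr></table></div></div></form></div>'
      . '<table><tr><td>b</td></tr></table>'
      . '</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Every table in the document, wherever it sits.
$all = $xpath->query('//table');

// Only tables somewhere below the form, however deeply nested.
$inForm = $xpath->query('/html/body/div/form//table');

// A fully spelled-out path: finds nothing, because the table is not
// a direct child of the form.
$strict = $xpath->query('/html/body/div/form/table');

echo $all->length, ' ', $inForm->length, ' ', $strict->length, "\n";
```

If the site later adds or removes a wrapper div inside the form, the second expression keeps working while a fully spelled-out path would have to be rewritten.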

I do suggest learning a little XPath if you want to use it. Copying and pasting only gets you so far.

A simple explanation of the syntax can be found at w3schools.com/xml/xpath_syntax.asp

Author by

VixenSoul

Updated on November 22, 2020

Comments

  • VixenSoul
    VixenSoul over 3 years

    I'm currently trying to parse some data from a forum. Here is the code:

    $xml = simplexml_load_file('https://forums.eveonline.com');
    
    $names = $xml->xpath("html/body/div/div/form/div/div/div/div/div[*]/div/div/table//tr/td[@class='topicViews']");
    foreach($names as $name) 
    {
        echo $name . "<br/>";
    }
    

    Anyway, the problem is that I'm using a Google Chrome XPath extension to help me get the path, and I'm guessing that Chrome is changing the HTML enough that the path no longer matches when my website runs this search. Is there some way I can make the host look at the site the way Google Chrome does, so that it gets the right markup? What would you suggest?

    Thanks!

    • Manuel Schweigert
      Manuel Schweigert over 11 years
      Did you try disabling JavaScript in your web browser? Your PHP will not run it, so any change that JavaScript makes to the page in the browser will not be present in the HTML your server fetches.
    • GolezTrol
      GolezTrol over 11 years
      XPath is for XML, not for HTML.
    • VixenSoul
      VixenSoul over 11 years
      JS isn't being run on the page I'm running this. I understand that XPath is for XML, but from what I've seen through Google searches, it's popular to use for HTML as well.
  • Akshay Bajpei
    Akshay Bajpei almost 4 years
    How can I get the entire HTML of each matching tag as a string rather than as an array? In my case I am using the XPath '//math' to select all math tags in the HTML, which I later have to replace with images.
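
    One possible approach to the question above, sketched under assumptions: DOMDocument::saveHTML() accepts a node and returns that node's full markup as a string, and replaceChild() can swap the node for an img element. The sample HTML and the formula.png file name are hypothetical.

```php
<?php
// Hypothetical sample markup containing one <math> element.
$html = '<p>Area: <math><mi>x</mi></math> units</p>';

libxml_use_internal_errors(true); // <math> is unknown to libxml's HTML parser
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$markup = [];
// Snapshot the matches into an array before modifying the tree.
foreach (iterator_to_array($xpath->query('//math')) as $node) {
    // Full outer markup of the matched element, as a string.
    $markup[] = $dom->saveHTML($node);

    // Build the replacement image; the src value is a placeholder.
    $img = $dom->createElement('img');
    $img->setAttribute('src', 'formula.png');
    $img->setAttribute('alt', $node->textContent);
    $node->parentNode->replaceChild($img, $node);
}

print_r($markup);
echo $dom->saveHTML();
```

    How the image itself is produced from the math markup is outside what this sketch covers; it only shows extracting the string and swapping the node.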