Using cURL and DOM to scrape data with PHP
I wonder if your problem is in the line:
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), 'classname')]");
As it stands, this literally looks for nodes that belong to the class with the name 'classname' - where 'classname' is not a variable, it's the actual name. This looks like you might have copied an example from somewhere - or did you literally name your class that?
I imagine that the data you are looking for may not be in such nodes. If you could post a short piece of the actual HTML you are trying to parse, it should be possible to do a better job guiding you to a solution.
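To make the point about the literal class name concrete, here is a minimal sketch. The markup and the class names in it are made up for illustration, since I haven't seen your real HTML. It puts the class name in a variable and compares a substring match (the form used in your query) with a whole-token match, where the name is padded with spaces.

```php
<?php
// Hypothetical sample markup; these class names are assumptions, not taken from any real page.
$html = '<div class="question-summary">A</div>'
      . '<div class="question hot">B</div>'
      . '<div class="answer">C</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// The class name lives in a variable instead of being hard-coded in the query.
$class = 'question';

// Substring match: any element whose class attribute contains "question" anywhere.
$loose = $finder->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), '$class')]"
);

// Whole-token match: the class list must contain "question" as a separate word.
$exact = $finder->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]"
);

echo $loose->length, PHP_EOL; // 2 (question-summary and question hot)
echo $exact->length, PHP_EOL; // 1 (only "question hot")
```

The substring form is what gave me a big harvest in the script further down; the padded form is probably what you want once you know the exact class name.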
As an example, I just made the following complete script. It is based on yours, but adds code to open the stackoverflow.com home page and changes 'classname' to 'question'. There seemed to be a lot of classes with question in the name, so I figured I should get a good harvest. I was not disappointed.
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://stackoverflow.com");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
//print_r($output);
$dom = new DOMDocument();
@$dom->loadHTML($output);
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), 'question')]");
print_r($nodes);
$tmp_dom = new DOMDocument();
foreach ($nodes as $node)
{
$tmp_dom->appendChild($tmp_dom->importNode($node,true));
}
$innerHTML = trim($tmp_dom->saveHTML());
$buffdom = new DOMDocument();
@$buffdom->loadHTML($innerHTML);
# Iterate over all the <a> tags
foreach($buffdom->getElementsByTagName('a') as $link) {
# Show the link text (nodeValue), not the href attribute
echo $link->nodeValue, PHP_EOL;
echo "<br />";
}
?>
This resulted in many, many lines of output. Try it - the page is at http://www.floris.us/SO/scraper.php (or paste the above code into a page of your own). You were very, very close!
NOTE - this doesn't produce all the output you want - you need to include other properties of the node, not just print out the nodeValue, to get everything. But I figure you can take it from here (again, without actual samples of your HTML it's impossible for anyone else to get much further than this in helping you...).
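For what it's worth, here is a rough sketch of what "other properties" could look like. Everything about the markup is an assumption on my part (the 'item' and 'desc' class names and the structure are hypothetical), so treat it as a starting point rather than a solution for your page. It reads the href with getAttribute() and runs XPath queries relative to each matched node, which avoids copying the nodes into a temporary document.

```php
<?php
// Hypothetical markup standing in for the real page, which was not posted.
$html = '<div class="item"><a href="/q/1">First title</a><p class="desc">First text</p></div>'
      . '<div class="item"><a href="/q/2">Second title</a><p class="desc">Second text</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

$items = [];
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]");

foreach ($nodes as $node) {
    // The leading "." makes each query relative to $node, passed as the context node.
    $link = $finder->query('.//a', $node)->item(0);
    $desc = $finder->query(
        ".//*[contains(concat(' ', normalize-space(@class), ' '), ' desc ')]",
        $node
    )->item(0);

    // Stack the fields for one item together; missing pieces become null.
    $items[] = [
        'title'       => $link instanceof DOMElement ? trim($link->nodeValue) : null,
        'url'         => $link instanceof DOMElement ? $link->getAttribute('href') : null,
        'description' => $desc instanceof DOMElement ? trim($desc->nodeValue) : null,
    ];
}

print_r($items);
```

Each entry in $items then holds the title, URL and description for one matched node, and you can add further fields (tags, for example) with more relative queries of the same shape.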
Admin
Updated on June 04, 2022
Hi, I am using cURL to get data from a website. I need to get multiple items but cannot get them by tag name or id. I have managed to put together some code that gets one item using a class name by passing it through a loop; I then pass it through another loop to get the text from the element.
I have a few problems here. The first is that I can see there must be a more convenient way of doing this. The second is that I will need to get multiple elements and stack them together, i.e. title, description, tags and a URL link.
# Create a DOM parser object and load HTML
$dom = new DOMDocument();
$result = $dom->loadHTML($html);
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), 'classname')]");
$tmp_dom = new DOMDocument();
foreach ($nodes as $node) {
    $tmp_dom->appendChild($tmp_dom->importNode($node,true));
}
$innerHTML = trim($tmp_dom->saveHTML());
$buffdom = new DOMDocument();
$buffdom->loadHTML($innerHTML);
# Iterate over all the <a> tags
foreach ($buffdom->getElementsByTagName('a') as $link) {
    # Show the <a href>
    echo $link->nodeValue, "<br />", PHP_EOL;
}
I want to stick with PHP only.