How to parse an xml sitemap using PHP Curl and individually load each url

Solution 1

Your foreach doesn't assign each element to a loop variable; add an `as` clause so each `<url>` node is captured:

foreach ($xml->url as $url_list) {
    $url = $url_list->loc;
    echo $url;
}
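Extending that fix into the asker's pre-caching goal, a minimal sketch could look like the following. The inline sitemap and the example.com URLs are placeholders; in the real script $data would be the body returned by the curl_exec() call shown in the question:

```php
<?php
// Placeholder sitemap; in practice $data comes from curl_exec() on the sitemap URL.
$data = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://example.com/page-1</loc></url>
    <url><loc>http://example.com/page-2</loc></url>
</urlset>
XML;

$xml = new SimpleXMLElement($data);

// Collect every <loc>; the (string) cast turns the SimpleXMLElement into plain text.
$urls = array();
foreach ($xml->url as $url_list) {
    $urls[] = (string) $url_list->loc;
}

// Warm the cache with a plain GET per URL; the response body is discarded.
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_exec($ch);
    curl_close($ch);
}
```

Each request here is sequential; for a large sitemap you could batch them with curl_multi_init(), but a simple loop matches the one-off cache-warming use case.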

Solution 2

You don't need curl: use simplexml_load_file($sitemap_URL)... or use simplexml_load_string() with file_get_contents() and stream_context_create() for anything more complex than a plain GET.

... And there is no need for DOM traversal.

Parse it as an array in one line!

As the XML description at http://www.sitemaps.org/protocol.html shows, a sitemap is a simple tree with a natural array representation.

You can use a JSON round-trip as an XML-to-array reader:

$array = json_decode(json_encode(simplexml_load_file($sitemap_URL) ), TRUE);

Then traverse it with e.g. foreach ($array['url'] as $r) (check the exact structure with var_dump($array))... see also the PHP manual section on object iteration (oop5.iterations).
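As a concrete sketch of that one-liner, assuming a sitemap in the standard default namespace (the inline XML and example.com URLs are stand-ins for simplexml_load_file($sitemap_URL)):

```php
<?php
// Placeholder sitemap; in practice use simplexml_load_file($sitemap_URL) here.
$sitemap = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://example.com/page-1</loc></url>
    <url><loc>http://example.com/page-2</loc></url>
</urlset>
XML;

// One line: SimpleXMLElement -> JSON -> plain associative array.
$array = json_decode(json_encode(simplexml_load_string($sitemap)), TRUE);

// For a plain sitemap the entries sit under the 'url' key.
foreach ($array['url'] as $r) {
    echo $r['loc'] . "\n";
}
```

Note that this trick works for simple trees like sitemaps; namespaced extension elements (e.g. image data) are not carried through json_encode() and would need children() or XPath instead.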

PS: you can also narrow the node selection first with XPath in SimpleXML.
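For example, a sketch of such an XPath pre-selection (the inline XML is a placeholder; note that the sitemap's default namespace must be bound to a prefix before xpath() can match it):

```php
<?php
// Placeholder sitemap; in practice load it with simplexml_load_file($sitemap_URL).
$sitemap = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://example.com/a</loc></url>
    <url><loc>http://example.com/b</loc></url>
</urlset>
XML;

$xml = simplexml_load_string($sitemap);

// The sitemap schema is a default namespace, so bind it to a prefix for XPath.
$xml->registerXPathNamespace('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9');

// Select every <loc> node directly, skipping manual tree traversal.
$locs = array();
foreach ($xml->xpath('//sm:loc') as $loc) {
    $locs[] = (string) $loc;
    echo $loc . "\n";
}
```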

Author: Hedley Phillips

Updated on June 12, 2022

Comments

  • Hedley Phillips, almost 2 years ago

    I am trying to write a script that will read a remote sitemap.xml, parse the URLs within it, and then load each one in turn to pre-cache them for faster browsing.

    The reason behind this: The system we are developing writes DITA XML to the browser on the fly and the first time a page is loaded the wait can be between 8-10 seconds. Subsequent loads after that can be as little as 1 second. Obviously for a better UX, pre-cached pages are a bonus.

    Every time we prepare a new publication on this server or perform any testing/patching, we have to clear the cache so the idea is to write a script that will parse through the sitemap and load each url.

    After doing a bit of reading I have decided that the best route is to use PHP and curl. Whether this is a good idea or not I don't know. I'm more familiar with Perl, but neither PHP nor Perl is installed on the system at present, so I thought it might be nice to dip my toes in the PHP pool.

    The code I have grabbed off "teh internets" so far reads the sitemap.xml and writes it to an XML file on our server, as well as displaying it in the browser. As far as I can tell this just dumps the entire file in one go?

    <?php
    $ver = "Sitemap Parser version 0.2";
    echo "<p><strong>". $ver . "</strong></p>";
    
    
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://ourdomain.com/sitemap.xml;jsessionid=1j1agloz5ke7l?id=1j1agloz5ke7l');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $xml = curl_exec ($ch);
    curl_close ($ch);
    if (@simplexml_load_string($xml)) {
        $fp = fopen('feed.xml', 'w');
        fwrite($fp, $xml);
        echo $xml;
        fclose($fp);
    }
    ?>
    

    Rather than dumping the entire document into a file or to the screen, it would be better to traverse the XML structure and just grab the URLs I require.

    The xml is in this format:

    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9&#x9;http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
        <url>
            <loc>http://ourdomain.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_4</loc>
            <lastmod>2011-03-31T11:25:01.984+01:00</lastmod>
            <changefreq>monthly</changefreq>
            <priority>1.0</priority>
        </url>
        <url>
            <loc>http://ourdomain.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_9</loc>
            <lastmod>2011-03-31T11:25:04.734+01:00</lastmod>
            <changefreq>monthly</changefreq>
            <priority>1.0</priority>
        </url>
    

    I have tried using SimpleXML:

    curl_setopt($ch, CURLOPT_URL, 'http://onlineservices.letterpart.com/sitemap.xml;jsessionid=1j1agloz5ke7l?id=1j1agloz5ke7l');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec ($ch);
    curl_close ($ch);
    
    $xml = new SimpleXMLElement($data);
    $url = $xml->url->loc;
    echo $url;
    

    and this printed the first url to the screen which was great news!

    http://ourdomain.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_4

    My next step was to try and read all of the locs in the document so I tried:

    foreach ($xml->url) {
        $url = $xml->url->loc;
        echo $url;
    }
    

    hoping this would grab each loc within each url, but it produced nothing, and here I am stuck.

    Please could someone guide me towards grabbing the child of multiple parents, and then the best way to load each page and cache it, which I am assuming is a simple GET?

    I hope I have provided enough info. If I'm missing anything (apart from the ability to actually write PHP), please say ;-)

    Thanks.