Best way to parse RSS/Atom feeds with PHP

Solution 1

Your other options include:

Solution 2

I've always used the SimpleXML functions built into PHP to parse XML documents. It's one of the few generic parsers out there with an intuitive structure, which makes it extremely easy to build a meaningful class for something specific like an RSS feed. Additionally, it will detect XML warnings and errors, and upon finding any, you could simply run the source through something like HTML Tidy (as ceejayoz mentioned) to clean it up and try again.
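
To make the "detect errors, then clean up and retry" idea concrete, here is a minimal sketch of how the libxml error list can be collected instead of letting PHP emit warnings. The function name `loadFeedXml` and its by-reference `$errors` parameter are my own assumptions for illustration; only the `libxml_*` and `simplexml_load_string()` calls are PHP built-ins.

```php
<?php
// Hypothetical helper (not from the original answer): returns the parsed
// feed, or null on failure, and fills $errors with readable messages.
function loadFeedXml(string $xml, array &$errors = []): ?SimpleXMLElement
{
    // Collect parse errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $feed = simplexml_load_string($xml);

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode for the rest of the script
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // A null result is the point where a Tidy-style cleanup and retry could go
    return $feed === false ? null : $feed;
}
```

A caller would check for `null`, inspect `$errors`, and only then decide whether to clean the source and try again or to reject the feed.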

Consider this very rough, simple class using SimpleXML:

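class BlogPost
{
    public $date;    // raw pubDate string from the feed
    public $ts;      // pubDate as a Unix timestamp (false if unparseable)
    public $link;

    public $title;
    public $text;
    public $summary; // plain-text excerpt, set by BlogFeed
}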

class BlogFeed
{
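    public $posts = array();

    public function __construct($file_or_url)
    {
        $file_or_url = $this->resolveFile($file_or_url);
        // Bail out with an empty $posts list if the feed cannot be loaded or parsed
        if (!($x = simplexml_load_file($file_or_url)))
            return;

        foreach ($x->channel->item as $item)
        {
            $post = new BlogPost();
            $post->date  = (string) $item->pubDate;
            // Cast explicitly: strtotime() expects a string, not a SimpleXMLElement
            $post->ts    = strtotime((string) $item->pubDate);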
            $post->link  = (string) $item->link;
            $post->title = (string) $item->title;
            $post->text  = (string) $item->description;

            // Create summary as a shortened body and remove images, 
            // extraneous line breaks, etc.
            $post->summary = $this->summarizeText($post->text);

            $this->posts[] = $post;
        }
    }

    private function resolveFile($file_or_url) {
        if (!preg_match('|^https?:|', $file_or_url))
            $feed_uri = $_SERVER['DOCUMENT_ROOT'] .'/shared/xml/'. $file_or_url;
        else
            $feed_uri = $file_or_url;

        return $feed_uri;
    }

    private function summarizeText($summary) {
        $summary = strip_tags($summary);

        // Truncate summary line to 100 characters
        $max_len = 100;
        if (strlen($summary) > $max_len)
            $summary = substr($summary, 0, $max_len) . '...';

        return $summary;
    }
}
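
The class above only walks `channel->item`, so it handles RSS but not Atom. As a rough sketch of how both could be normalized, the function below branches on the root element name. The name `readFeedItems` and the array keys are my own assumptions, and it assumes the Atom feed declares its namespace as the default (unprefixed) one, which is the common case; Atom content of type `xhtml` would need extra handling.

```php
<?php
// Hypothetical helper (not from the original answer): turns RSS 2.0 items
// or Atom entries into plain arrays with the same keys.
function readFeedItems(SimpleXMLElement $x): array
{
    $posts = [];

    if ($x->getName() === 'feed') {
        // Atom: entries sit directly under <feed>; the link is an href attribute
        foreach ($x->entry as $entry) {
            $date = (string) $entry->published;
            if ($date === '') {
                $date = (string) $entry->updated;
            }
            $text = (string) $entry->content;
            if ($text === '') {
                $text = (string) $entry->summary;
            }
            $posts[] = [
                'title' => (string) $entry->title,
                'link'  => (string) $entry->link['href'],
                'date'  => $date,
                'text'  => $text,
            ];
        }
        return $posts;
    }

    // RSS 2.0: items sit under <channel>
    foreach ($x->channel->item as $item) {
        $posts[] = [
            'title' => (string) $item->title,
            'link'  => (string) $item->link,
            'date'  => (string) $item->pubDate,
            'text'  => (string) $item->description,
        ];
    }
    return $posts;
}
```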

Solution 3

With four lines, I import an RSS feed into an array.

$feed = implode(file('http://yourdomains.com/feed.rss'));
$xml = simplexml_load_string($feed);
$json = json_encode($xml);
$array = json_decode($json,TRUE);
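
The four lines above have no error handling, and in my experience CDATA content tends to go missing in the JSON round trip unless it is merged into text first. Here is a hedged sketch of the same approach wrapped in a function; `feedToArray` is a name I made up, and it takes the XML as a string so the fetching step stays separate.

```php
<?php
// Hypothetical wrapper (not from the original answer) around the
// SimpleXML -> JSON -> array round trip; returns null on any failure.
function feedToArray(string $xml): ?array
{
    $previous = libxml_use_internal_errors(true);
    // LIBXML_NOCDATA merges CDATA sections into ordinary text nodes
    $parsed = simplexml_load_string($xml, SimpleXMLElement::class, LIBXML_NOCDATA);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($parsed === false) {
        return null; // not well-formed XML
    }

    $json = json_encode($parsed);
    if ($json === false) {
        return null; // encoding failed, e.g. invalid UTF-8
    }

    return json_decode($json, true);
}
```

One thing to watch for: a channel with a single `item` comes back as one associative array rather than a list of one, so code that loops over `['channel']['item']` may need to handle both shapes.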

For a more complex solution:

$feed = new DOMDocument();
$feed->load('file.rss');

// Look up the channel once instead of on every line
$channel = $feed->getElementsByTagName('channel')->item(0);

$json = array();
$json['title'] = $channel->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$json['description'] = $channel->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
$json['link'] = $channel->getElementsByTagName('link')->item(0)->firstChild->nodeValue;
$items = $channel->getElementsByTagName('item');

$json['item'] = array();

foreach ($items as $key => $item) {
    $title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
    $description = $item->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
    $pubDate = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
    $guid = $item->getElementsByTagName('guid')->item(0)->firstChild->nodeValue;

    $json['item'][$key]['title'] = $title;
    $json['item'][$key]['description'] = $description;
    $json['item'][$key]['pubdate'] = $pubDate;
    $json['item'][$key]['guid'] = $guid;
}

echo json_encode($json);
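
The DOMDocument version above fails with an error as soon as an item lacks one of the expected elements (a missing `guid`, for example), because `item(0)` returns null. A small helper can absorb that. This is only a sketch: `firstText` is a hypothetical name of mine, and the inline feed is made-up sample data.

```php
<?php
// Hypothetical helper (not from the original answer): text of the first
// <$tag> descendant of $parent, or null when there is no such element.
function firstText(DOMElement $parent, string $tag): ?string
{
    $node = $parent->getElementsByTagName($tag)->item(0);
    return $node === null ? null : $node->textContent;
}

// Made-up sample feed; the second item deliberately has no <guid>
$doc = new DOMDocument();
$doc->loadXML(
    '<rss version="2.0"><channel><title>Sample</title>'
    . '<item><title>First</title><guid>id-1</guid></item>'
    . '<item><title>Second</title></item>'
    . '</channel></rss>'
);

$channel = $doc->getElementsByTagName('channel')->item(0);

$items = [];
foreach ($channel->getElementsByTagName('item') as $item) {
    $items[] = [
        'title' => firstText($item, 'title'),
        'guid'  => firstText($item, 'guid'),
    ];
}
```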

Solution 4

I would like to introduce a simple script to parse RSS:

$i = 0; // counter
$url = "http://www.banki.ru/xml/news.rss"; // url to parse
$rss = simplexml_load_file($url); // XML parser

// channel title + img with src
print '<h2><img style="vertical-align: middle;" src="'.$rss->channel->image->url.'" /> '.$rss->channel->title.'</h2>';

// RSS items loop
foreach ($rss->channel->item as $item) {
    if ($i < 10) { // print only the first 10 items
        print '<a href="'.$item->link.'">'.$item->title.'</a><br />';
    }

    $i++;
}
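
Since the script prints feed content straight into HTML, titles or links containing `&`, `<` or quotes can break the markup. Below is a minimal sketch of the same loop with escaping. The function name `renderItems` and its `$limit` parameter are assumptions of mine, not part of the script above.

```php
<?php
// Hypothetical helper (not from the original answer): renders up to
// $limit feed items as HTML links with escaped titles and URLs.
function renderItems(SimpleXMLElement $rss, int $limit = 10): string
{
    $html = '';
    $count = 0;

    foreach ($rss->channel->item as $item) {
        if ($count >= $limit) {
            break; // stop once the limit is reached
        }
        $html .= sprintf(
            '<a href="%s">%s</a><br />' . "\n",
            htmlspecialchars((string) $item->link, ENT_QUOTES, 'UTF-8'),
            htmlspecialchars((string) $item->title, ENT_QUOTES, 'UTF-8')
        );
        $count++;
    }

    return $html;
}
```

Usage would be something like `print renderItems(simplexml_load_file($url));` after checking that the load did not fail.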

Solution 5

If a feed isn't well-formed XML, you're supposed to reject it, no exceptions. You're entitled to call the feed's creator a bozo.

Otherwise you're paving the way to the mess that HTML ended up in.

Author by

carson

Updated on May 07, 2020

Comments

  • carson
    carson about 4 years

    I'm currently using Magpie RSS but it sometimes falls over when the RSS or Atom feed isn't well formed. Are there any other options for parsing RSS and Atom feeds with PHP?

  • Helen Neely
    Helen Neely over 14 years
    +1, you should not try to work around any XML that is not well-formed. We've had bad experiences with it; trust me, it was a big pain :(
  • artur
    artur over 14 years
  • Talvi Watia
    Talvi Watia almost 14 years
    you have an end-tag with no start tag. ;)
  • Brian Cline
    Brian Cline almost 14 years
    Well, I had one, but it was being eaten by SO's code formatter since it had no empty line above it. On a related note, you did not start your sentence with a capital letter. ;)
  • Kevin Pastor
    Kevin Pastor about 13 years
    However, programmers do not get to choose business partners and have to parse what they are given.
  • duality_
    duality_ almost 13 years
    I don't like such "answers", giving links without any comments. Looks like you google it and link to a few top results. Especially since the asker has some RSS experience and needs a better parser.
  • Tim
    Tim over 12 years
    Please change $feed_uri = $feed_or_url; to $feed_uri = $file_or_url; ... other than that, thank you for this code! It works great!
  • András Szepesházi
    András Szepesházi almost 12 years
    Note that while this solution is great, it'll only parse RSS feeds in its current form. Atom feeds will not be parsed due to their different schema.
  • ITS Alaska
    ITS Alaska about 11 years
    Note that eregi_replace is now deprecated and has been replaced with preg_replace as well as eregi with preg_match. Documentations can be found here and here respectively.
  • yPhil
    yPhil almost 11 years
    What if you're building a universal RSS/Atom feed reader? If any ill-formed XML file can "mess" your HTML, who is the bozo? ;) Be liberal in what you receive.
  • vladkras
    vladkras over 10 years
    I don't understand what cookHtmlSummarySoup() is for. Why not use strip_tags()?
  • Brian Cline
    Brian Cline over 10 years
    @ITSAlaska Thanks for the reminder. I think even back when I posted this in 2008 it was old code. I've updated it with preg_match accordingly.
  • Brian Cline
    Brian Cline over 10 years
    @vladkras Good question. Not sure where that wacky method name came from, looks like someone here edited it. I much prefer a built-in, so I've updated this to use strip_tags(). Thanks for the tip.
  • samayo
    samayo over 10 years
    I just tried it. It does not give an array
  • PJunior
    PJunior over 10 years
    can u give me the rss feed that u are using?
  • andrewk
    andrewk about 10 years
    In case you're wondering. It looks like he's using a tumblr rss feed. Anytumblrsite.com/rss would give you the same output.
  • Raptor
    Raptor about 10 years
    In case somebody needs a little bit advice, Last RSS is the easiest among the three listed above. Only 1 file to "require", and can fetch the RSS within 5 lines, with a decent array output.
  • Guidouil
    Guidouil about 10 years
    Used the 4 lines, did a great job :) but then I rewrote the 1st line : $feed = file_get_contents('http://yourdomains.com/feed.rss'); might be less intensive than file + implode
  • Will Bowman
    Will Bowman almost 10 years
    cant say its "great" using gzinflate and base64_decode, typically disabled for security.
  • Will Bowman
    Will Bowman almost 10 years
    one line, $feed = json_decode(json_encode(simplexml_load_file('news.google.com/?output=rss')), true);
  • Fluchtpunkt
    Fluchtpunkt about 9 years
    i really like the one-liner - was looking for something like that - what about error-handling?
  • gadelat
    gadelat about 7 years
  • noob
    noob about 7 years
    I've used two of them and LastRss seems not good enough providing a fully functional helper and SimplePie is a bit too complicated. I would like to try some others but comments to those libs are better for people to understand, not just links.
  • musicin3d
    musicin3d about 6 years
    Why on earth are we converting an object into an array???
  • John T
    John T over 4 years
    Clear and simple solution! Works nicely.
  • Sagive
    Sagive almost 4 years
    it's a dead link for marketing purposes.
  • Srinivas08
    Srinivas08 over 3 years
    rather than using $xml = simplexml_load_string($feed), this works pretty simple, in printing the data too ...