Request for comments: Feed aggregation library spec

by Chris Zubak-Skees -
Number of replies: 8
Hi all,

I'm a GSoC student developing an RSS/Atom parsing and retrieval library. The idea is to abstract away all the details (retrieval, caching, parsing, normalization, etc.) behind just a few functions. The proof of concept involves updating the RSS Block to use this new library.

I've developed a draft spec which outlines the project here: http://docs.moodle.org/en/Student_projects/Feed_aggregation_library

Since my target audience is developers and I'm new to Moodle development, I'd be interested in your feedback on all levels of the project. Some specific questions I'd love to get responses on are:
  • Is the design philosophy in line with Moodle's?
  • Would you use a library like this, and if so, how? If not, is there something I could change that would make it more useful to you?
  • What kind of data structure would be best to return results in? I'm thinking an instance of a class implementing a specialized linked list; is there something else you would recommend?
  • Am I missing anything from the spec?
Thanks in advance,
Chris
In reply to Chris Zubak-Skees

Re: Request for comments: Feed aggregation library spec

by Dan Poltawski -
Hi Chris,

Great work! I have a few questions:

* It's not immediately obvious to me what the benefit is of storing all of:

normalizedfeedurl
feedurl
siteurl

in feed_urls. Can you explain that further?

* How would itemposition work? Wouldn't itemdate be used for positioning?


* How long do we store feed items in the database? Is the database just used to cache current items present in the RSS feed?

Over time, an RSS feed with a large amount of content could grow quite large, and if there are lots of feed_urls that aren't used by any part of Moodle but are continually checked for updates, the overhead could also become an issue.
In reply to Dan Poltawski

Re: Request for comments: Feed aggregation library spec

by Chris Zubak-Skees -
Dan:

The database component is meant to provide the caching functionality. It would store the normalized version of the most recently fetched copy of each feed. The first call for a given feed would retrieve the feed and cache it. For a period afterwards, every subsequent call would return the cached version. After that period, the cached version would be deleted, and a further call would retrieve and cache a fresh copy of the feed. Unused feeds would be pruned by a SQL DELETE query run on every call or every few calls.

To be clear, this results in two potential scenarios. In one:
  1. API is used to request a feed
  2. the cache is checked, but the entry is stale or nonexistent
  3. the feed is then retrieved, parsed, normalized, and returned
In two:
  1. the API is used to request a feed
  2. the cache is checked and a reasonably current cached version is found
  3. it is then returned.

One problem is how to handle a user subscribed to 20 RSS feeds that nobody else subscribes to (and which are thus not cached). That page load would take as long as retrieving all 20 feeds. This is where the just-in-time-with-centralized-cache method doesn't work as well. The solution is an open question.

Another open question is what the default time-to-live for cached feeds should be. Regardless of the initial value, it should definitely be a variable somewhere so it can be tweaked later, and so the initial value isn't of immediate critical importance. One really cool idea, that is probably beyond this project's scope, would be to determine that value dynamically for different feeds based on observed update frequency. A more practical approach would be to make the value something like two or three hours and tweak it later (around testing time) if this doesn't seem to work for users.
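To make the two scenarios and the TTL concrete, here is a minimal sketch (in Python for brevity, though the library itself would be PHP). The helper functions, the in-memory dict standing in for the database table, and the two-hour default are all placeholders, not part of the spec:

```python
import time

FEED_TTL_SECONDS = 2 * 60 * 60  # assumed default; the post suggests two or three hours

_cache = {}  # normalized feed URL -> (fetched_at, items); stands in for the db table

def normalize_url(url):
    # Placeholder normalization: lowercase, drop "www.", drop any trailing "/"
    return url.lower().replace("://www.", "://", 1).rstrip("/")

def fetch_and_parse(url):
    # Placeholder for real retrieval + parsing + normalization
    return ["item from " + url]

def get_feed(url, now=None):
    """Scenario two: a fresh cached copy exists -> return it.
       Scenario one: cache miss or stale entry -> fetch, cache, return."""
    now = time.time() if now is None else now
    key = normalize_url(url)
    entry = _cache.get(key)
    if entry is not None and now - entry[0] < FEED_TTL_SECONDS:
        return entry[1]
    items = fetch_and_parse(url)
    _cache[key] = (now, items)
    return items
```

The pruning of unused feeds (the periodic SQL DELETE mentioned above) is omitted here; in this sketch it would amount to dropping cache entries whose fetched_at is older than some cutoff.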

The differences between the URLs are envisioned as follows:
  • feedurl is the URL as requested at the level of the API (Example: example.com/rss/)
  • normalizedfeedurl would be a URL tweaked to abstract some of the minor differences that URLs for the same resource can have ("www" versus no subdomain, a trailing /, etc.) used as a match URL for querying the cache (Example: http://www.example.com/rss)
  • siteurl is the URL of the site as retrieved from the feed (Example: http://www.example.com/)

It's an open question whether the normalized and non-normalized versions both need to be stored (at least in the beginning, for debugging), as well as whether that kind of normalization should even be done, and if so, to what degree.
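For illustration, the light-touch normalization described above might look something like this sketch (Python for brevity; it only handles the cases mentioned in the list, and the exact rules remain the open question):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_feed_url(feedurl):
    """Reduce minor URL variations to one canonical match key for cache
    lookups. Handles only the cases mentioned in the post: a missing
    scheme, a "www." prefix, and a trailing slash."""
    if "://" not in feedurl:
        feedurl = "http://" + feedurl  # assume http when no scheme is given
    scheme, netloc, path, query, fragment = urlsplit(feedurl)
    netloc = netloc.lower()
    if netloc.startswith("www."):
        netloc = netloc[4:]
    path = path.rstrip("/")
    return urlunsplit((scheme.lower(), netloc, path, query, fragment))
```

With these rules, example.com/rss/, http://www.example.com/rss, and http://example.com/rss all map to the same cache key.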

itemposition would be a 1, 2, 3...N count of the item's position in the feed as retrieved. This provides an additional method of sorting. The publication date is an optional element in the RSS 2.0 specification, so it makes sense to use both fields, giving itemdate priority. A SQL query sorting items first descending by itemdate and then ascending by itemposition would achieve the desired effect for cached data; the same ordering could be achieved with PHP's usort for freshly retrieved data.
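Taking the stated intent of giving itemdate priority with itemposition as the tiebreaker, the sort for already-retrieved items could be sketched like this (Python for brevity; field names are from the draft spec, and treating a missing date as sorting last is my assumption):

```python
def sort_items(items):
    """Sort feed items newest-first by itemdate, falling back to the
    original feed order (itemposition) when dates are missing or tied.
    Items without a date (itemdate is None) sort after dated ones."""
    return sorted(
        items,
        key=lambda i: (
            i["itemdate"] is None,   # dated items before undated ones
            -(i["itemdate"] or 0),   # newest date first
            i["itemposition"],       # then original feed order
        ),
    )
```

The cached-data equivalent would be roughly ORDER BY itemdate DESC, itemposition ASC, though how NULL dates sort varies between databases.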

I'll update the spec to express all these ideas clearly sometime tomorrow or the day after. In particular, I've been doodling some flow charts to illustrate the proposed process; I'll transfer them to an image and post that as well.

Thank you for the good questions. Please let me know what you think of all this.

- Chris
Average of ratings: Useful (1)
In reply to Chris Zubak-Skees

Re: Request for comments: Feed aggregation library spec

by Dan Poltawski -
Thanks for the clarifications Chris.

The Magpie RSS library which we currently use utilises an on-disk cache of items, which is initially how I'd envisaged this being achieved, mostly because letting the library take care of the caching for us significantly reduces the complexity (and avoids yet another db table). I think a real advantage of storing the items in the database is to allow additional capabilities to be added on top (e.g. tagging of feeds across Moodle).

It's worth noting that you have cron at your disposal to help prevent interactive requests from requiring remote feed fetches. So, in an ideal world (where cron is being run regularly), we can fetch from remote sites in cron (which doesn't slow down page loads) and serve only cached content to interactive users. (The RSS block currently does this.)
In reply to Dan Poltawski

Re: Request for comments: Feed aggregation library spec

by Chris Zubak-Skees -
Magpie or SimplePie or other libraries like it are certainly under consideration for use as parsing libraries. Whichever library does the most, does it well, and meets the criteria that the spec articulates would be suitable. Certainly, if it handles caching and retrieval already then so much the better. The decision of where to cache (database or disk) would then depend on the library selected. If I end up writing it, I would much prefer the database. SQL queries give a lot of flexibility that flat files do not (at least, not without a lot of additional work).

There are merits both to using cron and to not using it. With cron, you've shaved some time off user interaction and eliminated the case where somebody with a large number of feeds, or slow-loading feeds, notices a slowdown. Without cron, you know that somebody actually wants this feed right now, which avoids the problem you articulated where a lot of feeds (potentially thousands) that aren't used are downloaded and stored every hour or so. You can mitigate both sets of problems through various techniques, but it's still a trade-off.
In reply to Chris Zubak-Skees

Re: Request for comments: Feed aggregation library spec

by sam marshall -
I'm sorry I don't have time to examine this spec right now but, just for information, my employer (the Open University) are planning to release a new block called 'newsfeed' to contrib. [It's actually an update of an earlier block, but nobody wants to use that earlier one as it was written for Moodle 1.6, then badly hacked into 1.7-roles-system compliance, and the install process was a complete mess.]

This release has been delayed a bit due to backup/restore bugs - and we are planning to do further tidying up so we might not release it publicly until later in the year. (Btw our students and staff have been using it for the last 2 years! It does the job, just its Moodle-compliance is not everything that might be hoped for.)

Anyway, this block is basically an alternative to the RSS block, except (a) it has fewer bugs ;) and (b) you can also create 'internal' feeds where you post messages to them yourself directly in Moodle. As well as displaying the feed, it generates its own for students [in Atom format], so that you can have e.g. an RSS feed generated by an internal behind-firewall system which then becomes a public Atom feed for students. And - this is where it's got something slightly to do with your spec - it can aggregate to include the feeds from any other instance of the block.

I think this is probably not directly relevant to your project and I am not suggesting that you should use any of our code (obviously you are welcome to but I don't think it will be useful) but if you are interested in it - maybe it might be useful to see the user interface or something - then please do get in touch.

You can email me at s.marshall (the domain is open.ac.uk). I can provide a prerelease version of the current build [which installs easily into Moodle 1.9]. It might take me a few days to get back to you, depending on how many bugs I need to fix just to make it work properly on our standard Moodle test server, as nobody's tried that yet :) But it's probably ok...

--sam
Average of ratings: Useful (1)
In reply to sam marshall

Re: Request for comments: Feed aggregation library spec

by Chris Zubak-Skees -
Sam:

Thanks for the heads up. I think you're right that these two projects aren't mutually exclusive. I'll try to keep abreast of any developments with you folks so I'm not reinventing the wheel.

- Chris
In reply to Chris Zubak-Skees

Re: Request for comments: Feed aggregation library spec

by Bill Fitzgerald -
Hello, Chris,

You might want to check out some of the work that was done within the Drupal community on this last year as an SoC project -- many of the same questions were asked, and discussed in depth.

For the Drupal project, the end result was the FeedAPI -- it's an amazing piece of code, and includes pluggable parsers, and feed element mapping.

The FeedAPI module is here: http://drupal.org/project/feedapi

Two of the discussions that took place in the planning process that might be helpful:

http://groups.drupal.org/node/4624 and http://groups.drupal.org/node/4844

There is also a group that discusses issues related to rss and aggregation: http://groups.drupal.org/rss-aggregation
Average of ratings: Useful (1)