Dan:
The database component is meant to provide the caching functionality. It would store the normalized form of the most recently fetched version of each feed. The first call for a given feed would retrieve the feed and cache it. For a period afterward, every subsequent call would retrieve the cached version. After that period the cached version would be deleted, and a further call would retrieve and cache a new copy of the feed. Unused feeds would be pruned by a SQL DELETE query run on every call or every few calls.
To be clear, this results in two potential scenarios. In one:
- the API is used to request a feed
- the cache is checked, but the cached copy is either out of date or nonexistent
- the feed is then retrieved, parsed, normalized, and returned
In two:
- the API is used to request a feed
- the cache is checked and a reasonably current cached version is found
- it is then returned.
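As a rough illustration of the two scenarios, here is a minimal sketch. It is written in Python with SQLite purely for brevity and is not the project's actual code. The table name feedcache, the columns content, fetched_at and last_used_at, the function names, and both time constants are assumptions made up for this example. The fetch argument stands in for the retrieve, parse, and normalize step.

```python
import sqlite3
import time

CACHE_TTL_SECONDS = 2 * 60 * 60           # assumed time-to-live, meant to be tweaked
PRUNE_AFTER_SECONDS = 7 * 24 * 60 * 60    # assumed cutoff for "unused" feeds


def open_cache(path=":memory:"):
    """Open the cache database and create the (hypothetical) table if needed."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS feedcache ("
        " normalizedfeedurl TEXT PRIMARY KEY,"
        " content TEXT NOT NULL,"
        " fetched_at REAL NOT NULL,"
        " last_used_at REAL NOT NULL)"
    )
    return db


def get_feed(db, url, fetch, now=None, ttl=CACHE_TTL_SECONDS):
    """Return the feed for url, from the cache if a current copy exists."""
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT content, fetched_at FROM feedcache WHERE normalizedfeedurl = ?",
        (url,),
    ).fetchone()

    if row is not None and now - row[1] < ttl:
        # Scenario two: a reasonably current cached version was found.
        content = row[0]
        db.execute(
            "UPDATE feedcache SET last_used_at = ? WHERE normalizedfeedurl = ?",
            (now, url),
        )
    else:
        # Scenario one: the cached copy is stale or missing, so fetch and store it.
        content = fetch(url)
        db.execute(
            "INSERT OR REPLACE INTO feedcache"
            " (normalizedfeedurl, content, fetched_at, last_used_at)"
            " VALUES (?, ?, ?, ?)",
            (url, content, now, now),
        )

    # Prune feeds nobody has asked for recently (here on every call).
    db.execute(
        "DELETE FROM feedcache WHERE last_used_at < ?",
        (now - PRUNE_AFTER_SECONDS,),
    )
    db.commit()
    return content
```

In this sketch the stale copy is simply overwritten rather than deleted first, which has the same visible result. The pruning query could just as easily run every few calls instead of every call.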
One problem is how to handle a user who subscribes to 20 RSS feeds that nobody else subscribes to (and that are therefore not cached). The page load would take as long as it takes to retrieve all 20 feeds. This is the case where the just-in-time-with-centralized-cache method doesn't work as well. The solution to this is an open question.
Another open question is what the default time-to-live for cached feeds should be. Whatever the initial value, it should definitely be a variable somewhere so it can be tweaked later, which means the initial choice isn't of immediate critical importance. One really cool idea, which is probably beyond this project's scope, would be to determine that value dynamically for different feeds based on observed update frequency. A more practical approach would be to make the value something like two or three hours and tweak it later (around testing time) if this doesn't seem to work for users.
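To make the dynamic idea a little more concrete, here is one possible way it could work, sketched in Python. Everything in it is an assumption for illustration: the function name estimate_ttl, the choice of the median gap between observed updates, and the minimum and maximum bounds. The default is the two-hour figure suggested above.

```python
from statistics import median

DEFAULT_TTL_SECONDS = 2 * 60 * 60   # fallback when there is no history
MIN_TTL_SECONDS = 15 * 60           # assumed lower bound
MAX_TTL_SECONDS = 24 * 60 * 60      # assumed upper bound


def estimate_ttl(update_times,
                 default=DEFAULT_TTL_SECONDS,
                 minimum=MIN_TTL_SECONDS,
                 maximum=MAX_TTL_SECONDS):
    """Guess a time-to-live from the times a feed was observed to change.

    update_times is a list of Unix timestamps. The estimate is the median
    gap between consecutive updates, clamped to [minimum, maximum].
    """
    times = sorted(update_times)
    gaps = [later - earlier
            for earlier, later in zip(times, times[1:])
            if later > earlier]
    if not gaps:
        return default
    return max(minimum, min(maximum, median(gaps)))
```

A feed observed to change roughly hourly would get a TTL of about an hour, while a feed with no recorded history would fall back to the fixed default.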
The differences between the URLs are envisioned as follows:
- feedurl is the URL as requested at the level of the API (Example: example.com/rss/)
- normalizedfeedurl would be a URL tweaked to abstract some of the minor differences that URLs for the same resource can have ("www" versus no subdomain, a trailing /, etc.) used as a match URL for querying the cache (Example: http://www.example.com/rss)
- siteurl is the URL of the site as retrieved from the feed (Example: http://www.example.com/)
It's an open question whether both the normalized and non-normalized versions need to be stored (at least in the beginning, for debugging), as well as whether that kind of normalization should be done at all and, if so, to what degree.
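For a sense of what a light version of that normalization might look like, here is a small Python sketch that reproduces the feedurl to normalizedfeedurl example above. The function name normalize_feed_url and the specific rules (default to http, lowercase the host, add "www." to a bare two-part host, strip trailing slashes, drop any fragment) are assumptions for illustration, not a settled design. The "www." rule in particular is naive and would mishandle hosts like example.co.uk.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_feed_url(feedurl: str) -> str:
    """Return a canonical form of feedurl for use as a cache lookup key."""
    url = feedurl.strip()
    if "://" not in url:
        # Assume plain HTTP when the caller left the scheme off.
        url = "http://" + url

    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # Naive rule: treat a two-part host such as example.com as www.example.com.
    if host.count(".") == 1:
        host = "www." + host
    if parts.port is not None:
        host = f"{host}:{parts.port}"

    # Drop trailing slashes so /rss and /rss/ match each other.
    path = parts.path.rstrip("/")

    # The fragment is discarded; the query string is kept as-is.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
```

With these rules, example.com/rss/ and http://www.example.com/rss both map to http://www.example.com/rss, so they would share one cache entry.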
itemposition would be a 1, 2, 3...N count of the position of the item in the feed as retrieved. This would provide an additional method of sorting. The publication date is an optional element in the RSS 2.0 specification, so it makes sense to use both values, giving itemdate priority. A SQL query that sorts items descending by itemdate, and then ascending by itemposition where dates are equal or missing, would achieve the desired effect for cached data. The same effect could be achieved with PHP's sorting functions for retrieved data.
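Here is a small Python sketch of that ordering for freshly retrieved items, with itemdate taking priority and itemposition as the fallback. The Item type and the sort_items function are hypothetical names for this example. Placing undated items after dated ones is also an assumption, since a feed that mixes dated and undated items has no obviously correct order.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class Item(NamedTuple):
    title: str
    itemposition: int              # 1-based position in the feed as retrieved
    itemdate: Optional[datetime]   # publication date, if the feed supplied one


def sort_items(items: List[Item]) -> List[Item]:
    """Order items newest first, falling back to feed position."""
    def key(item: Item):
        if item.itemdate is None:
            # Undated items go last, in their original feed order.
            return (1, 0.0, item.itemposition)
        # Negate the timestamp so that newer dates sort first.
        return (0, -item.itemdate.timestamp(), item.itemposition)

    return sorted(items, key=key)
```

If a feed supplies no dates at all, this reduces to plain itemposition order, which is the case the position counter is meant to cover.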
I'll update the spec to express all these ideas clearly sometime tomorrow or the day after. In particular, I've been doodling some flow charts to illustrate the proposed process; I'll transfer those to an image and post it as well.
Thank you for the good questions. Please let me know what you think of all this.
- Chris