Perhaps part of the reason we are disagreeing is that we have not agreed on the requirements. With this API there are two people involved:
All the contributors who maintain all the plugins that will need to feed documents to the search system (including me); and
the people who build and maintain the search system (you).
I think the requirement is to make it as simple as possible for group 1, and I don't care how difficult it makes life for group 2. (I try to take a similar attitude when building the question engine: it is the people who write question types who are important.)
I agree, and in fact my goal is to make it as simple as possible for module creators. I've simplified the API to only those 3 functions for this exact reason. I thought this requirement went without saying, but to make it official I've added it to the wiki.
And, you seem to be asking plugin developers to implement some sort of asynchronous iterator, and that is hard and yucky.
See the actual implementation for forum, I'm not sure how it could be any easier than this:
function forum_gs_iterator($from = 0) {
    global $DB;
    // Return all posts modified since $from, oldest first.
    $sql = "SELECT id, modified FROM {forum_posts} WHERE modified > ? ORDER BY modified ASC";
    return $DB->get_recordset_sql($sql, array($from));
}
The implementation for your questions would be:
function question_gs_iterator($from = 0) {
    global $DB;
    // The question table stores its timestamp as timemodified, so filter and sort on that column.
    $sql = "SELECT id, timemodified AS modified FROM {question} WHERE timemodified > ? ORDER BY timemodified ASC";
    return $DB->get_recordset_sql($sql, array($from));
}
By the way, sam explained above why iteration based on timestamp will not work. Did you get that? I can explain in more detail if you like.
That is correct - this iterator will not be enough for restored backups - do you have some more examples of what else may not work?
There is no need to add queuing of documents one by one (which I'm trying to avoid here as much as I can - and it would be required otherwise - please read on), as this can be solved in at least two other ways:
1. By doing full re-indexing.
2. By adding another, id-based iterator:
SELECT id FROM <table> WHERE id > ? ORDER BY id ASC
It could be another API function, but I would make it optional to implement - handling course restore with data would probably cover < 1% of the use cases. But anyway - as long as the module always generates incrementing IDs, the problem is solved.
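For forum, that optional function could be as small as this (the *_gs_id_iterator name is just an illustration, not something I've added to the wiki):

function forum_gs_id_iterator($fromid = 0) {
    global $DB;
    // Same idea as the timestamp iterator above, but ordered by id, so records
    // restored with old timestamps are still picked up.
    $sql = "SELECT id FROM {forum_posts} WHERE id > ? ORDER BY id ASC";
    return $DB->get_recordset_sql($sql, array($fromid));
}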
Also, I did not say "queue all the documents in a Moodle site, then work through the queue" as a solution for initial indexing. My proposal is that the search system manages this process (a bit at a time, on cron):
1. Mark all contexts as requiring indexing.
2. For the first contextid that requires indexing, queue all the documents from that contextid.
3. Process the queue until it is empty.
4. If there are more unindexed contexts, go to 2.
Typically, step 2 will queue a few tens, hundreds or thousands of documents. That is all.
I think that for plugin authors 'Tell me all your documents from this contextid' is conceptually very simple to think about, and therefore easy to implement.
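As I understand it, that cron loop would look roughly like this (the gs_* helpers are made-up names, only there to make the discussion concrete):

function gs_initial_indexing_cron() {
    // Step 1 is assumed to have happened once, when search was enabled.
    while ($contextid = gs_get_next_unindexed_context()) {
        gs_queue_documents_from_context($contextid); // step 2
        gs_process_queue_until_empty();              // step 3
        gs_mark_context_indexed($contextid);
        // step 4: loop back while unindexed contexts remain
    }
}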
Here are the problems with this idea:
1. What is a context? Think bigger than just a module: what would the context be for a list of Moodle users? In any case, by context you mean some sort of grouping.
2. By marking you mean queuing - so you need to implement a queue here, which may not be necessary at all.
3. One "context" from the queue would need to be processed in one run. How will you guarantee that context is not too big? If 1 context = 1 forum then what happens if the forum is too big to be indexed at once? The indexed will never finish. Processing DB entires is quicks but think about a forum with a lot of word documents attached.
4. What happens when a forum post is updated? You will need to schedule it in another queue for a single document - so we have 2 queues already (read on for why you have to queue it).
In terms of handling updates, how will the search system know what index updates are required by a post_created event? That is impossible unless you put a lot of forum-specific knowledge into the search engine, and if you do that, third-party plugins can't be searchable.
You have to have the plugins translating the changes in their data into CRUD actions on search_documents.
See point 8 from my previous post - it can be done without any forum-specific knowledge (except for the name of the event to listen to).
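To illustrate, the only per-plugin piece would be something like this (the function and event names are made up for the example):

function forum_gs_events() {
    // The plugin hands the search system nothing but the list of events that
    // mean "my data changed"; everything else stays generic on the search side.
    return array('forum_post_created', 'forum_post_updated', 'forum_post_deleted');
}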
Again, for me, it seems conceptually much simpler if the plugin author has to write a few simple, synchronous search API calls in with the back-end code. For example, when a user makes a forum post:
https://github.com/moodle/moodle/blob/master/mod/forum/post.php#L661
There is add_to_log, and updates of completion status, and pushing files to the files API. Why not one line to push the new data to the search system too? Forcing the search update code to be somewhere else just makes it harder to find and maintain. (Mind you, that post to forum code is pretty old and messy. Roll on Forum NG.)
Even with a call like this, you still need to queue the document, and that's why I think this call is not needed at all. You may think that adding to the search index is always quick and can be done synchronously, but consider for example a forum post with a big Word document attached. Processing it may take 1 minute, and you wouldn't want a user to wait 1 minute to finish adding a post. Or think about PDF - normally extracting text from a PDF is quick, but if you want to do it properly and connect an OCR utility then you're talking about minutes of processing. Even for a short forum entry, think about using a remote search engine (e.g. SOLR) - a web service call must be made to update it, so you are adding 1 second to the user's wait time. On top of this, add the problems with locking - if the search index is locked while being updated then we're escalating the problem. I think that in general everything possible should be done asynchronously to make the user's experience as good as possible.
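In other words, even if a synchronous call existed, all it could safely do at post time is record that a document changed and let cron do the heavy lifting later. A rough sketch, with the queue table and function name made up:

function gs_queue_document($component, $docid) {
    global $DB;
    // Cheap, synchronous part: just remember that this document needs (re)indexing.
    $record = new stdClass();
    $record->component  = $component;
    $record->docid      = $docid;
    $record->timequeued = time();
    $DB->insert_record('gs_queue', $record);
    // The expensive part (extracting text from attachments, talking to SOLR)
    // happens later on cron, outside the user's request.
}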
Also, in the situation where we are still doing initial indexing, so half the contexts are indexed and half not, it is easy for the search system to just ignore any update requests that affect contexts that have not been indexed yet, while recording the changes to things that are already in the index.
That's a good idea and can be useful for any implementation - I will keep it in mind.
Anyway, I am sitting here thinking "How easy would it be for me to use the global search stuff in the question bank and the quiz", and that is why I don't like your API.
No matter what, you have to implement something like "mod_gs_get_documents" and "mod_page_gs_access", so I guess you don't like the iterator again - but its implementation is extremely simple - see above, or the example for quiz:
function quiz_gs_iterator($from = 0) {
    global $DB;
    // The quiz table stores its timestamp as timemodified, so filter and sort on that column.
    $sql = "SELECT id, timemodified AS modified FROM {quiz} WHERE timemodified > ? ORDER BY timemodified ASC";
    return $DB->get_recordset_sql($sql, array($from));
}
C'mon - it can't get any simpler than this. Plus, for the core modules, I will write the API myself and send it to the component maintainers for review & integration.
Thanks for the input so far - cheers,
Tomasz (Tomek) Muras