Global Search rewrite

by Tomasz Muras -

Hi,

I'm currently working on a new version of the Global Search for Moodle. It's on the agenda for the November dev meeting but maybe you would like to review it before that. I would appreciate any feedback.

I've started documentation on the wiki and working POC code is on github.

There is still a lot of work to be done there - especially on the modules side. I will create git commits for most of the core modules, and will also need some minor bug fixes and updates to core Moodle - but the end result should be very similar to what I've described on the wiki.

One of my main motives for the re-write is scalability. A feature like this is most helpful on the big/huge Moodle installations, so it has to scale up.

My plan is to have new Global Search and core modules supporting it integrated with Moodle 2.3 release.

cheers,
Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Tim Hunt -

In terms of the API that modules have to implement, have you looked at the one that ousearch requires? https://github.com/moodleou/moodle-local_ousearch/blob/master/doc/usage.html

You may not want to use the ousearch internals, but that API has worked successfully over a number of years, so it is probably worth taking a look.

In reply to Tim Hunt

Re: Global Search rewrite

by Tomasz Muras -

Do you ever sleep, Tim? smile

The API is pretty similar - as I imagined; after all, it would be difficult to come up with something completely different. I see that you're synchronously pushing data from the modules into the search index, while I'd like to do the reverse (indexing run from cron and querying modules for new data). If you were to implement it again today, would you use Moodle events?

I was thinking about using events for updating the index (e.g. when a document is deleted) but hopefully there will be no need for this - so modules won't need to implement triggering new events, just 3 simple API functions.

I see that you're handling security in the search itself? I would like to push the decision about a document being available to the current user down to the module. In some more complex cases only the module will be able to say whether access should be granted or not.

I would also like to abstract the search engine from the "modules" altogether. For instance, a search engine like Lucene could nicely handle other searches, such as user searches. All user data (including custom profile fields) could be compiled into the search index, and admins could get a new capability of searching for users. The same could be done for course search, blocks, etc.

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by sam marshall -

Regarding the synchronous behaviour, our users require this for dynamic activities and I suspect others would too. For static content, it's usually OK if the search crawl updates it overnight (we do use a separate, standard search engine to index files and other static content on moodle websites), but for dynamic content such as a forum, people expect that if a user (...or themselves) posted something 2 minutes ago and they search, it should be included. Same for wiki etc.

The system of updating immediately generally works well; it can make users wait a little if they edit a longish wiki page (as it may have to insert a significant amount of junk into the index, taking several seconds) but hey, that's one way of teaching them that this wiki page is TOO LONG and should be broken up.

Re events, I implemented it, and personally I don't trust Moodle events. If I were to reimplement it, maybe somebody would convince me, but I'm not sure. smile

Re handling security in the search itself: this is a hybrid model. The search handles default security (which course, $cm, group etc. the user has access to) within the database, making it supposedly efficient; but after that it is also possible for modules to decide that a result should not be shown to a particular user, by setting ->hide=true on the result returned by the get_document method. See usage.html in the doc folder.
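
Roughly, the module side of that veto looks like this - a sketch only, with illustrative names (example_user_can_view is made up; see usage.html for the real signatures):

// The engine has already applied the default course/$cm/group checks
// in SQL; the module gets the final say on each result.
function mod_example_ousearch_get_document($result) {
    global $DB;

    $post = $DB->get_record('example_posts', array('id' => $result->intref));
    // Hypothetical module-specific rule that the generic SQL check
    // cannot express (e.g. time-limited access).
    if (!$post || !example_user_can_view($post)) {
        $result->hide = true;
        return $result;
    }
    $result->title = $post->subject;
    $result->content = $post->message;
    return $result;
}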

The way ousearch is designed, it would certainly be possible to add support for searching things other than course-modules. For example, you could probably do this without changing it at all, just by using a suitable plugin name (say core_user), setting the other data to the site course or something like that, and using the 'id within module' fields to define the document (i.e. the userid in this case).

This is not to say that ousearch is a fantastic search engine. It is certainly the best generic search engine that works in Moodle 1.9 and 2.x right now... but it is also the only one. smile There are two problems:

1. People complain that it lacks 'clever' features such as mistype correction and automatic plurals->singular equivalence. This is basically because, uh, I wrote it in a few weeks, I'm not Google or whatever; we don't have the resources to make a super-quality search engine, only a basic one. Similarly, they would really like if it searched files, so for instance if you attach a file to a forum post with the word 'frog' in the file, they'd like that forum post to show up. My response is 'Tough'. smile

2. Using this database-independent SQL search generally works well but when your tables get quite large (we have 1.3m words, 5.5m searchable 'documents', 350m word occurrences; that last one is a 37GB table) then, on our current database setup, we do currently have performance problems for some kinds of searches. It is possible to resolve these problems; we need to do a bit more work on it...

--sam

In reply to sam marshall

Re: Global Search rewrite

by Tomasz Muras -

Hi Sam,

Lucene should deal with 1 and 2, and it brings more advantages besides - see its search capabilities, for example. It can also give us the matched text fragment with highlighting.

I think that Global Search as implemented for Moodle 1.9 works just fine - I'm writing something very similar to it, I just want to improve a few things.

I was hoping to use Moodle events to get content indexed faster, so what you're saying is a bit worrying :D. In general, it looks to me like a proper event system should be implemented in Moodle - it can come in very handy in a system this size. Maybe we should look at some of the existing implementations now (as events are not used much at the moment) instead of implementing our own? For example, see the event dispatcher from Symfony.

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by sam marshall -

I had the impression that 'global search' in moodle 1.9 generally does not work, or works only in extremely limited situations and if there is an R in the month. smile

ousearch does give a summary below each result with highlighting of the matched words/phrases (from the part of the document where most of them are found within the summary word length).

By the way, another interesting and not always obvious factor is that search is a security-critical function: if the search engine returns results that users are not allowed to access, and it includes summary text, that is a security problem (especially if it's possible to manipulate the summary text by a clever choice of search phrase, so you can get it to show the next bit, and then the next bit, etc). I have had a few issues of this kind with ousearch over the years.

If it's not clear why this is the case, imagine a document 'Model answers to Assignment 1' that has been restricted (via grouping, separate groups perhaps in a 'Teachers' group, or the role system) so that only teachers can access it.

Lucene is clearly a much better search engine (as I noted, ousearch is in no way comparable to any real search engine) and I'm sure you can resolve the Moodle security model issues (probably in a similar way to ousearch where it first figures out within Moodle which course-modules and groups etc you can access, then passes that information on to the search engine when doing a query).

The only fundamental disadvantage is the infrastructure one (requires additional server infrastructure). I'm sure this would not be a problem if we used it at our site, for example. We'd be happy to have a better search solution.

In my personal opinion though, asynchronous updates are not suitable for searching dynamic content (well, if by 'asynchronous' we mean a minute or so, then it would be fine, but if we're talking moodle-cron-like times then it would be questionable, and if we're talking crawler-like 'eh, maybe overnight' times then it would be downright bad). Also, with the current cron architecture, loading additional heavyweight tasks onto cron is a bad idea. I'd encourage you to consider a synchronous approach to integrating Lucene if you can.

One other potential issue is system scalability such as when you operate a system with six web servers (that's our current configuration). For ousearch, this is simple: it's all in the database, so there are no additional considerations as whatever works for the rest of the database is what you'll get for ousearch.

If Lucene indexes are stored in the filesystem, what happens when two servers simultaneously try to update the index? Or would it be configured as a daemon webservice (similar to a database server, but for search) that runs on its own server and is 'called into' from the Moodle server? If the latter, that could be fine (in fact better than fine, really great), but can it be configured as a cluster of 2 servers, can we use failover...? Etc.

--sam

In reply to Tomasz Muras

Re: Global Search rewrite

by Martin Dougiamas -

Tomasz, I have to say that we had already been looking very closely at ousearch as a replacement for the current search (which will be deleted in 2.2 as it's non-functional and beyond simple fixes).

Happy if you do want to work on something suitable for core but it must be well written and sustainable (unlike the old one). 

One of the things I liked about ousearch is that it stores a lot of the required access information in the index, so that you are never shown things you don't have access to. I think this is a must.

I also liked the fact that things were available to be searched immediately - this is expected these days.

I don't care about the actual solution as long as it satisfies these AND runs without additional server software (PHP only).

In reply to Tomasz Muras

Re: Global Search rewrite

by rachael harkin -

Hi guys, I came across this topic - it's similar to something I need to achieve. First off, I want to thank Tim for his help before on my rewriting of the course list page. I have managed to do it successfully, but only on the combo list view, since that seems to be the only place on the frontpage that uses the core_course_renderer class. Anyway, I'll write a post later on how I did it for others who wish to know. What I need to know now is whether it is possible to manipulate the search results that get sent back, without modifying core files, to call a custom function. I can't see anywhere that the search operates like the combo list view on the frontpage, i.e. using the renderer class. Any thoughts?

In reply to Tomasz Muras

Re: Global Search rewrite

by Marcus Green -

On the Wiki page you say this system will use the Lucene search engine which appears to be Java based. This may present a challenge to people wanting to use Moodle on shared hosting space.

In reply to Marcus Green

Re: Global Search rewrite

by Tomasz Muras -

Hi Marcus,

No worries - only the original implementation is Java based; Zend provides a PHP-only library that handles the Lucene index (which is binary compatible with the original one). Not a single Java library is needed.

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Eloy Lafuente (stronk7) -

Hi,

Some years ago I was experimenting with Lucene PHP by indexing the moodle.org forums. In the end it was abandoned as a failed experiment, because the memory usage (on index rebuild, happening after xxx documents are added) turned out to be 1:1 related to the size of the index being created.

And that, after indexing some hundreds of thousands of documents, became completely intolerable.

Since then, new releases have happened, perhaps fixing that horrible memory-constraint problem (afaik the Java implementation is free from it).

Just for the record, that should be checked before any further dev IMO, ciao smile

In reply to Tomasz Muras

Re: Global Search rewrite

by Tomasz Muras -

Maybe we should first agree on what we want from a feature called Global Search in Moodle. Let me summarize what I've heard so far. I will also add my own requirements - please comment if you don't agree with any, or add more! I would really like us to have a proper search solution in place for Moodle 2.3 - let's make a good decision on this one!

  1. Security must be handled - search should not return any results that are not accessible to the current user.
  2. Security can be handled globally only up to a point - a module must have the final say on whether a document is accessible. Think about more complex modules (like workshop) that can allow access to a document only at a specific time (e.g. only in the peer review phase). Access information can and should be stored in the index for performance reasons, but each module should be responsible for the decision to deny or grant access.
  3. External documents must be indexed and searchable. Think about all the docs uploaded into Moodle installations as resources.
  4. The search index must be updated in near real time (say with a few minutes' delay?).
  5. (I think that) notifications should be handled using Moodle events. There is no point in implementing a separate notification system for Search - if our existing events are not usable then let's fix them or drop them.
  6. The solution must scale up. Let's put down some numbers we can use for testing - 1m documents, averaging 100 words each (so 100m words in total, 1m unique)?
  7. Initial indexing must be able to stop and pick up where it left off, so when a big site enables search, initial indexing can be done in chunks over days (if needed).
  8. Search results must allow ordering by relevancy. This is a big topic in itself, but for example searching for "dog cats" should list first the documents that contain both words and then those with only one of them. Searching for "dog" should rank documents that contain the word "dog" in several places as more relevant than those that contain it only once.
  9. At least basic queries must be implemented, allowing ANDs and ORs of the terms.
  10. Advanced queries should be implemented to allow for at least:
  • grouping of ANDs and ORs (e.g. word1 AND (word2 OR word3))
  • returning results that do not contain a keyword
  • stemming (searching for "car" will return results with "cars")
  • wildcards (searching super* will find superman and superwoman)
  • attribute search (find XX in title)

There is much more than the above - proximity, fuzzy searching, case sensitivity, regular expressions, sequence matching, etc. - but I think the above are the most important features.

11. Solution must not require additional server software (should be PHP only).

 

The solution is really two things here: the Moodle Search API (basically, what modules need to implement to support being searchable) and the Search Engine (which could be based on custom code & DB, Lucene, Sphinx or any other search engine). If we design a good Search API we could even make Search Engines pluggable. I don't want to make this post any longer, so I'll leave it for now - please focus on what we would like to see from Search in Moodle.

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Hubert Chathi -

Minor point: as far as the requirements list goes, I would re-write point #3 as: "searching of external documents should be supported". External documents could be referenced in any number of ways within Moodle. I don't think it should be the global search engine's job to take care of everything, but it should allow plugins to tell it about external documents.

And +1 to making the search engines pluggable (if possible).

In reply to Tomasz Muras

Re: Global Search rewrite

by sam marshall -

I agree with these except parts of 9, 10. IMO search engines should always do AND by default, OR is just a 'would you like to get some irrelevant results?' feature - whenever supported, it's intensely annoying. (For example I forget what situation it was recently but iirc Amazon do it - I was searching for something like, say, [made up] convection currents, and it was showing me crap about currents and about convection and I was looking through several useless results before I noticed that the 'real' results had run out.)

I think negative words, OR support, stemming, wildcards, attributes search are not really requirements - they're nice-to-haves. (And in the case of OR support, if it is on by default, that's worse than not having it at all. smile Just imo.)

Oh, except... in terms of the usual 'title' example, attribute search is generally pretty useless, but we did have a lot of demand from users to add support for searching by author name and by date range in forums. (ousearch doesn't support this; we did it independently in ForumNG by running the ousearch query first then filtering.) I guess this generally comes under the heading of attribute search but not necessarily text attributes and hopefully not the sort of thing that users are expected to actually type, i.e. there's a dropdown or something on the advanced search page.

It might be worth thinking about which attributes people want to search on for each type of commonly-used activity, and how. Author is a pretty general one, but for example it could be complex for wiki (obviously pages have multiple authors). Date is pretty general too. Possibly there might be a need for some kind of per-activity way to specify which custom attributes an activity offers; so, for instance, there is a way to do an advanced search of all wikis on the course, and this might give you different options from searching all Pages on the course. Sometimes it could be a trivial difference like 'Date edited' vs 'Date posted' text, or a different set of supported attributes.

And on the topic of nice-to-haves, I think phrase search is pretty important. That's both in the 'required' sense (where if you search for "frog zombies" in quotes it will only return results where that appears as a phrase, and not ones that just include the two words frog and zombies somewhere in the document) and would also be nice in the scoring/relevancy sense (where if you just search for frog zombies without quotes, it will actually score results higher if they contain the phrase anyway). OU search doesn't do the latter, for performance reasons, which is a bit sucky...

Regarding #11, this should apply to the default solution for sure, but plugins should be able to make use of external systems if possible (i.e. can we define the interface in such a way that the 'pluggable' part does not need to access any Moodle database tables - this is possible while still handling security restrictions largely in the external system and partly back in the Moodle modules).

It would actually be a significant benefit for our site and probably other large sites if we could put search on a separate server and database; for example, a web service using the Java version of Lucene. Partly because I hear the PHP version is problematic, but more fundamentally because Moodle database is hard to scale and if we can dump one 38GB table into a totally separate database so that queries against it don't affect the rest, that would be a big win. For example if we have performance problems with search and certain unusual kinds of searches end up taking a long time for whatever reason, it would be preferable if that only slowed down other searches, and not the entire system.

--sam

In reply to sam marshall

Re: Global Search rewrite

by Tomasz Muras -

Thanks Sam, agreed - I've put everything so far (I think, feel free to edit) to the wiki page.

My current thinking is exactly the same: we should make a generic interface so that the search engine itself can be decoupled. For the out-of-the-box implementation, I would go with Lucene PHP unless there are showstoppers (e.g. performance). In that case I would still rather try to fix the PHP Lucene issue and submit it to Zend than try to re-implement a search engine from scratch. I can already say that indexing a lot of documents will not be a problem, as you can re-start indexing with PHP Lucene and merge indexes. It also handles locking using flock() - but either way, I'd start by testing PHP Lucene a bit more upfront.
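
For anyone who hasn't used it, the Zend_Search_Lucene calls involved look roughly like this (just a sketch, not the eventual Moodle wrapper; the index path is made up):

require_once 'Zend/Search/Lucene.php';

// create() builds a new index; open() resumes an existing one, which is
// what makes re-startable indexing and index merging possible.
$index = Zend_Search_Lucene::create('/path/to/moodledata/search');

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'Example forum post'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', 'Body text to index'));
$index->addDocument($doc);
$index->commit();

// Querying uses the Lucene query syntax; hits come back ordered by relevance.
$hits = $index->find('title:example');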

For bigger installations, one can use Solr, which is Lucene based and gives you a full search platform. The idea is that the same interface will work with both systems; you will just need another "plugin" to send the data to your Solr instance instead of PHP Lucene.

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by sam marshall -

Great.

Regarding performance, I think most Moodle installations are small, and there could be an expectation that larger installations will set up an external search system (or else not enable search) if necessary. As long as the default internal one is not too bad, and as long as it's documented clearly on the screen where you enable search that you might want to think about this first, of course...

Re the use of flock, again this might make the internal PHP version less suitable for a large system: a large system needs multiple front-end webservers connecting to the same moodledata filesystem area via NFS, and afaik flock does not work reliably over NFS, so indexes might get corrupted. There might be ways around this with configuration at the load balancer level or somesuch, so that write operations always occur on the same server, etc., though.

And yes I agree Solr looks like an excellent possibility for an external platform. So in other words - pluggable search +1000 from me ;)

--sam

PS The OU has been working on a UI for search (which integrates ousearch for dynamic activities, and our standard enterprise search crawler for static content). Jenny's in charge of it. The new UI is sort of neat. I wonder whether it would be worth pestering her to do a screencast or something... maybe after our current crunch period...

In reply to sam marshall

Re: Global Search rewrite

by Tomasz Muras -

I wouldn't worry about flock too much. If you have an installation that size, then you'll need to look for a "proper" search platform like Solr or Sphinx. If really required, it's not impossible to replace flock locking with some other mechanism.

The UI is actually my biggest concern - I'm a rather bad designer, so I'm afraid the UI I create won't look very cool - I would really appreciate any help here, even if just screenshots.

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Tomasz Muras -

OK, let's start with the most important bit - the API that a module will need to implement to produce searchable content. Again, this way the actual search engine can be decoupled - we can have different implementations (from OU's DB-based search and PHP Lucene to Solr and Sphinx).

I would start with the simplest possible API that should allow for most of the features.

1. We need a way to do the initial indexing that will allow for stopping and re-starting indexing at any point, so:

mod_gs_iterator($from=0)

This will return a Moodle recordset that allows iterating over the available "document set" IDs.

2. We need a way to get a list of documents from a module. The $id below is as returned from mod_gs_iterator. The idea here is that for a single $id (which could be a forum post id) a module may return zero or more documents (e.g. a forum post and several file attachments). A single document is an object that contains this data. Each module will need to make sure to provide the relevant information; e.g. if a wiki page was edited by several authors, a module may put only the initial author into the $user field (or we decide to make this field accept multiple values).

mod_gs_get_documents($id)

3. We need a way for a module to decide whether access should be granted to a user. We also need to handle deletion of documents, so:

mod_page_gs_access($id)

This would return one of the values: GS_ACCESS_GRANTED, GS_ACCESS_DENIED or GS_ACCESS_DELETED.

And that's it (for basic support)! Without even implementing events (the above would just require a module to declare support for a new FEATURE), this would already allow for near real-time updates if you wish (search would need to pull information using mod_gs_iterator all the time - not ideal, but it would work). Of course there would still be a lot of work to do to make it performant - e.g. it would be too slow to call mod_page_gs_access for every possible document in a search result, so some logic would be needed to filter out the obviously "not allowed" documents (in a similar way to how OU search does it, I believe). Caching would need to be implemented, etc. - but the most important thing for me is to agree on this API.
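
To make the shape of this concrete, here is roughly how mod_page could implement the three functions (a sketch only - exact function prefixes, field names and the GS_ACCESS_* handling are still up for discussion):

function page_gs_iterator($from = 0) {
    global $DB;

    // Oldest modification first, so indexing can stop at any point and
    // later resume from the last timestamp it processed.
    $sql = "SELECT id, timemodified AS modified
              FROM {page}
             WHERE timemodified > ?
          ORDER BY timemodified ASC";
    return $DB->get_recordset_sql($sql, array($from));
}

function page_gs_get_documents($id) {
    global $DB;

    $page = $DB->get_record('page', array('id' => $id), '*', MUST_EXIST);
    $doc = new stdClass();
    $doc->title    = $page->name;
    $doc->content  = $page->content;
    $doc->modified = $page->timemodified;
    // A single $id may produce several documents (e.g. attachments).
    return array($doc);
}

function page_gs_access($id) {
    global $DB;

    if (!$DB->record_exists('page', array('id' => $id))) {
        return GS_ACCESS_DELETED; // Hint: the engine may drop it from the index.
    }
    // Real module-specific checks (course visibility etc.) would go here.
    return GS_ACCESS_GRANTED;
}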

One problem that this basic API will not solve is course restores, where module data has timestamps from the past - that content would never be indexed, and I think an event would be most suitable here (that could be part of the "advanced support").

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Tim Hunt -

Yuck!

You are exposing a huge number of implementation details in that API (recordsets; things that must be identified only by an int id; you seem to be calling the same thing both a page and a document; overloading the same function for the access check and the is-deleted check).

Plus, using an abbreviation like gs in the names is just ugly. Let's use the word search. There is a core component core_search, in the search folder.

We can, of course compare this to the ousearch API: https://github.com/moodleou/moodle-local_ousearch/blob/master/doc/usage.html, which I have to say is much nicer. (I have never worked on, nor used ousearch.)

 

I suppose I should not criticise without offering something better.

The system will be called search (as I said above) and it will search documents.

One thing that seems nice about the ousearch API is the local_ousearch_document class, that seems (from the implementation of oublog_get_search_document) to hold metadata about a particular document - which part of Moodle it belongs to, the necessary ids, and so on. So we should have a core_search_document class.

(Looking further, at the oublog_search_update example function, I don't like the $doc->update() part of the API. I think that core_search_document should be a fairly plain data-transfer object.)

So, we have a core_search_document that represents a document, a bit like the way the context object represents a context. (I am not sure whether we want the core_ prefix in the class name or not. Perhaps search_document is nicer, and sufficiently unlikely to cause name conflicts.) This object should have these fields:

  • component
  • contextid (replaces cmid and courseid from sam's design)
  • documenttype (replaces stringref from sam's design) 
  • itemid (replaces intref from sam's design)
  • timemodified
  • timeexpires
  • groupid
  • userid

With those first 4 fields, I hope you can see that I am roughly copying the file API, which I think is a good thing. (Of course the file API does not have an object to represent a file area; instead you have to pass 4 arguments to each method, which really sucks.)

I am not really sure if this class should have the capability to store the content, title and url when required. That would be useful in some, but not all of the API.

In OU search, sam allowed two 'itemid' fields. I don't know why that is necessary; it is not necessary in the file API. If it is necessary, add a new field subitemid. Actually, there is one file area for the attachments to a forum post, and one post might contain several attachments, so there might be a need for subitemid.
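
As a data-transfer object, that is no more than this (a sketch; nothing here is settled):

// Plain DTO, by analogy with the file API's identifying fields.
class search_document {
    public $component;     // e.g. 'mod_forum'
    public $contextid;     // replaces cmid and courseid
    public $documenttype;  // replaces stringref
    public $itemid;        // e.g. a forum post id
    public $subitemid;     // optional, e.g. one attachment of a post
    public $timemodified;
    public $timeexpires;
    public $groupid;       // null unless restricted to a group
    public $userid;        // null unless restricted to a user

    // Possibly also completed later by the module:
    public $title;
    public $content;
    public $url;
}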

OK, so once we have that class, we can deal with part 3) of your API. We need functions:

mod_..._search_document_gone(search_document $doc) // Returns true if the doc no longer exists.
mod_..._search_document_accessible(search_document $doc, $user = $USER)

For loading the content (e.g. to display some search results), you need a

mod_..._search_document_get_contents(search_document $doc)

That takes a doc initialised with just the fields in the bullet points above, and adds the title, content, url fields, etc.

For efficiency, it might also be worth implementing a bulk version of that API, which takes an array of $docs to complete.

 

Then, the hard bit is indexing. You seem determined to build an asynchronous, pull API, even though both of those decisions make the problem harder.

One option would be to hide the async part. The API could look like

get_search_engine()->add(search_document $doc);

get_search_engine()->add_all(Iterable $docs);

and, behind the scenes, that could just add the documents to a queue, if you don't want to process them immediately.

add_all(Iterable $docs) allows you to pass a plain array($doc1, $doc2), but also any other class that implements the PHP API. That way the search system could provide an adaptor, core_search_docs_list_from_rs, so you could do

get_search_engine()->add_all(new core_search_docs_list_from_rs($rswiththerightfields));
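
Fleshed out, the queue-behind-the-scenes idea might look like this (the queue table and exact class name are made up for illustration):

class core_search_engine {
    public function add(search_document $doc) {
        global $DB;
        // Behind the scenes: just queue the document. Cron (or an
        // immediate-mode engine) does the actual index update.
        $DB->insert_record('search_queue', (object) array(
            'component'  => $doc->component,
            'contextid'  => $doc->contextid,
            'itemid'     => $doc->itemid,
            'timequeued' => time(),
        ));
    }

    // Accepts a plain array or anything implementing Iterator.
    public function add_all($docs) {
        foreach ($docs as $doc) {
            $this->add($doc);
        }
    }
}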

In reply to Tim Hunt

Re: Global Search rewrite

by Tim Hunt -

1. Oops, I meant Iterator, not Iterable.

2. Also, I did not really address the question of how to handle initial indexing. Again, the best approach would be to avoid the problem. Just get modules to implement

mod_..._search_reindex($cmid);

Then find the owner of the biggest, most hairy forum you can (I suggest the General Problems forum here) and see how long indexing that takes. If it is less than a few minutes, then we don't really have a problem.

3. For indexing files, we should add an extra field that can be used as well as, or instead of, ->content. Say ->filearea. (Are we going to index individual files, or the entire contents of a file area? I think there is quite a lot of merit in the latter.)

4. Another thought experiment worth doing to validate the API: various bits of Moodle now use the Comments API. How easy would it be to get the comments into the search index?

In reply to Tim Hunt

Re: Global Search rewrite

by sam marshall -

A few points (related to both Tomasz and Tim's comments). I thought the design wasn't bad. smile

A) Stringref in ousearch is not a documenttype; in ousearch you can choose to identify documents by a string. This was basically done for wiki, which uses the page name as its canonical identifier. It was probably a bad idea (even wiki does have a numeric id for pages too), so I suggest not including it. If you're trying to mimic the file system, though, you may need a 'filearea' equivalent for cases where a plugin offers more than one type of searchable content.

B) Regarding initial indexing - this is a problem that ousearch doesn't really deal with, because we haven't had to (we ensure that ousearch is in use at basically the same time as the modules it searches, so there is little or no initial-index step). So ousearch only has the update_all function. Regarding the API with the $from, which I think is intended to be a date and therefore won't work for restored courses as mentioned, I think the simple solution would be to make $from an id (the system has to keep track of the id).

C) The transparent queuing approach (where you call an 'add-this-document-to-search' function and, depending on the search system in use, it may add to the index immediately or add to a queue to be processed on cron) seems fine when the system is in general use, but I don't think it solves the initial-index problem; it would be inefficient (and might even take a long time) just to add the 'search document' rows to the queue table for every forum post, say - we're looking at millions of rows there.

D) So the initial search update should basically be done in cron over some days/weeks. I think the best way to handle this is probably:

1) A time limit option, set as an admin setting: 'Minutes spent initialising search index per cron', e.g. default 5 minutes.

2) Automatic control so that the search UI (fields, buttons, etc.) does not display until the initial index update is complete! Maybe done at a per-plugin level (see below).

3) In cron, loop through until the time is up, using an API something like

modulename_search_init_get_next_document($initid, &$nextinitid)

which returns a document; you can then call get_documents or whatever to obtain the actual data and index it as normal. In the search system you have to store the $nextinitid value so that the next time you call the function you can pass it in. Initially call it with 0. It returns false when there are no more documents.

Note that this API is less efficient because it's not a recordset (so typically one query per search document, but the effort involved has got to be a lot less than actually getting the document content and indexing it, so it probably doesn't matter).

4) Some other function, like modulename_search_init_count_remaining_documents(), that returns the number of documents remaining to be initialised. This can be used to display a report on the admin settings screen during the initialise process - a table of modules/plugins showing e.g. 3,940 documents indexed and 248,203 remaining.

5) All this needs to store is a list of which plugins have been done and, for the current in-progress plugin, the current value of the (opaque) $initid...
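
In pseudo-ish PHP, the cron side of 1), 3) and 5) might look like this (all names invented; search_index_document() stands in for whatever does the real indexing work):

function search_run_initial_indexing($plugin, $timelimitsecs) {
    // 5) The only state stored: an opaque per-plugin cursor, initially 0.
    $initid = (int) get_config('search', 'initid_' . $plugin);
    $stop = time() + $timelimitsecs; // 1) admin-configured time limit
    $nextinitid = null;

    $fn = $plugin . '_search_init_get_next_document';
    while (time() < $stop) {
        $doc = $fn($initid, $nextinitid);
        if ($doc === false) {
            return true; // No more documents; this plugin is fully indexed.
        }
        // 3) Fetch the content via get_documents etc. and index as normal.
        search_index_document($doc);
        $initid = $nextinitid;
        set_config('initid_' . $plugin, $initid, 'search');
    }
    return false; // Out of time; the next cron resumes from $initid.
}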

E) Just a note re the meaning of 'userid' in ousearch, if anyone's looking at it: this does not mean content created by a particular user (ousearch does not store that at all); it means content restricted to a particular user, in a similar way that groupid means content that (may be) restricted to a particular group. This is used for the 'individual' feature we have in ouwiki and oublog. Normally it's null. The same can be true of groupid.

F) I think it's going to be important in a design for a system like this to think fairly soon about both APIs: the one between modules/plugins and the Moodle search system, and the one between the Moodle search system and the search engine plugin. Author name is a problem, for instance. Are you going to pass the user's full name (as text) to the search plugin along with the document so it can be searched, or just the id (possibly an array of ids if it supports multiple authors)? The former might seem convenient but won't work properly if users change their name, so I think the latter is probably better.

--sam

In reply to sam marshall

Re: Global Search rewrite

by Tomasz Muras -

Sam,

You are correct about the problem for restored courses here:

Regarding the API with the $from, which I think is intended to be a date and therefore won't work for restored courses as mentioned, I think the simple solution would be to make $from into an id (system has to keep track of the id).

Unfortunately an $id is not a solution here either - because a module can allow content with any $id to be updated, so even if your latest $id is 100, your newest document may have $id=50, because a user has just updated it and you need to re-index it. My plan to solve this problem is to add a course_restored event and a function mod_get_ids_for_course. I would put that into the "advanced" support for search, to make it as easy as possible for modules to start supporting the search API.

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Hubert Chathi -

Maybe you want modules to have a say in how an item shows up in the search results? Also, you may want to think about how global search fits in with the often-requested anonymity features (e.g. anonymous forum posts) - e.g. if a non-privileged user searches for a post's author, their "anonymous" forum posts should not show up.

In reply to Hubert Chathi

Re: Global Search rewrite

by Tomasz Muras -

I'll try to cover all the points from Tim, Sam and Hubert.

1. ID as int. I didn't imply anywhere that the ID must be an integer - it's PHP, you can use whatever you want in $id. I don't see a point in using non-integers for IDs (it doesn't seem sane to me), but I don't see the point in enforcing it either.

2. Access and deleted functions. I merged them on purpose because in reality it's a very similar check: to check whether document X is accessible, you need to check whether it exists at all. Returning DELETED is nearly the same as returning DENIED - it's just that the search engine may use this extra hint for optimization (removing the document from the index). It also removes the need for an extra event (document_deleted).

3. Prefix name. I had used search before, but I think it's too ambiguous in Moodle. You can already "search" for users, you have forum search, etc. "Global search" seemed like a better name for the general search, and gs is just shorter. Besides that, I couldn't care less - we can call it search.

4. get_search_document function. I think this is not needed - if you store the information in the index, there is no need to get the full document from the database again: you have all the necessary information in the search index itself. Think about what happens if you have to retrieve each search result from the database after the search:

* you will generate additional DB queries - at least as many as there are search results per page

* you will make it impossible (from the performance point of view) to get grouping information about the search results if that information is not in the search index. See the sample rian query and look at the boxes on the right-hand side with numbers (the number of search results per attribute). If this information is not in the search index, then you would have to fetch all the records from the DB and calculate the numbers for them.

5. search_document class. I agree, it should be used just for holding the information. I would include all the fields you suggested plus the ones I've listed. The more we have, the more we can get out of the search engine. Most of the fields would be optional anyway (some won't make sense if you want to search for users, for example); also, the actual implementation of the search engine can ignore some of them - no problem.

6. Initial indexing. It will be a problem for bigger installations - I'm saying this from experience. Indexing forum posts may be simple, but think about indexing thousands of Word documents - just parsing them will take a lot of time. We need a way to re-start initial indexing, so we need something like mod_gs_iterator($from=0).

7. mod_gs_iterator($from=0) returning a Moodle recordset. This seems to me like a sensible option - you can't return all the results, you need to return an iterator. But you're right, it doesn't need to be a moodle_recordset - maybe sometimes the documents are not returned from a DB but are generated somehow. We could make this function return anything implementing the Iterator interface.

8. The same class for a page & a document. Why not? In the end everything will be a document for the search engine, no matter whether it came from a forum post or an attached document. To make it nicer we can use inheritance here, though.

9. Synchronous vs asynchronous & push vs pull. I'm not determined to build an asynchronous, pull API - I'm determined to build both. We need to solve the problem of initial indexing (in the way Sam has described - I will document it on the wiki instead of making this post even longer), so you need something like mod_gs_iterator($from=0). And once you have it, you simply don't need to push the documents to be indexed into a queue to get synchronous push! All you need is a Moodle event. The way it would work:

* the search engine has just finished indexing forum posts; the timestamp of the last forum post indexed is 100

* a new forum post is added; the forum module implements an event and triggers it (post_created)

* the search engine runs (asynchronously, as a response to the event) and indexes all forum posts with timestamp > 100
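
As a sketch, the search side of that flow could be as small as this (the event wiring, the forum_gs_get_documents() counterpart of mod_gs_get_documents, and the search_index_document() engine call are hypothetical):

// Registered (e.g. in db/events.php) as a handler for 'post_created'.
// Note: no forum-specific knowledge here beyond the event name.
function search_post_created_handler($eventdata) {
    // Timestamp of the last forum post indexed (100 in the example above).
    $last = (int) get_config('search', 'lastindexed_mod_forum');

    // Re-use the pull API: index everything newer than the stored mark.
    $rs = forum_gs_iterator($last);
    foreach ($rs as $record) {
        foreach (forum_gs_get_documents($record->id) as $doc) {
            search_index_document($doc);
        }
        set_config('lastindexed_mod_forum', $record->modified, 'search');
    }
    $rs->close();
}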

OK, I think it got too long - I'll document more on the wiki, cheers guys.

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Tim Hunt -

6. Each document is likely to be a different resource, so a different $cm. You have not yet proved that we need to break down into any finer granularity than $cm-sized chunks for initial indexing, and the number of $cms is core Moodle knowledge. You don't need a new API for that.

In reply to Tim Hunt

Re: Global Search rewrite

by Tomasz Muras -

Tim,

I don't get what you're proposing here - what do you suggest initial indexing should look like, to allow re-starting it? Could you describe it?

Why do you keep referring to $cm? I think we should forget about it - most modules will use $cm, but in other contexts it doesn't exist - think about indexing users and their profile information. The Search Engine should not require or care about $cm; it simply gets an ID from a module and uses it. The unique ID for the search engine would be <module_name>:<id>. The Search Engine doesn't need to know or care about the meaning of the id. $cm, $courseid or other information may, however, be used by a search engine for optimization (e.g. to filter out all the results for courses where the user does not have access).

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Tim Hunt -

I am using $cm as a convenient short-hand.

Probably the biggest things to index in Moodle are single large module instances. If it is sufficient to break down the job of indexing an entire Moodle site into chunks the size of one $cm or smaller, then we don't need to invent a special API that all plugins have to implement just to handle the initial indexing task. The search engine code can deal with iterating through all the indexable things in the site by tracking its progress through known existing structures.

I think that once a particular activity (or whatever) has been indexed, we should use a push API from the activity to keep the index updated. That is, the module should notify the search system of what needs to be re-indexed when it is changed, like with the ousearch API (and, as we discussed earlier, the search API can either process it instantly or queue it.)

In reply to Tim Hunt

Re: Global Search rewrite

by Tomasz Muras -

Agreed - that's (nearly) exactly what I'm proposing here. The Search Engine will index in chunks, based on (say) logic similar to the stats generator - e.g. it will index for 1h and then stop.

You just need to get that iterator from a module, as only the module/plugin can tell you what an indexable document is and how to iterate over them. Also, only a module will be able to properly keep track of "what should be indexed next". Think about a document (whatever it is) being updated after it was indexed once, but before your initial indexing is done.

In reply to Tomasz Muras

Re: Global Search rewrite

by Tim Hunt -

Surely, for other reasons (e.g. backup and restore), there will already have to be a function like

mod_forum_add_to_search_index($cm)

that, for a given forum instance, uses get_search_engine()->add_all() (or whatever) to add all the content from that forum to the search index.

If so, and if 'one forum at a time' is sufficiently granular, then we don't need any other API to implement the initial indexing.

In reply to Tim Hunt

Re: Global Search rewrite

by Tomasz Muras -

I will try to avoid having mod_forum_add_to_search_index($id) and get_search_engine()->add_all() altogether, if at all possible. The restore situation is an exception and I may handle it as such - in some special way. Using only something like ->add_all() creates more API - for initial indexing you need something to iterate over all the content stored so far (e.g. iterate over all forums, add a forum, add documents from a forum). I also don't like the idea of queuing all documents to be added - for initial indexing it may mean queuing millions of documents. Iterating over the last-modified timestamp is very clean: you just need to remember where you stopped.

The important bit here is that I would like to use Moodle events instead of a search-specific API that modules would need to implement. So, when a new forum post is created, I would like the forum mod to trigger a post_created (or similar) event, not to call get_search_engine()->add_new_document(). The search code would register as a listener to all events like that. This way we would kill two birds with one stone: get synchronous updates for search, and get a new event that may then be used by any other code. It would be the same for backups: a backup_restored event would be fired.
I think events would be very useful in Moodle, and it's a shame we've implemented so few of them (and that we don't use them).

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Tim Hunt -

Perhaps part of the reason we are disagreeing is that we have not agreed on the requirements. With this API there are two groups of people involved:

  1. All the contributors who maintain all the plugins that will need to feed documents to the search system (including me); and
  2. the people who build and maintain the search system (you).

I think the requirement is to make it as simple as possible for group 1, and I don't care how difficult that makes life for group 2. (I try to take a similar attitude when building the question engine: it is the people who write question types who are important.)

 

And you seem to be asking plugin developers to implement some sort of asynchronous iterator, and that is hard and yucky.

By the way, sam explained above why iteration based on timestamp will not work. Did you get that? I can explain in more detail if you like.

Also, I did not say "queue all the documents in a Moodle site, then work through the queue" as a solution for initial indexing. My proposal is that the search system manages this process (a bit at a time, on cron):

  1. Mark all contexts as requiring indexing.
  2. For the first contextid that requires indexing, queue all the documents from that contextid.
  3. Process the queue until it is empty.
  4. If there are more unindexed contexts, go to 2.

Typically, step 2 will queue a few tens, hundreds or thousands of documents. That is all.

I think that for plugin authors 'Tell me all your documents from this contextid' is conceptually very simple to think about, and therefore easy to implement.
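
To show how little a plugin author would have to write, something like this would do for forum (names invented; search_document and the engine's add() are as sketched earlier):

// 'Tell me all your documents from this contextid.'
function mod_forum_search_index_context($contextid, $engine) {
    global $DB;

    $context = get_context_instance_by_id($contextid);
    $cm = get_coursemodule_from_id('forum', $context->instanceid);

    $sql = "SELECT p.id, p.subject, p.message, p.modified
              FROM {forum_posts} p
              JOIN {forum_discussions} d ON d.id = p.discussion
             WHERE d.forum = ?";
    $rs = $DB->get_recordset_sql($sql, array($cm->instance));
    foreach ($rs as $post) {
        $doc = new search_document();
        $doc->component    = 'mod_forum';
        $doc->contextid    = $contextid;
        $doc->itemid       = $post->id;
        $doc->title        = $post->subject;
        $doc->content      = $post->message;
        $doc->timemodified = $post->modified;
        $engine->add($doc); // queued or immediate - the engine's choice
    }
    $rs->close();
}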

 

In terms of handling updates, how will the search system know what index updates are required by a post_created event? That is impossible unless you put a lot of forum-specific knowledge into the search engine, and if you do that, third-party plugins can't be searchable.

You have to have the plugins translating the changes in their data into CRUD actions on search_documents.

Again, for me, it seems conceptually much simpler if the plugin author has to write a few simple, synchronous search API calls in with the back-end code. For example, when a user makes a forum post:

https://github.com/moodle/moodle/blob/master/mod/forum/post.php#L661

There is add_to_log, updates of completion status, and pushing files to the files API. Why not one line to push the new data to the search system too? Forcing the search update code to be somewhere else just makes it harder to find and maintain. (Mind you, that post-to-forum code is pretty old and messy. Roll on ForumNG.)

 

Also, in the situation where we are still doing initial indexing, so half the contexts are indexed and half not, it is easy for the search system to just ignore any update requests that affect contexts that have not been indexed yet, while recording the changes to things that are already in the index.

 

Anyway, I am sitting here thinking "How easy would it be for me to use the global search stuff in the question bank and the quiz", and that is why I don't like your API.

In reply to Tim Hunt

Re: Global Search rewrite

by Tomasz Muras -

Perhaps part of the reason we are disagreeing is that we have not agreed on the requirements. With this API there are two groups of people involved:

    All the contributors who maintain all the plugins that will need to feed documents to the search system (including me); and
    the people who build and maintain the search system (you).

I think the requirement is to make it as simple as possible for group 1, and I don't care how difficult that makes life for group 2. (I try to take a similar attitude when building the question engine: it is the people who write question types who are important.)

I agree, and in fact my goal is to make it as simple as possible for module creators. I've simplified the API to only those 3 functions for this exact reason. I thought this requirement went without saying, but to make it official I've added it to the wiki.
 

And you seem to be asking plugin developers to implement some sort of asynchronous iterator, and that is hard and yucky.

See the actual implementation for forum - I'm not sure how it could be any easier than this:

function forum_gs_iterator($from = 0) {
  global $DB;

  $sql = "SELECT id, modified FROM {forum_posts} WHERE modified > ? ORDER BY modified ASC";

  return $DB->get_recordset_sql($sql, array($from));
}

The implementation for your questions would be:

function question_gs_iterator($from = 0) {
  global $DB;

  $sql = "SELECT id, timemodified AS modified FROM {question} WHERE timemodified > ? ORDER BY timemodified ASC";

  return $DB->get_recordset_sql($sql, array($from));
}


By the way, sam explained above why iteration based on timestamp will not work. Did you get that? I can explain in more detail if you like.

That is correct - this iterator will not be enough for restored backups. Do you have more examples of what else may not work?

There is no need to queue documents one by one (which is what I'm trying to avoid here as much as I can - and it would be required otherwise - please read on), as this can be solved in at least two other ways:

1. By doing full re-indexing.
2. By adding another, id iterator:

SELECT id FROM <table> WHERE id > ? ORDER BY id ASC

It could be another API function, but I would make it optional to implement - handling course restores with data would probably cover < 1% of the use cases. But anyway - as long as a module always generates incrementing IDs, the problem is solved.


Also, I did not say "queue all the documents in a Moodle site, then work through the queue" as a solution for initial indexing. My proposal is that the search system manages this process (a bit at a time, on cron):

    Mark all contexts as requiring indexing.
    For the first contextid that requires indexing, queue all the documents from that contextid.
    Process the queue until it is empty.
    If there are more unindexed contexts, go to 2.

Typically, step 2 will queue a few tens, hundreds or thousands of documents. That is all.

I think that for plugin authors 'Tell me all your documents from this contextid' is conceptually very simple to think about, and therefore easy to implement.

Here are the problems with this idea:

1. What is a context? Think bigger than just a module - what would the context be for a list of Moodle users? Either way, by context you mean some sort of grouping.

2. By marking you mean queuing - so you need to implement a queue here, which may not be necessary at all.

3. One "context" from the queue would need to be processed in one run. How will you guarantee that context is not too big? If 1 context = 1 forum then what happens if the forum is too big to be indexed at once? The indexed will never finish. Processing DB entires is quicks but think about a forum with a lot of word documents attached.

4. What happens when a forum post is updated? You will need to schedule it in another queue for a single document - so we have 2 queues already (read on for why you have to queue it).


In terms of handling updates, how will the search system know what index updates are required by a post_created event? That is impossible unless you put a lot of forum-specific knowledge into the search engine, and if you do that, third-party plugins can't be searchable.

You have to have the plugins translating the changes in their data into CRUD actions on search_documents.

See point 9 from my previous post - it can be done without any forum-specific knowledge (except for the name of the event to listen to).


Again, for me, it seems conceptually much simpler if the plugin author has to write a few simple, synchronous search API calls in with the back-end code. For example, when a user makes a forum post:

https://github.com/moodle/moodle/blob/master/mod/forum/post.php#L661

There is add_to_log, updates of completion status, and pushing files to the files API. Why not one line to push the new data to the search system too? Forcing the search update code to be somewhere else just makes it harder to find and maintain. (Mind you, that post-to-forum code is pretty old and messy. Roll on ForumNG.)

Even with a call like this, you still need to queue the document - and that's why I think this call is not needed at all. You may think that adding to the search index is always quick and can be done synchronously, but consider for example a forum post with a big Word document attached. Processing it may take 1 minute, and you wouldn't want a user to wait 1 minute for their post to be added. Or think about PDF - normally extracting text from a PDF is quick, but if you want to do it properly and connect an OCR utility, then you're talking about minutes of processing. Even for a short forum entry, think about using a remote search engine (e.g. Solr) - a web service call must be made to update it, so you are adding 1 second to the user's wait time. On top of this, add problems with locking - if the search index is locked while being updated, then we're escalating the problem. I think that, in general, everything possible should be done asynchronously to make the user's experience as good as possible.
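One way to reconcile the two positions would be for Tim's "one line" to do nothing more than record the document id in a queue table - cheap enough to run synchronously, while cron does the expensive extraction later. A hypothetical sketch (the table and function names are mine):

function search_touch_document($module, $docid) {
  global $DB;

  // Called synchronously from e.g. mod/forum/post.php. It only inserts one
  // row, so the user never waits for text extraction, OCR or a remote
  // engine such as Solr - cron picks the row up later.
  $DB->insert_record('search_queue', (object) array(
      'module'     => $module,
      'docid'      => $docid,
      'timequeued' => time(),
  ));
}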


Also, in the situation where we are still doing initial indexing, so half the contexts are indexed and half not, it is easy for the search system to just ignore any update requests that affect contexts that have not been indexed yet, while recording the changes to things that are already in the index.

That's a good idea and can be useful for any implementation - I will keep it in mind.

Anyway, I am sitting here thinking "How easy would it be for me to use the global search stuff in the question bank and the quiz", and that is why I don't like your API.

No matter what, you have to implement something like "mod_gs_get_documents" and "mod_page_gs_access", so I guess you don't like the iterator again - but its implementation is extremely simple - see above, or the example for quiz:

function quiz_gs_iterator($from = 0) {
  global $DB;

  $sql = "SELECT id, timemodified AS modified FROM {quiz} WHERE timemodified > ? ORDER BY timemodified ASC";

  return $DB->get_recordset_sql($sql, array($from));
}

C'mon - it can't get any simpler than this. Plus, for the core modules, I will write the API myself and send it to the component maintainers for review & integration.

Thanks for the input so far - cheers,
Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Tim Hunt -
Picture of Core developers Picture of Documentation writers Picture of Particularly helpful Moodlers Picture of Peer reviewers Picture of Plugin developers

Another case where timestamps don't work is simply that a second (the resolution at which we store timestamps) is a long time to a computer. What happens at the moment when the search index thinks its list of ids is up to date, and 0.1s later a user adds a new forum post? That post will never be indexed.

You can see that occasionally in places where Moodle does order-by timestamp: if you keep reloading the page, the sort order is not stable.

Iterating by id is more likely to work.

 

Of course, there is a down-side to iterating by id: you are forced to index in order, oldest first. With a queue, you could use a better order (for example, bump a course to the top of the queue when someone tries to search there). Indeed, one day you could implement a really clever system for huge sites where any context in which no-one has searched for 6 months is purged from the index, but is re-added if anyone ever tries to search there. Certainly not something to implement now, but you are ruling it out completely.

 

1. For each user, there is the user context. You can iterate over those.

3. As I have said in just about every post, we need to see what the biggest possible context is. And I keep suggesting that the General problems forum here is a credible candidate. It is big, but is it big for a full-text-indexing system? We need to measure.

4. OK, * deals with a new forum post. What happens when a moderator moves a thread to a new forum? Or splits a thread at a given post?

 

I still don't see where in your proposed API is the knowledge that the text of a forum post is forum_posts.subject and forum_posts.message, plus all the files in two associated file areas. Or that the text of a multiple choice question is question.questiontext, question.generalfeedback, SELECT answer, feedback FROM {question_answers} WHERE question = :questionid, and SELECT correctfeedback, partiallycorrectfeedback, incorrectfeedback FROM {question_multichoice} WHERE question = :questionid, plus all the files in all associated file areas.

In reply to Tim Hunt

Re: Global Search rewrite

by Tomasz Muras -
Picture of Core developers Picture of Plugin developers Picture of Plugins guardians Picture of Translators

Another case where timestamps don't work is simply that a second (the resolution at which we store timestamps) is a long time to a computer. What happens at the moment when the search index thinks its list of ids is up to date, and 0.1s later a user adds a new forum post? That post will never be indexed.

You can see that occasionally in places where Moodle does order-by timestamp: if you keep reloading the page, the sort order is not stable.

Iterating by id is more likely to work.


It's not a problem - in all the queries it should actually be WHERE timemodified >= ?. This way you will never lose a document set; the worst that can happen is that a document set is indexed twice and you lose some time - that is handled already.

Iterating by timestamp alone will not cover restores, and iterating by ID alone will not cover updates. Only both together give 100% coverage.
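To make that concrete, a sketch of how the indexer could run both iterators for a module (the _gs_id_iterator suffix and search_index_document_set() are my assumptions, following the POC naming):

function search_run_iterators($module, $lasttime, $lastid) {
  // Timestamp iterator (using >=): covers new and updated documents.
  $rs = call_user_func($module . '_gs_iterator', $lasttime);
  foreach ($rs as $record) {
      search_index_document_set($module, $record->id);
  }
  $rs->close();

  // Optional id iterator: covers restored backups (old timestamps, new ids).
  $idfn = $module . '_gs_id_iterator';
  if (function_exists($idfn)) {
      $rs = $idfn($lastid);
      foreach ($rs as $record) {
          search_index_document_set($module, $record->id);
      }
      $rs->close();
  }
}

With >=, a document set on the boundary second may be indexed twice, but is never missed.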

Of course, there is a down-side to iterating by id: you are forced to index in order, oldest first. With a queue, you could use a better order (for example, bump a course to the top of the queue when someone tries to search there). Indeed, one day you could implement a really clever system for huge sites where any context in which no-one has searched for 6 months is purged from the index, but is re-added if anyone ever tries to search there. Certainly not something to implement now, but you are ruling it out completely.

It's not necessarily ruled out - it could be done with a separate list of contexts, blacklisted/whitelisted/whatever. I agree that if there were many more features like this in the future and I implemented this very smart search system, then it might make sense to change the implementation into a queue.

4. OK, * deals with a new forum post. What happens when a moderator moves a thread to a new forum? Or splits a thread at a given post?

When a thread is split, I would expect timemodified to be updated. When a thread is moved - is there a need to do anything? The IDs of the posts will not change, so there is no need to update anything?

I still don't see where in your proposed API is the knowledge that the text of a forum post is forum_posts.subject and forum_posts.message, plus all the files in two associated file areas. Or that the text of a multiple choice question is question.questiontext, question.generalfeedback, SELECT answer, feedback FROM {question_answers} WHERE question = :questionid, and SELECT correctfeedback, partiallycorrectfeedback, incorrectfeedback FROM {question_multichoice} WHERE question = :questionid, plus all the files in all associated file areas.

It's in mod_gs_get_documents - see the POC implementation for forum, glossary, label, page and resource. The way it works: the iterator returns an ID that is meaningful for a module, and this ID points to zero or more documents retrieved with mod_gs_get_documents.
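For illustration, such a function for forum might be shaped roughly like this - the document fields and the choice of discussion id as the iterator id are assumptions; in the POC the exact mapping lives with each module:

function forum_gs_get_documents($id) {
  global $DB;

  // One iterator id (here, a discussion) can expand to several documents.
  $documents = array();
  foreach ($DB->get_records('forum_posts', array('discussion' => $id)) as $post) {
      $documents[] = (object) array(
          'id'      => $post->id,
          'title'   => $post->subject,
          'content' => $post->message,
      );
  }
  return $documents;
}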

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Tim Hunt -
Picture of Core developers Picture of Documentation writers Picture of Particularly helpful Moodlers Picture of Peer reviewers Picture of Plugin developers

Note that all your example code above used >, which suggests that your proposed API makes it really easy for people to introduce a subtle bug without realising it. wink

When http://moodle.org/mod/forum/discuss.php?d=189806 was split out of http://moodle.org/mod/forum/discuss.php?d=188627, were all the last-modified dates for each post updated? Doesn't look like it to me.

When you move a thread to a new forum, that is a new context, so different permissions apply. We were talking above about content being associated with contextid in the index.

 

P.S. Didn't we discuss above that _gs_ was horrible, and that _search_ would be better?

Having seen https://github.com/tmuras/moodle/blob/gs/search2/, I would say:

  1. Putting the code in search2/mod/glossary is hopefully just a short-term measure. That obviously does not work for third-party plugins. The sensible choices are either to add the new functions to mod/glossary/lib.php, or to have a new file mod/glossary/searchlib.php. I would vote for the second.
  2. glossary_gs_iterator would be better as something like glossary_get_search_indices_since($from = 0)
  3. glossary_gs_get_documents would be better as glossary_get_search_documents($id)
  4. glossary_gs_access($id): have you counted how many DB queries that function is doing?! Imagine calling that on a page that displays lots of search results. Ouch! Similarly, the require_course_login in there looks wrong. You are also using exceptions for normal flow control. That is wrong, and will get your code rejected by the integrators (I say that from experience.)
  5. This access check logic must be duplicating logic elsewhere in the glossary module; will it be possible to refactor to eliminate the duplication?
  6. Where are forum.intro and glossary.intro added to the index?
  7. You are forgetting to add files in the mod_forum/post file areas to the index.
  8. Did you think about my suggestion above that it would be easier (for plugin writers) if you considered the files in the file area to be an integral part of the document? (That is, a forum post is the post text + attached documents + embedded documents.) Since there is no way to access the attached files without going through the post, isn't it more sensible for the search system to return the post as the search result?
  9. Anyway, in the two places where you loop over a file area and add all the attachments as documents, there seems to be a lot of duplication. I assume you will be refactoring to make a helper function.
  10. Where are comments on glossary entries added to the index (or where will they be)?
  11. I note that there are no unit tests yet in the search2 folder. Reading the unit tests is a good way to understand what an API is supposed to do. Will you be writing some?
  12. I assume that, as you get towards the end of development, you will be using https://github.com/timhunt/moodle-local_codechecker or similar to bring your code in line with the Moodle Coding guidelines.
In reply to Tim Hunt

Re: Global Search rewrite

by Martin Dougiamas -
Picture of Core developers Picture of Documentation writers Picture of Moodle HQ Picture of Particularly helpful Moodlers Picture of Plugin developers Picture of Testers

Just skimming through, but please please please let's not even consider having any module-specific code under /search

In reply to Martin Dougiamas

Re: Global Search rewrite

by Tomasz Muras -
Picture of Core developers Picture of Plugin developers Picture of Plugins guardians Picture of Translators

Regarding splitting forums and timestamps.
The easy answer is:
* when the split is done, make sure the timestamps are updated; or
* deal with it in the same way as restored backups (when a new ID is created)

But the real problem is a bit harder - I think there will always be some edge cases where the ID or timestamp does not change but the module data is updated. For instance, the problem highlighted by Sam: if we put user names into the search index (and it's a very good idea to put them there, to allow searching on them and to limit calls to the DB), then what happens when somebody updates their profile? We could have an event "profile_updated" to which the search engine would listen and then update all the affected documents - and the same would need to be done for any other piece of information stored in the search index.
The easiest practical solution is to simply do a full re-indexing every now and then. That will cover every problem - even the ones we didn't think of.
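As a sketch of the event-driven variant (the handler, the dirty flag and the table name are all assumptions):

function search_handle_profile_updated($user) {
  global $DB;

  // Mark every indexed document that embeds this user's name as dirty, so
  // the next cron run re-indexes it; the periodic full re-index remains
  // the safety net for anything no event covers.
  $DB->set_field('search_documents', 'dirty', 1, array('userid' => $user->id));
}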

Regarding the gs_ and search_ prefixes: search_ is OK, but it may be confusing as there are several "standalone searches" in Moodle (searching users, forums, etc). Let's make this minor decision at the dev meeting.

Regarding glossary_gs_access performance: yes, this is an obvious place to look for optimizations (e.g. filtering out results that a user cannot see because they are not enrolled in a course). The second, related place is the piece of code in the search engine where relevancy is calculated.

Regarding Tim's points 1, 4, 5, 6, 7, 9, 10, 11, 12 and Martin's note: it's way too early to talk about minor code details (the code on github is just a POC; most of it won't be used) - I would rather have us focused on the API. But since we are at it:
The code I've temporarily put into search2/mod/... is actually copy & paste from the modules. A lot of Moodle code will need to be refactored to expose the functions required to implement global search. Access checks are duplicated across core modules, some code should be cleaned up, and exceptions are already used for flow control - so we need to decide whether to put extra effort into refactoring existing code & updating some APIs, or to accept low quality for the new code.
There should absolutely be no module-specific code under search; it all needs to be maintained with the modules (hence the idea of doing it as a FEATURE).

2. We just said that 2 iterators would be needed, so maybe mod_get_search_indices_since_time($from=0) and mod_get_search_indices_since_id($from=0) - see the sketch after this list.
3. Agreed.
8. I would like to present two links on these occasions (see http://docs.moodle.org/dev/Global_Search#mod_gs_get_documents.28.24id.29) - users could download the document directly or open the forum post.
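Under those proposed names - still subject to the dev meeting decision - the two iterators might look like this for e.g. mod_page (a sketch only):

function page_get_search_indices_since_time($from = 0) {
  global $DB;

  // >= rather than >, per the discussion above: never miss a boundary set.
  $sql = "SELECT id, timemodified FROM {page} WHERE timemodified >= ? ORDER BY timemodified ASC";

  return $DB->get_recordset_sql($sql, array($from));
}

function page_get_search_indices_since_id($from = 0) {
  global $DB;

  // ids are never reused, so strict > is safe here.
  $sql = "SELECT id FROM {page} WHERE id > ? ORDER BY id ASC";

  return $DB->get_recordset_sql($sql, array($from));
}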

Tomasz (Tomek) Muras

In reply to Tim Hunt

Re: Global Search rewrite

by Gavin Henrick -
Picture of Plugin developers

I have been following the discussion, and I was just wondering: is there a list of user requirements, rather than technical requirements, for the global search?

So for example..

  1. A student enters a word or phrase (with or without logic) to search.
  2. The user can select a course, a category, or site-wide scope for the search.
  3. The user can select which module types to search against (all, some, ...).
  4. The user clicks Search.
  5. The results are returned, grouped by site, category, course, activity or not (display options for the search) - paginated, with various filters available.


Each result is returned with a unique URL and context
- like the forum search, link to post or thread
- link to glossary or link to glossary entry
- link to a file resource
- link to a book page, or the book
and so on..

The result (post, or entry) comes up if anything belonging to it gets a hit.

The data that is searched against is module-dependent, but would include the following:

Name
Description
Any other text/html in the module entry
Any files which belong to it (their name, and content where possible)
Tags


This way a file attached to a forum post would not be found except through the forum post, thread etc.

Some other questions are :

Should items be searchable by author or license?

  • (search through all the Creative Commons resources in course A, for example)
  • (search through all items created by Bob)

Just some random-ish thoughts.

 

Sorry if the above already exists somewhere, I couldn't find it.

In reply to Gavin Henrick

Re: Global Search rewrite

by Tomasz Muras -
Picture of Core developers Picture of Plugin developers Picture of Plugins guardians Picture of Translators

Hi Gavin,

Some things will depend on the search engine (e.g. filtering and grouping) - if they're not supported, then it's very unlikely I will be implementing them. Others will depend on the effort we put into modifying core Moodle modules - so there are quite a few technical obstacles to tackle here.

My starting point for the development would be the existing Global Search - I would (more or less) replicate the existing features. If at all possible, I would take the "neat search UI" developed by the OU, as mentioned by Sam.

It would be very good to keep all the ideas and the "wishlist" documented somewhere - please feel free to add it all to the wiki.

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Gavin Henrick -
Picture of Plugin developers

I understand the choice of engine will impact it. I just believe it would be best to think about the most likely searches that a user would want to be able to do,

be it a student

or a teacher

I can imagine that being able to limit to the course, or category would be really important, especially if a teacher has access to a lot of courses.

Also, if looking for something specific, having the results polluted by unrelated data would not be helpful.

Average of ratings: Useful (1)
In reply to Gavin Henrick

Re: Global Search rewrite

by Tomasz Muras -
Picture of Core developers Picture of Plugin developers Picture of Plugins guardians Picture of Translators

Hmm, limits for the course/category sound like a good idea. It would then be very simple to create a course-level block that searches within the current course.

As for selectively choosing the results: as far as I remember, the current Global Search provides a list of checkboxes where a user can choose which resource types are searched (forum, resource, etc).

Tomasz (Tomek) Muras

In reply to Tomasz Muras

Re: Global Search rewrite

by Jürg Hoerner -

Global search is one of the modules that are very important for schools.

The current search result, with a description of the course, is OK. If a sort function (by course) could be added, it would be a very good feature.

If you need any help let me know.

Jürg Hoerner

In reply to Tomasz Muras

Global Search by Voice...

by Gavin Henrick -
Picture of Plugin developers

So after Dan did some cool work on a Siri interface to the Moodle Tracker, how about considering:

SSFM - Siri Search for Moodle

An interface so that Siri can search the user's Moodle, using your global search work to return options?

Perhaps even go ahead and embed this into the Moodle mobile app?

"Where is that powerpoint i was reading yesterday"

"find me the forum post about $sucbject X" 

I can imagine other uses, such as "Siri, message my teacher Tom on Moodle to ask for an extension" or "Siri, will you take my quiz for me?" wink

In reply to Gavin Henrick

Re: Global Search by Voice...

by Colin Fraser -
Picture of Documentation writers Picture of Testers

No, but I can imagine "Siri, is your real name Skynet?" and "Where are you Siri?" with "I'll bee barch." smile

In reply to Gavin Henrick

Re: Global Search by Voice...

by Tomasz Muras -
Picture of Core developers Picture of Plugin developers Picture of Plugins guardians Picture of Translators

If somebody would like to work on the Apple part, I could support them on the "Moodle side".

Tomek

In reply to Gavin Henrick

Re: Global Search by Voice...

by Dan Poltawski -
Unfortunately the work I did is not based on an officially supported API.

In fact it's well and truly a hack (you need to intercept the requests for an apple.com domain name and install a custom certificate authority on the phone so that it accepts the unofficial SSL certificate).

So there would be little point in working on this until it was officially supported.

But even if this did become an officially supported API, there were some interesting challenges which came up even when doing it in this hacky way: Siri wants "Moodle" to be "noodle" and "bug" to be "book" (and other such compromises, which make it surprisingly accurate most of the time).

Let's crack basic text-based search before we start integrating voice recognition with contextual cleverness wink