I'd be happy to run some performance tests on the different options, but could someone confirm what the schema would look like for option b (1 table per data type)?
Here are the use cases I've come up with, let me know if I've missed anything:
1. Return all profile field data for 1 user (e.g. view user profile)
2. Return all profile data for multiple users (e.g. a user's report)
3. Return all users where one particular field value matches a string (searching site for match against 1 field)
4. Return all users where any field value matches a string (more general search case)
Let me know if I've missed anything?
Simon
I no longer work for RemoteLearner, but I was involved in the design of the custom fields. You can see the database schema here: https://github.com/remotelearner/elis.base/blob/master/core/elis/core/db/install.xml#L106 The tables of interest are the elis_field_data_* tables. There is one table for long text, short text, integer, and decimal data. Each table has the same basic structure: the regular id field, a field_id (indicating the custom field that the data is associated with), a context_id (indicating the context[1] that the data is associated with), and a data field.
There are a few other tables, but those four are the main ones of interest in this discussion.
An advantage of this method is that you don't need to grant permissions to modify the database schema. A disadvantage is that you're stuck with "four sizes fit all", which is an improvement over the "one size fits all" approach of Moodle's custom user fields, but you can't, say, customize the precision of numeric data.
[1] Custom fields are associated with contexts, so fields can be attached to courses, users, activities, categories, or ELIS' custom contexts. If I were to redesign it from scratch today, I might instead have "type" and "instanceid" fields, similar to mdl_context's "contextlevel" and "instanceid", or mdl_course_modules' "module" and "instance" fields. This would let you attach custom fields to anything that had an ID, which may not necessarily be a context.
I've done some quick testing on the queries, full explanation of what I did and how here:
https://github.com/simoncoggins/cf-perf-test
If anyone wants to see if they can optimise any of the queries then please go ahead, I just did it quickly so there might be room for improvement in some of them. Also let me know if I've misunderstood any implementation details. I didn't bother with indexes on the data fields at this stage.
As for the results for the 4 cases listed above this is what I found:
Schema | Case 1 | Case 2 | Case 3 | Case 4 |
Existing (1 join per field) | 1.494 | 37.076 | 190.889 | 202.240 |
Existing (pivot hack) | 1.151 | 45.086 | ||
ELIS data type tables | 1.719 | 59.990 | 199.228 | 210.616 |
Martin's user_extend | 0.919 | 32.123 | 196.734 | 196.383 |
All query times in milliseconds. The pivot hack refers to a way of getting the custom data via a single join suggested by Loic - see the query for details.
So basically the user_extend schema is marginally more efficient, although there's not much in it. It is significantly easier to write the queries for though.
Simon
And it not just being used for user profile fields, but courses, file metadata, etc., etc..
Imagine a Moodle with 20 custom fields for each of 3000 courses, and 10 custom metadata fields for 500000 files.
Another use case for users is searching for users who match several custom fields of different data types.
Jenny,
Are you suggesting that the same field definition would be used for a course and a file? I think in both solutions that would never be the case as the field definitions would be unique to the context level. In the ELIS case it's explicitly tied to context level and in the extension table it would be a new column in the extension tables for courses and files respectively.
Or am I missing something?
It probably depends on what you consider a "Field definition".
I think we do want Metadata field types defined globally. For example, if we have a field type "Licence" with a fixed set of possible values like "All rights reserved", "CC-BY-NC", ... then the set of permitted values should be the same for all types of object that have a License field.
However, the Licence field for objects of type Resources could well be a different field than the Licence field for Qusetions in the question bank.
On the other hand, even if they are separate field, we may still want to search for objects of any type where Licence = CC-BY-NC.
Indeed, Objects that metadata can be applied to would naturally form a type hierarchy, and you might apply some metadata fields to the base Object type, and others only to specific sub-types - or that might be too complex.
Clearly, lots of design decisions to make here.
Hm, its an interesting decision on what the definitions table should hold if we want the same field on multiple things. I agree with Tim that you would not want to define things twice, but just once. You could do it through the UI by "copying" a definition, but this would cause problems in future if you updated one and not the other. Perhaps you would create a definition on one thing and then have a final setting in the defintion form "applies also to" with a checkbox/multiselect list of other things you could add it to... or even just a checkbox/multiselect for "applies to" in the first place.
Then our choice is about how we reflect this in the definitions table - I think we have to create multiple entries, one for each object.
So where that gets us for the queries, is Justin is right that the unique identifier for the object being extended would make them look like two different fields even though the definitions themselves are the same thing to the human eye.
But if we define it at the system level, then it would be injected into all the edit forms for everything - user, course, module, file ... and it might not actually make sense on some of them. Licence for example would be fine on all those except user.
That might be OK because its never come up for you, and because you have your own setup screens for the elis fields, but I'm not sure it will work in the new core version that we've been designing because we don't know what different admins might do with it.
From the performance perspective though, I think the query for "file licence=x and course licence=x" is no more arduous than "course licence=x and course author=y".
Actually, the ELIS database tables were designed to allow for a field to be associated with multiple context levels, by having multiple entries in the elis_field_contextlevels and elis_field_category_contexts tables (which is why those are separate tables, and not just columns in the elis_field and elis_field_categories tables).
But at some point we decided that the extra complexity wasn't worth supporting at that point. IIRC, the complexities were mostly in the UI, and in making sure that things were consistent, but I don't remember the details. One issue is: it might be confusing if you allow fields to be attached to different context levels, then if, say, you have a course and an activity with a license field, how do you know if they're the same field, or different fields with the same name?
Thanks for the tests. Could you change the index in the ELIS tables so that there's one index with all three fields (fieldid, type, itemid). I don't know if it's the same with Postgresql, but with MySQL, it will only use one index on a table at a time.
I'm not sure how much of a difference it is that queries are easier or harder to write. Most queries would be written by code anyways.
Hi Simon,
Thanks for those numbers. I can run some tests on a really large dataset (100k users) with a single field and maybe something like 20+ fields. I can also do this with one of our reports so it's using an example of a real query that our system is currently using.
Hopefully the results aren't terrifying. ;)
This is the link highlighting the relevant fields for the schema in the current 2.3.1.1 version of ELIS:
https://github.com/remotelearner/elis.base/blob/elis23/core/elis/core/db/install.xml#L52-165
It looks like our master branch got mangled in that Github repository so Hubert's original link will be pointing to the wrong thing when that is un-mangled. =)
Hi
Aside those implementation issues, I'd like to focus on the expected features. The current custom fields on users (AKA user profile) make it possible to:
- define custom fields of several types (checkbox, datetime, string, etc) that apply to a target (the generic user),
- order them and group them by categories,
- put attributes on this fields, like a description or a visibility scope.
- fill in these fields with values, then fetch theses values for a given record (an user).
If I understand correctly, the two proposed implementations would break some of these features.
Here are a few questions on the purpose of the future custom fields:
- Should the custom fields be declared at the platform level or created at the fly? E.g. should a course form allow the teacher to add whatever key/values he wishes for?
- If a previous definition is still required, should a custom field be defined per target (user, course, etc) or globally? Or both, i.e. some fields apply only to a specific target and some to everything?
- Should a category system still exist on custom fields?
Our POV:
- If custom fields definitions were to disappear, it would be a major drawback. We really need to declare that some added fields are locked or hidden.
- A per-target definition is important to us, whereas global fields are not. Global fileds would also add complexity (both for the users and the developers).
- We need categories when the are dozens of custom fields.
For a quick explanation of our POV, we have developed for a client a custom field system in Moodle. It's in the tracker with the code and documentation on Github . It is an extension of the current user profile system to other objects. Its SQL schema was partly de-normalized for performance reasons. We only added an UI to handle course custom fields (by sharing code with the user profile UI), but it would be easy to add interfaces for other objects (cohorts, etc), or create a generic web page for the custom fields of all targets.
How do you think that the two proposed implementations break your expected features? Remember that the two proposals only cover storing the actual field data. We haven't discussed storing the field discriptions yet. For example, if you look at the ELIS database schema (linked above in Justin's post), you can see some of the description data that ELIS has, which includes field categories.
As far as your questions go:
- Fields absolutely must be predefined. Having different fields for different courses would create a big mess.
- It seems to me that everyone agrees that fields should be defined per-target. (e.g. it makes no sense to have a license field for users.) Some people want fields to apply to multiple targets. I think that would be a useful feature, if it can be implemented sensibly without making it too confusing for users.
- Most probably, yes. Ideally, IMHO, the custom user profile fields would be replaced by the new generic custom field system -- it's silly to have two different custom field systems, especially if one is generic. So the new system should have all the features of the current system.
How do you think that the two proposed implementations break your expected features? Remember that the two proposals only cover storing the actual field data.
We thought so because the two proposals only covered storing the actual field data, without a word about field categories and definitions. It was a bit scaring to see such a focus on this part of the implementation with so little about the main features of the generic custom fields. As if the developer API and the user interface were to be a consequence of the DB schema.
Thanks for clarifying that the current features will probably not disappear.
For example, if you look at the ELIS database schema (linked above in Justin's post), you can see some of the description data that ELIS has, which includes field categories.
The link is dead, and AFAIK always was. The poster seemed to known this URL was wrong because his "master branch got mangled". I browsed a bit in the Github repositories but didn't find the schema mentioned. I finally found an ELIS-related SQL in the code that tests the performance.
Until Justin can provide a better link, try this one: https://github.com/remotelearner/elis.base/blob/MOODLE_22_STABLE/core/elis/core/db/install.xml#L52-165
Basically, the same thing except MOODLE_23_STABLE:
https://github.com/remotelearner/elis.base/blob/MOODLE_23_STABLE/core/elis/core/db/install.xml#L52-165
The elis_field_owner field seems a very good idea, too.
By the way, I think that a "shortname" could be useful for the field_categories too, not only the fields. I'm thinking to the needs of minor edits that can occur on the displayed fields (name), but where an "admin-readable" identifier should stay inchanged.
The elis_field_owner field was created to indicate what plugins were allowed to write to a certain field (and there was a "manual" plugin to indicate that it could be modified by users in the edit form).
In our uses, we didn't come up with a need for a shortname for categories, since they are mostly for end-user/UI use, but I wouldn't see a problem with adding it, if there is a use for it.
Still no clear answer, but this slide-deck from Martin Fowler nicely summarises the issues: http://martinfowler.com/articles/schemaless/
As a Moodle site administrator and user, I care about the database implementation only because it affects performance. Other than that, I'm a lot more interested in how this will be implemented in the UI.
We are a Remote Learner customer, and we have ELIS. We find it very difficult to use, primarily because the added fields are not integrated with the objects they go with (other than User Profile fields). For example, a field added to a course attaches to a separate "Course Instance" object instead of attaching to the actual course, and the fields are not visible within the course itself. This means we can't use these extra fields to store information about each course in a way that other users, e.g. instructors, can easily see it. We can't use existing bulk course creation tools to modify these fields. (Remote Learner's own Integration Point/DataHub utilities require us to update Moodle courses and ELIS course instances separately.) We can't report on those fields easily, either.
I was very excited to see the integration request by François Gannaz in MDL-34634 because it seemed to me that this would integrate with the existing functionality in the same way that the User Profile fields do, which is very usable. Couldn't we get this moving for inclusion in 2.6? Are any of the other proposed back-end implementations that much more efficient? Are we letting perfect be the enemy of good?