Moodle announces Project Inspire! Integrated Learning Analytics Tools

Re: Moodle announces Project Inspire! Integrated Learning Analytics Tools

by Vera Friederichs -

I like decision trees and random forests as well, but, at least from what I've seen, they have difficulty generalising complex relations between features. Since we have no control over the input features, a neural net should be less prone to overfitting and better able to generalise; I agree that random forests would partly solve this, but I don't know to what extent.


Random forests are quite "easy" in terms of modelling. You can (almost) throw in as many independent variables as you want without facing serious problems like in linear regression, where the variables must not be highly correlated. If you use cross-validation during training (splitting your data into several subsets, holding one subset out for validation while training on the rest, then rotating which subset is held out, etc.), overfitting should not be an issue.
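The rotation described above is ordinary k-fold cross-validation. A minimal index-splitting sketch in plain Python (no ML library assumed; function name is mine, not from any plugin):

```python
# Minimal k-fold cross-validation index generator (illustrative sketch).
# Each subset is held out once for validation while the rest train.

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # the last fold absorbs any remainder
        end = n_samples if fold == k - 1 else start + fold_size
        val_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, val_idx

# Example: 10 samples, 5 folds -> every sample is validated exactly once.
folds = list(k_fold_indices(10, 5))
assert len(folds) == 5
all_val = [i for _, val in folds for i in val]
assert sorted(all_val) == list(range(10))
```

In practice you would also shuffle (and possibly stratify) the data before splitting; library implementations such as scikit-learn's `KFold` handle that for you.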

The R xgboost package is simply more robust to use, because it handles missing values better, its models are smaller in file size, and training is faster.
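For context on the missing-value handling: xgboost's sparsity-aware split finding learns a per-split "default direction", so samples with a missing feature are routed to whichever branch gives the better fit. A toy decision stump illustrating the idea (plain Python, fixed threshold; all names are mine, this is not xgboost's code):

```python
# Toy illustration of how tree boosters like xgboost can handle missing
# values: at each split, samples with a missing feature are sent in a
# learned "default direction" (the side that yields fewer errors).

def fit_stump(xs, ys, threshold):
    """Fixed-threshold stump: each side predicts its majority label;
    the default direction for missing values (None) is learned."""
    left = [y for x, y in zip(xs, ys) if x is not None and x < threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x >= threshold]
    left_label = max(set(left), key=left.count) if left else 0
    right_label = max(set(right), key=right.count) if right else 0

    def errors(default):
        err = 0
        for x, y in zip(xs, ys):
            if x is None:
                pred = left_label if default == "left" else right_label
            else:
                pred = left_label if x < threshold else right_label
            err += pred != y
        return err

    default = "left" if errors("left") <= errors("right") else "right"
    return left_label, right_label, default

# Missing values (None) co-occur with label 1 here, so the learned
# default direction routes them to the side that predicts 1.
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [0,   0,   1,    1,   1,   1]
left_label, right_label, default = fit_stump(xs, ys, threshold=5.0)
assert (left_label, right_label, default) == (0, 1, "right")
```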


I am very curious about the results of analysing the submitted data sets and of creating models for the different ways moodle is used.

I have to say that I work for a commercial company selling software for risk prediction in moodle, i.e. with exactly the same goals as Project Inspire. My interest is of course to understand how other software works, and my manager is OK with me sharing our concepts, insights, etc. openly. I do not think there needs to be competition between open source and commercial software (actually I am a big fan of the open source idea), because the crucial point here is the modelling. Our clients pay us to do it for them with their data, evaluate the model after the term has finished, and so on. With open source software they would instead have to pay their own person to spend quite some time doing (and first understanding) the modelling.

I cannot submit any data (because it is the clients' data). I can only share the experience that we were initially hoping that, after some initial clients, we would "converge" to a general model. But every client used moodle in a different way, which made individual adjustments necessary, up to refusing to create a model at all if the client did not use the gradebook "properly" (i.e. did not have a minimum number of graded items in moodle), or if in hybrid courses the online part was just too small. But of course the beauty of open source is that a lot more people can contribute, and if a much higher number of people submit their data, it may still be possible to create satisfying general models.

In reply to Vera Friederichs

Re: Moodle announces Project Inspire! Integrated Learning Analytics Tools

by David Monllaó -

Some months ago I set up a new model to test that the API works at different context levels. It was about detecting late assignment submissions on specific assignment activities; it is nothing serious, but it allowed me to see how models generalise to different sites. It is easier to play with this model than with prevention of students at risk because here the label is clearly defined and easily calculated, and the classes are well balanced, which also helps. Even though I used 2 datasets whose courses are not very well structured nor very clean, some indicators are partly coupled to the label, and most assignments are submitted the day before the due date (or the same day hehehe), the model is able to predict late assignment submissions 2 and 4 days before the due date with around 75-80% accuracy using test data from the same site (not used for training, obviously) and around 70-75% using another site as test data. Only one site's data was used to train the model; the difference between these 2 accuracies should decrease as we train on datasets from new sites, although accuracy on unseen sites will likely always be lower.

The student indicators this model uses are:

  • How close to the due date the student submitted other assignments (also considering whether they submitted at all)
  • The same for quizzes and their close dates
  • The same for choice activities and their close dates
  • Weighted number of write actions on the analysed activity
  • Weighted number of read actions on the analysed activity
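The first indicator above could be computed roughly as follows; a hedged sketch, assuming a normalised [-1, 1] scale and a one-week window (the field names, scaling, and window are my assumptions, not the plugin's actual code):

```python
# Sketch of the "how close to the due date did the student submit"
# indicator, normalised to [-1, 1]. Assumed semantics: 1.0 means the
# submission arrived a full window (or more) before the due date,
# 0.0 means exactly at the deadline, -1.0 means no submission (or
# extremely late). Timestamps are Unix epoch seconds.

def submission_timing_feature(submitted_at, due_at, window_seconds=7 * 24 * 3600):
    """Return the normalised submission-timing indicator for one assignment."""
    if submitted_at is None:        # the student didn't submit at all
        return -1.0
    margin = due_at - submitted_at  # seconds submitted before the due date
    scaled = margin / window_seconds
    return max(-1.0, min(1.0, scaled))

due = 10_000_000
assert submission_timing_feature(None, due) == -1.0            # no submission
assert submission_timing_feature(due, due) == 0.0              # at the deadline
assert submission_timing_feature(due - 14 * 24 * 3600, due) == 1.0  # well early
```

Averaging this value over a student's other assignments (and analogously over quiz and choice close dates) gives one number per activity type.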

The key to make this model generalise well to other sites seemed to be to use extra indicators to add context to the student indicators:

  • Is activity completion enabled for that activity?
  • How much weight does the activity have in the gradebook?
  • Is grade to pass set for that activity?

As I said above, this is just an example, and adding more context indicators should help this model generalise even further. This is not part of the current HQ priorities.

This model is available at https://github.com/dmonllao/moodle-local_testanalytics/tree/late-assign-submissions (late-assign-submissions branch). Again, this is nothing serious, so don't expect much documentation, but you should be able to install the plugin and evaluate this late assignment submissions model using your site data without any problem. If you are interested in playing with this model, and given that you are a data scientist, I would recommend evaluating the model on a few of your client sites (using --timesplitting='\local_testanalytics\analytics\time_splitting\close_to_deadline') and, instead of using the predictions processor that is included in moodle, downloading the resulting .csv files (you can use https://gist.github.com/dmonllao/d1db52b11c9ca00e76ab8ddcb95c6c93 for that). This way you can use your own algorithms to compare how well the model generalises.

I got the results presented above using neural nets with adam optimization, dropout regularization, tanh activation and a learning rate decaying from a generous 0.5 down to 0.005. You can reproduce them with https://github.com/dmonllao/tensorflow-performance-playground (all options are documented in python train.py --help; only a subset of them appears in README.md). The results will always depend on the datasets you use, but given that the datasets I used were significantly different, I wouldn't expect much difference between using unseen test data from one of the training datasets and using test data from a completely different site.
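The decaying learning rate can be sketched as an exponential schedule from 0.5 down to 0.005 over the training run; the exact decay law in the playground repo may differ, this is just one common way to implement it:

```python
# Exponential learning-rate decay from 0.5 down to 0.005 over n_steps.
# The exact schedule used in the playground repo may differ; this is an
# assumed form, chosen because it hits both endpoints exactly.

def decayed_lr(step, n_steps, lr_start=0.5, lr_end=0.005):
    """Learning rate at `step` in [0, n_steps - 1], decaying
    exponentially from lr_start to lr_end."""
    decay = (lr_end / lr_start) ** (step / (n_steps - 1))
    return lr_start * decay

assert abs(decayed_lr(0, 100) - 0.5) < 1e-9     # starts at 0.5
assert abs(decayed_lr(99, 100) - 0.005) < 1e-9  # ends at 0.005
```

In TensorFlow this would typically be wired up with a learning-rate schedule passed to the optimizer rather than computed by hand.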