Hi Aurelia,
I think this is a fascinating proposal...but I'm wondering if web scraping is the way to go.
Any given Moodle site is really experienced by a "user", so your scraper agent would need to be a "user" of some sort (presumably a student...but that's not a given), and it would be subject to the rules that apply to that user. This could mean there is content your user isn't eligible to see, and someone would need to ensure the account is given the "correct" setup for all of the content you'd like to archive.
On top of this, a scraper's going to pick up all of the Moodle UI in addition to the actual content.
You wouldn't need to create the scraper as a "Moodle" plugin either; in the traditional sense it would simply follow all of the links *that it can see* and index the content.
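As a rough illustration of "follow all of the links *that it can see*", here is a minimal Python sketch using only the standard library. Everything named in it is an assumption for illustration: the `moodle.example` URLs, the `FAKE_SITE` dictionary, and the `fetch` callable, which stands in for "request this page as the scraper's logged-in user" and returns `None` for anything that user isn't allowed to see.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl of same-host pages reachable from start_url.

    `fetch` is a caller-supplied (hypothetical) function returning the HTML
    of a URL as seen by the scraper's user, or None if it is not visible.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # the scraper's user cannot see this page
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href).split("#")[0]  # drop fragments
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Stand-in for a real site: the quiz page is linked but not visible to this user.
FAKE_SITE = {
    "https://moodle.example/course/view.php?id=1":
        '<a href="/mod/page/view.php?id=2">Page</a>'
        '<a href="/mod/quiz/view.php?id=3">Quiz</a>',
    "https://moodle.example/mod/page/view.php?id=2": "<p>Visible content</p>",
}

pages = crawl("https://moodle.example/course/view.php?id=1", FAKE_SITE.get)
print(sorted(pages))
```

Note that the quiz link is discovered but never archived, which is the visibility problem described above: the archive is only ever as complete as the scraper user's permissions.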
Creating a *plugin*, however, does mean that in theory you'd have direct access to the raw content in the Moodle database, and you wouldn't need to do any "scraping". However, your plugin would somehow need to be aware of how every plugin (that you're interested in) represents its data internally...
If the scraper is able to grab all of the HTML, CSS and related files, I suspect that it still wouldn't capture the backend processes that would need to run for a "standalone" copy to "function" (i.e. anything that makes further calls via AJAX). From what I can see of the Quiz archiver, it essentially gets the quiz attempt and re-writes the attempt data into a "new" static format, which means it has some understanding of how the quiz module works internally (that's based on just a short squint at the code).
It may be worth looking at Moodle's features around Subject Access Requests (this supports the GDPR requirement to be able to give a specific user all of *their* data). There could be some interesting approaches in that sub-system that could be mapped over to your requirements, which are (as I understand it) not about the "user's" content but about more arbitrary content.
Every Moodle plugin is "obliged" to provide in its privacy code (https://moodledev.io/docs/5.0/apis/subsystems/privacy) an "export_user_data()" function. There isn't, as far as I'm aware, an "export_activity_for_archive()" function, but maybe there should be, so that each module is inherently responsible for implementing a representation of its activities in a form that is suitable for archival purposes...
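To make the shape of that idea concrete, here is a minimal sketch of what such a contract could look like. It is written in Python purely for illustration (Moodle itself is PHP), and nothing in it is a real Moodle API: `ArchivableActivity`, `PageActivity`, `archive_course()` and the dictionary layout are all hypothetical, and "export_activity_for_archive()" is the imagined function from above, not something that exists today.

```python
import json
from abc import ABC, abstractmethod


class ArchivableActivity(ABC):
    """Hypothetical contract: each activity type renders its own archival form."""

    @abstractmethod
    def export_activity_for_archive(self) -> dict:
        """Return a self-contained, static representation of the activity."""


class PageActivity(ArchivableActivity):
    """Illustrative activity type that knows its own internal data layout."""

    def __init__(self, name, html_body):
        self.name = name
        self.html_body = html_body

    def export_activity_for_archive(self) -> dict:
        return {"type": "page", "name": self.name, "content": self.html_body}


def archive_course(activities):
    """The archiver only calls the contract; it needs no per-module knowledge."""
    return json.dumps(
        [activity.export_activity_for_archive() for activity in activities],
        indent=2,
    )


archive_json = archive_course([PageActivity("Welcome", "<p>Hello</p>")])
print(archive_json)
```

The point of the design is that knowledge of how each module stores its data stays inside that module, so the archiving tool doesn't have to understand every plugin's internals.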
Archival of work submitted through the assignment activity is an interesting one, as there's a question as to whether "you'd" have the copyright to hold a copy of that work. I'm assuming that if the "you" is the institution that owns the Moodle, it may have a copyright assignment made as part of the student's submission, but that could vary between institutions. For instance, we have the position that the student retains the copyright to their submission by default, so any further use of it needs to be clarified. (I just mention this as archiving student-submitted work may be well served by a switch to turn it on / off depending on the executing institution's situation.)
I think option 2 has its own issues as well, as you'd be capturing just one user's playback, or potentially many users' playbacks, but I don't see that you'd get all potential user playbacks, so it would always be representative rather than definitive.
Anyway, I'm not sure how useful this is, but I look forward to hearing more about this activity!