I'm used to this functionality from SQL Server and feel that the community would benefit from this feature, but thought I'd ask.
I've looked at the Moodle DB to see if it has the data to generate an incremental backup and I don't feel it does at the moment. So what should we do to support it?
My gut feeling is that the timemodified field, used in many tables such as quiz, resource, groups and glossary, should be used in more if not all tables. I haven't done an audit of how many tables actually contain this field but plenty seem to.
But is this a good solution? Does anyone have any better ideas?
Hi Colin - I am right now distracted with webDAV (see http://moodle.org/mod/forum/discuss.php?d=89123 ) but I am drafting some design notes on the side for this.
My thinking at the moment is based on avoiding dependence on metadata such as timecreated/timemodified -- it is a huge undertaking to add those fields to everything, and impossible to guarantee that all the modification paths update them.
Instead, there are some techniques we can use that are based on the identity of the content itself - similar techniques to what GIT uses internally, aggressively hashing everything. The trick will be in finding the right granularity to break down the export/import to apply the hashing. At that point, we can design something that smartly decides how far behind each client is, and gives them a zipfile that contains exactly what they need. No more, no less.
Just need to find time to flesh out the design a bit
One question, given your idea: what parameter(s) would you need passed in to be able to work out the diff? Would dates and times work, or are you expecting something else?
It ain't much of a plan without both parts
What we'd need to work out the diff is the identities of the objects the client already has. I know - it sounds kooky-talk. Let me flesh out my plan -- or cheat and read a bit on GIT's internal storage & network protocol to see where I'm borrowing ideas from
Note though - we won't be able to mimic that protocol (it's quite involved anyway) but if we build something simple and web-based using similar strategies, I think we'll have a good plan.
Incremental Backup & Restore plan
Ended up sitting down and having a serious think of the design for this over the weekend. These are my notes -- nothing definitive, but I think I am mapping out the problem.
I started working out how an identity-based approach would work -- thinking of "identity" in the sense of the identity of the content, for example, as shown by the SHA1 of a file. You know with certainty that any change to the file will change the SHA1 hash of it (let's leave aside for a moment perceived or real hash collision issues); therefore, as long as the file hasn't changed at all, you will get the exact same SHA1. There are lots of good things that you can do with that -- but the main requirement for this to work is that the output (from restore) is stable. If the files you are hashing are unstable (like an embedded timestamp every time you save them, or platform-dependent newline issues) then the hash will change even though the content actually hasn't.
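A minimal sketch of this notion of content identity (Python here purely for illustration -- the thread assumes PHP or shell tools; `content_id` is a made-up helper name):

```python
import hashlib

def content_id(data: bytes) -> str:
    """Return the SHA1 hex digest identifying this exact content."""
    return hashlib.sha1(data).hexdigest()

# Identical content always yields the identical identity...
a = content_id(b"<resource name='foo'/>\n")
b = content_id(b"<resource name='foo'/>\n")
assert a == b

# ...while any change, even one byte, yields a different one.
c = content_id(b"<resource name='bar'/>\n")
assert a != c
```

This is exactly why stable output matters: if the backup writer embeds a timestamp, the bytes differ on every run and the identity is useless.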
I initially thought we would have to work at the module or resource level -- but once I started fleshing out the plan I realised that if we can guarantee a stable output from backup, we can just work with full backups without changing the format much. The plan's gotten much simpler -- this writeup is long, but the plan is extremely straightforward
So the outline looks roughly like this:
- On the server, provide a basic workflow that goes from the "generate a backup file" to "publish the backup file".
- Every published backup file is kept for as long as this course is relevant, even if superseded. Internally, it will have a unique name: its SHA1.
- If the published backup file will supersede older backup files, we generate static "incrementals" against its last N predecessors (or maybe against all). These incrementals can be thought of as "diffs" or "patches" and are identified internally as "from SHA1 to SHA1".
- Other aspects of publishing also take place - but are out of scope for this.
- Backup generation MUST have stable output. If content has not changed, then the backup must be identical even if other things may have changed.
- Restore must be stable - it must be clear to restore code how to determine if a resource being restored maps to an existing or previously seen resource. This might need a uid scheme similar to Midgard/Repligard. Scenarios:
- New resource
- Existing resource - unchanged locally
- Existing resource - changed locally
- Seen resource - deleted locally
- With backup and restore being stable, a simple compact diff/xdelta of the (unpacked) directory tree from the backup files will be efficient. The diff/xdelta does not need to be reversible, and can declare strictly the SHA1 of the file it must be applied to. This means that we can produce extremely efficient deltas by not including context or "content to be replaced". One-way xdeltas can just declare byte ranges where the new content is applied.
- Review possible xdelta algorithms that we can ship as a small binary for Win32/OSX/Linux or that are simple enough that we can implement in PHP quickly.
- The xdelta utility on linux/osx is very good for this - and probably has a Win32 port
- Together with the xdelta generation it is possibly a good idea to have a PHP-based script that can do an XML-aware compare and output the UIDs of objects that seem to have changed. This can make the restore faster / cheaper. It can run on the server or the client, but it makes sense to supply it as accessory metadata with the "patch" file. The client is free to ignore it (and load all the restore file regardless) or regenerate it locally.
- The client always keeps the downloaded backup files, and knows their SHA1 identity. So when it "browses" for course updates and finds an update for course X, it reads the "new" SHA1 and checks the SHA1 of the last version it downloaded, so it fetches the appropriate delta file. If that file does not exist (maybe this client has fallen behind, and we only keep deltas against recent versions) then it will fetch the full backup file.
- If it has fetched a patchfile, the client makes a copy of the unpacked restore tree, and applies the xdelta.
- If it has fetched a full zipfile, it will unpack it and run the XML-aware "UIDs changed" script mentioned above.
- Restore is invoked to execute over the unpacked directory, and can take advantage of the "UIDs changed" hints file to avoid restoring unchanged objects.
- The latest unpacked directory is kept to serve as the base of the next update. If the operation has been successful, earlier versions can be deleted.
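The "apply the xdelta" step above relies on the patch declaring the SHA1 of the exact base it applies to. A rough sketch of that precondition check (illustrative Python; the real tooling would be xdelta plus PHP glue, and `apply_patch` and the patch structure are invented for this example):

```python
import hashlib

def apply_patch(base: bytes, patch: dict) -> bytes:
    """Apply a one-way, non-reversible patch. The patch declares the
    SHA1 of the exact base it applies to, plus byte ranges of new
    content; no 'old' data needs to be shipped."""
    if hashlib.sha1(base).hexdigest() != patch["base_sha1"]:
        raise ValueError("patch does not apply to this base file")
    out = bytearray(base)
    # Each op replaces the byte range [start, end) with new bytes.
    # Apply from the end backwards so earlier offsets stay valid.
    for start, end, new in reversed(sorted(patch["ops"])):
        out[start:end] = new
    return bytes(out)
```

The key point is the refusal on mismatch: a patch chain can never silently corrupt a file, because the base identity is verified before anything is touched.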
One thing to note is that I don't think that zip generation is stable -- IIRC it embeds a timestamp, and perhaps hostname too. If that's the case, we'll want to compute the SHA1 of the whole backup using a more stable approach, and ignore the literal bits in the zipfile itself. Actually - different compression settings will radically change the bits, so yes we should make the backup id something more similar to a git "tree" identifier.
What is a tree? And how do we compute a tree identifier? It is a directory tree (with as many subdirectories and files as it has). To build a unique identifier of the tree, we:
- Compute the SHA1 of each file
- Build a manifest listing each file's internal path and its SHA1
- And then we compute the SHA1 of the manifest. That SHA1 is the identity of the "tree"
- Note that we only pay attention to files - empty directories aren't interesting.
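The three steps above can be sketched as follows (Python for illustration; `tree_id` is a hypothetical helper, not actual Moodle code -- the sort keeps the manifest stable regardless of filesystem traversal order):

```python
import hashlib
import os

def file_sha1(path: str) -> str:
    """SHA1 of a single file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def tree_id(root: str) -> str:
    """Identity of a directory tree: the SHA1 of a sorted manifest of
    '<internal path> <file SHA1>' lines. Empty directories are ignored,
    so only file content contributes to the identity."""
    lines = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            lines.append(f"{rel} {file_sha1(full)}")
    manifest = "\n".join(sorted(lines)) + "\n"
    return hashlib.sha1(manifest.encode("utf-8")).hexdigest()
```

Two unpacked backup trees with identical content will get the same identifier even if their zipfiles differ bit-by-bit, which sidesteps the unstable-zip problem described below.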
This plan uses some hashing tricks, but the core idea is to publish relatively dumb xdelta patches. This will work great but has some limitations - most notably, if files in course files are moved or renamed, it won't do anything particularly smart. For a rename from X to Y there will be an instruction to delete file X, followed by the whole contents of Y. And while we can address that basic case (by doing a "smart compare" of the tree manifests looking for identical hashes), the not-so-uncommon case of move + small change is extremely hard to handle.
We could address that by using the git-apply-patch utility, which handles this gracefully. I need to think more about this angle - there are a few issues with it
- I want to avoid funny dependencies, like having to ship executables for windows in the client installer. The Win32 and OSX versions of git are excellent, but it's one more thing to support. And while we can probably implement xdelta and related tricks in pure PHP, git-apply-patch is way too complex.
- AFAIK, the git machinery produces reversible xdeltas -- which means that while we'll save bandwidth in move operations, we will waste it in every other change, as it'll include the "old" data that the patch replaces, as well as the new data. Using a plain unidirectional xdelta, a naive move detector on the tree manifest, and training course content maintainers to avoid moving large resources is probably a better option.
Edit: There's a nice win32 statically compiled xdelta.exe here http://evanjones.ca/software/xdelta-win32.html
Wow, thanks Martin you've clearly put a lot of thought into this. A couple of things aren't clear to me though so just thought I'd ask.
Firstly, it sounds like this approach will create its own specific backup file type rather than the standard zip containing XML and user files.
Secondly, if that's correct, you hint at the need for a client tool to read the xdeltas. This implies that the backup file no longer follows the Moodle standard and thus we'll have two types of backup to support, plus a client tool. Is that correct or a misunderstanding?
My hope was to simply include the changed files and create an xml file with the changed data thus leaving the xml file the same and consequently the restore process the same or very similar at the least.
Finally, I just want to clarify what info we need to generate the correct incremental backup. The Moodle server and client can't actually talk to each other. We're not including that level of sophistication at this point. So the student will need to go to a web page and enter the info. I'm guessing we just enter the SHA1 of the most recent backup they received. We can find this on the client's file system and print it for the user to copy and paste. The user then enters this into the server via a form.
Is my interpretation correct or is there more to it?
Your work is very good, so please don't take these comments as a criticism. I just like to be clear since your solution is very interesting.
This approach keeps the existing backup file format, though we need to make it "more stable" in the sense that no arbitrary bits should change. What we add is a kind of a "patch" file. Imagine you have a course which has seen 2 updates:
- CourseFoo-xxyyzz.zip # the initial one
- CourseFoo-yyzzxx.zip # first revision
- CourseFoo-zzxxyy.zip # second revision, current
at this stage you'll also have 2 "patch" files. For clients that fetched only the initial course, you have:
- CourseFoo-xxyyzz-to-zzxxyy.patch
This file is a special format -- and the Client Moodle can use it to patch the old file it has, and the result will be the "current" file, CourseFoo-zzxxyy.zip
And you'll also have the patch file for those that have grabbed the first revision:
- CourseFoo-yyzzxx-to-zzxxyy.patch
My hope was to simply include the changed files and create an xml file with the changed data
It's not that different. The problem is that creating an XML file "with the changed data" sounds simple, but is extremely complicated, as we need to rig Moodle internals to keep track of what's changed. Using a delta algorithm is way simpler and actually gives you far better compression.
And once you are doing that, you may as well also use a delta for the binary files.
A couple years ago this would have been "a nice but unworkable plan" but these days there are several very reliable and portable binary delta/patch implementations. So all the bits we need exist, are well understood and they are looking rather bored waiting for us to use them
Finally I just want to clarify (...) So the student will need to go to a web page and enter the info.
To make it easier, we can craft a URL for the student to click, the URL should include courseid, and SHA1 of what we have. The server side will then issue a redirect to the appropriate patch file to go from your what the client has (SHA1) to the latest version.
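Along these lines, the crafted link could look something like the sketch below (the endpoint path and parameter names are entirely made up for illustration; Moodle's real scripts would differ):

```python
def update_url(server: str, courseid: int, have_sha1: str) -> str:
    """Build the link a student clicks: the server reads courseid and
    the SHA1 the client already has, then redirects either to the
    matching patch file or to the full backup."""
    return f"{server}/backup/update.php?courseid={courseid}&have={have_sha1}"

print(update_url("https://moodle.example.org", 42, "aabbcc"))
# -> https://moodle.example.org/backup/update.php?courseid=42&have=aabbcc
```

The server-side logic is then a simple lookup: if a patch file "from aabbcc to current" exists, redirect to it; otherwise redirect to the full backup.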
If we have a bit of time... it'll be trivial to do this using curl directly from Moodle, and save the user a lot of clicks
So will we have to ship another tool in the Offline Moodle install to allow it to read the xdeltas. This would obviously create another application that we need to support. If so, would the Moodle community be happy to support it? I think you mentioned this concern earlier in this discussion.
This may add extra complexity but it may also reduce it since we've got less data to deal with. We may also find that the date timestamps we require are recorded for this set of data making the task easier.
This solution doesn't account for items that are hidden until a certain date that we didn't include in the original course backup but maybe there is a separate solution for this such as a specific set of methods that retrieves the items that have become 'live' during the period in question.
I think exporting only user data would work to a very limited extent.
As you note, the solution doesn't account for hidden items... or any other content changes. So if you have an update to your materials, you'll have to produce it "manually".
and the complexity of the task
It is not as complex as it seems. My plan above is long but simple. I've explained some of the SHA1-based techniques a bit because they aren't well known in the web development community (some moodlers know them already and yawned through my plan I'm sure), but frankly, this stuff is straightforward.
We've done a bit of a review of the plan, and I am fairly happy with the approach I've described. Don't think we managed to poke any holes in the core plan.
The work to make restore stable (implementing a uid scheme) is not trivial, and it will require walking a good part of the current backup code. The file format will only see a trivial change (backwards compatible).
We are curious about the use cases and the audience that the overall Moodle Client project is meant for at this stage. You have mentioned user data, like forum discussions, that we didn't expect from the original description, and made us think a bit about distribution of sensitive data.
- You mention distributing forum posts, which necessarily implies entries in the user table - we'll want to filter each of those rows, exporting only some of the fields. How Moodle internals will react to mdl_user rows missing their username and other important fields that are never displayed to the end user is a bit of a mystery.
- You may be wanting to "feed back" activity logs to the central server -- so that tutors can see what students have been reading. This can be spoofed, but is low risk as logs aren't directly used for grading (and if the user is knowledgeable enough to fake it, they can hopefully pass the course.) We would not use the standard restore facility however -- any data read from the user should go through a security-conscious importer, and the restore facility is not ideal for this.
- Would you expect students to make "local" forum posts (or other local activity like answering quizzes or completing SCORM activities) that are then exported to a "backup" file, and uploaded for restore on the server? Again, we would not use the standard restore facility for security reasons.
While it's not strictly "part of the brief", I worry that user data is a can of worms. As long as we stick to non-user-related content, life is a lot easier
Thanks to Matt Clarkson and Jonathan Newman for some excellent poking and prodding at my plans...
- The user data is one of our biggest concerns. Therefore we may not include it in the initial released version to give ourselves time to investigate the further implications. My initial vision is to avoid changing the user data unless necessary. Removal of the password is an obvious must, and possibly contact details if policy dictates. Our initial tests show that these changes have no impact on how the forum works since the data isn't key to displaying the posts; it's needed for logging on and other processes. Similarly, changing the user's displayed name to xxxxxxx doesn't change their id. So the forum posts may become confusing because their owners can't be identified, but they'll still work as normal. Again this may mean we don't include the forums until we've figured out a better solution.
- Feedback from the Offline Moodles is expected in time. Thankfully we don't have to provide that at this point. We are simply distributing the content, not consuming anything at the server end. So any solution must allow for this expansion but it doesn't need to be provided at this point.
- Again in time we'd like Offline Students to write offline posts which are uploaded to the server but for now content will only go one way. From the server to the client.
- Initial course content
- Latest content updates
- Initial User data
- Latest user data updates.
We're committed to releasing our first Offline Moodle solution around July. Therefore we're likely to cut down a lot of the requirements to achieve this. Hence removing support for modules with user content, such as forums, is an obvious choice.
So standard content is the minimum requirement for the initial release. Anything else is a bonus. However, in time our plan is to include these extra modules and the data they contain and require, so we have to make sure there is still potential for this.
Now it's not essential that all this extra data is included in these incremental backups. We could use another process and merge it later on. Or have a series of different types of patches we apply to eventually provide the right data. Just like animators on films like Shrek slowly add layer upon layer of different types of information to eventually produce the characters. All this would be part of the maturation of this feature and work would go on as required to keep the process tidy and efficient.
So we don't want to get bogged down in all this extra detail right now. And you're correct to focus on backing up the core content first.
I bet I'm not clear on a bunch of points, so this is very much a bit of show and tell. The diags were made in Visio and I can post the source files if needed. Just thought I'd see if I'm understanding it right so far.
There are a bunch of assumptions made so the general concept is what I'm interested in.
The first diag is below. It shows how the incremental backups are generated during the nightly cron job
In short, I think the pics are pretty close to what I was planning, but I don't have Visio so if I manage to find 5 minutes of time I'll scribble some notes over them with gimp
Looking at the graphs - I think they are pretty good, but some things are unclear/ambiguous because it is really confusing to label the xdelta files with just their creation date. It doesn't matter when they were created - what matters is the "from" and "to", together.
In the first stage of the "generation" graph, the single xdelta file listed there would be 01022008:02022008.xdelta (if we read the name as "from:to.xdelta", with the datestamps as DDMMYYYY).
So in the same graph, in the 3rd stage, the xdelta created could be called 02022008:03022008.xdelta . Perhaps it's too much detail for the graph, but in practice, during that same delta generation, I'd generate an extra delta file for users that have skipped a day - 01022008:03022008.xdelta. And I would get rid of the stale 01022008:02022008.xdelta file - no sane client will ask for an xdelta for a day in the past.
So the last box would have 2 backup files + 2 xdeltas but one of the xdelta files would be different from what you have there.
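The naming and pruning rule described above can be sketched as follows (`plan_deltas` is a hypothetical helper, with dates as DDMMYYYY strings standing in for the real identifiers):

```python
def plan_deltas(published: list[str], keep_last: int = 2) -> dict:
    """Given the published versions in order (oldest first), work out
    which from:to delta files should exist after the newest publish.
    Only deltas ending at the current version are worth keeping."""
    current = published[-1]
    predecessors = published[:-1][-keep_last:]
    wanted = [f"{old}:{current}.xdelta" for old in predecessors]
    # Any delta that does not end at the current version is stale:
    # no sane client will ask for an xdelta to a day in the past.
    stale = [
        f"{a}:{b}.xdelta"
        for i, a in enumerate(published[:-1])
        for b in published[i + 1:-1]
    ]
    return {"generate": wanted, "delete": stale}
```

Run on the example above, publishing 03022008 would generate 01022008:03022008.xdelta and 02022008:03022008.xdelta, and delete the stale 01022008:02022008.xdelta.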
Looking at the "restoring" graph, we won't need to combine or merge xdelta files or anything. There's always a single xdelta file that covers your needs.
(Note here that when we talk about .xdelta files we are simplifying a few things. The "native" xdelta format can handle only a single file "diff" (unlike a unified diff, where we can list changes to many files). So we'll probably "wrap" the xdelta format a little bit, probably similar to what diff does. "Our" xdelta format will be a single file that contains "patches" to many files.)
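One possible shape for such a wrapper (this structure is entirely made up here as an illustration: a manifest in the spirit of a unified diff, carrying one binary xdelta per changed file, plus add/delete instructions):

```python
# A hypothetical multi-file "patch set" container. The from/to tree
# identifiers pin down exactly which unpacked backup it applies to
# and which one it produces; the byte payloads are elided.
patch_set = {
    "from_tree": "aabbcc",  # tree id of the base backup
    "to_tree": "ddeeff",    # tree id of the result
    "entries": [
        {"path": "moodle.xml", "op": "patch", "xdelta": b"..."},
        {"path": "course_files/new.pdf", "op": "add", "data": b"..."},
        {"path": "course_files/old.pdf", "op": "delete"},
    ],
}
```

A client walks the entries in order: patch in place, write new files, delete removed ones, then verifies the resulting tree identifier matches `to_tree`.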
Trying to compute it and I have a few questions:
Am I right in thinking you are envisaging something like (simply put) a daily 'published' backup file, like we have now, then (perhaps) an hourly delta. So:
But we might only store the deltas for 3 'published backups'? In which case, if we lose connectivity and are in a mid-delta backup state (e.g. Monday-02delta) where we no longer keep deltas, how do we get to the next published backup? Just examine each file's SHA1 and work it out manually along the tree?
How would this actually work with database data? If we change a resource name, do we add to the backup xml:
* Resource name = foo
* Change resource name foo -> bar
* Change resource name bar -> baz
(How do we apply these changes to the database?)
Also, is there an advantage to storing the database content separately from the files (which I guess change very little, in comparison to the constantly changing DB) so that only the database data has to be unpacked?
I've gotten confused with your example of deltas. To clarify - on the server...
On Monday we have
On tuesday, we have
...and we've deleted monday-to-tuesday-delta.zip as there's no point to upgrade to a stale revision.
You say "if we lose connectivity" -- I guess you mean the client losing connectivity while downloading the delta file, in which case the process aborts before any attempt at applying the delta happens.
After a successful update, the client should not keep deltas. A client that updates on monday, tuesday and wednesday should only keep one "backup" zipfile or unpacked directory where all the "patches" are applied. So it will be identical (when unpacked) to wednesday.zip (zipfiles aren't stable, so the zipfile itself may not match bit-by-bit).
Also it seems that the set of xdeltas must be reproduced every night and again this would take exponentially longer as the course remains live. If this happens on 1000 courses surely it's going to break the server.
I'm wondering why it's not possible to provide just one xdelta for each day that contains the diffs of the previous day, e.g.
It also means that we may be able to download the xdeltas one at a time so that if the connection breaks we'll still have some xdeltas and can run some updates. So the next time they update they'll have less to download. I know some people have limits on what they can download per session so this would help them get round that.
I know this means using the xdeltas in a different way but I think it provides a more flexible solution and I haven't heard an explanation for why it can't be done.
The "safer" part is a somewhat-irrational thing on my part - the xdelta application scheme checks the md5 hash of the resulting file, so we know 100% that we have the expected file.
Still, a chain of xdelta patches *is* more error prone than applying a single xdelta patch - but we know this machinery *will* catch the error if it happens
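That safety net for chained application could look roughly like this (illustrative Python; a real implementation would shell out to xdelta, and `apply_chain` with its callable steps is invented for the sketch):

```python
import hashlib

def apply_chain(base: bytes, steps: list) -> bytes:
    """Apply a chain of per-day patches. Each step carries the declared
    SHA1 of its result, so any corruption in the chain is caught at the
    step where it happens rather than silently propagating."""
    current = base
    for apply_step, expected_sha1 in steps:
        current = apply_step(current)
        if hashlib.sha1(current).hexdigest() != expected_sha1:
            raise ValueError("chain broke: unexpected intermediate result")
    return current
```

So a client that walks monday -> tuesday -> wednesday one delta at a time ends up with exactly the same verified tree as one that applied a single monday -> wednesday delta.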
Cool. One nitpick though
server load when creating these deltas along with download size is as important as network efficiency
They are not equal - network efficiency is more important.
In this case, server load for the delta creation is low -- but even if it wasn't... it is still done on a schedule, and we can throw hardware at it. And if it takes 30s instead of 3s it only needs one user to be patient: the editor that is pushing the "publish update" button. And if it was really slow we'd schedule it to run on a separate machine.
Network efficiency (as in: "having smaller files") affects every end user, and every download, and we can't workaround it if we end up with huge files.
The scheme I outlined is good in both aspects (I think!), and there are always tradeoffs. But having a central server do a bit more work to benefit thousands of clients is a naturally good tradeoff in my book.
The other way of looking at it is either:
- There are other diff algorithms we can switch to with little impact on the overall solution
- Once we get a working solution we won't need any upgrades or further support
- or upgrades aren't that difficult, all you need is a half-decent developer
I also found out that it's a diff algorithm that 'efficiently supports files as large as 2^64 bytes on all platforms, making xdelta3 suitable for use with large backup files'. So I can see why it's the best fit.
It also runs on Unix, Linux, BSD, and Microsoft Windows-based systems so fits the Moodle ethos.
Sorry to be such a beginner on this but I just felt the need for a little background research and thought it doesn't hurt to post it here for others who aren't so familiar with the tool.
I knew a bit about xdelta and other delta algorithms before starting to look at this project, and did a bit of extra research -- xdelta is by far the best thing we can base this on, IMHO
git (that oddly named scm) uses xdelta internally, and when I mentioned that I was planning on writing a wrapper to xdelta, they've immediately offered to help. They propose a cut-down version of the git "diff" and "apply" tools. I'm not 100% convinced at the moment, but it's an alternative.