Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
Number of replies: 28
We're thinking of enhancing the backup module to allow incremental backups. We need this to support the Offline Moodle project we're working on: we're using backup to get data in and out of Moodle, and we want to reduce the size of the data downloaded.

I'm used to this functionality from SQL Server and feel the community would benefit from this feature, but thought I'd ask.

I've looked at the Moodle DB to see if it has the data to generate an incremental backup, and I don't feel it does at the moment. So what should we do to support this?

My gut feeling is that the timemodified field already used in many tables (such as quiz, resource, groups and glossary) should be used in more, if not all, tables. I haven't audited how many tables actually contain this field, but plenty seem to.

But is this a good solution? Does anyone have any better ideas?
In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Martín Langhoff -

Hi Colin - I am right now distracted with WebDAV (see http://moodle.org/mod/forum/discuss.php?d=89123 ) but I am drafting some design notes on the side for this.

My thinking at the moment is based on avoiding dependence on metadata such as timecreated/timemodified -- it is a huge undertaking to add those fields to everything, and impossible to guarantee that all the modification paths update them.

Instead, there are some techniques we can use that are based on the identity of the content itself - similar techniques to what Git uses internally, aggressively hashing everything. The trick will be in finding the right granularity at which to break down the export/import and apply the hashing. At that point, we can design something that smartly decides how far behind each client is, and gives them a zipfile that contains exactly what they need. No more, no less.

Just need to find time to flesh out the design a bit wink

In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
Thanks Martin. Much appreciated. I thought it might be a lot of hassle to add extra fields. We always try to avoid that if possible.

One question, given your idea: what parameter(s) would need to be passed in to work out the diff? Would dates and times work, or are you expecting something else?
In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Martín Langhoff -
Of course my plan covers doing things end-to-end, so if backup can produce "incrementals", then restore will know what to do with them.

It ain't much of a plan without both parts wink

What we'd need to work out the diff is the identities of the objects the client already has. I know - it sounds like kooky talk. Let me flesh out my plan -- or cheat and read a bit about Git's internal storage & network protocol to see where I'm borrowing ideas from wink

Note though - we won't be able to mimic that protocol (it's quite involved anyway) but if we build something simple and web-based using similar strategies, I think we'll have a good plan.
In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
Just to reiterate what I mentioned by email: would the restore process accept backups made by an incremental process, or would it require further work to make this possible?
In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Martín Langhoff -

Incremental Backup & Restore plan

Ended up sitting down and having a serious think about the design for this over the weekend. These are my notes -- nothing definitive, but I think I am mapping out the problem.

I started working out how an identity-based approach would work -- thinking of "identity" in the sense of the identity of the content, for example, as shown by the SHA1 of a file. You know with certainty that any change to the file will change its SHA1 hash (let's leave aside for a moment perceived or real hash-collision issues); therefore, as long as the file hasn't changed at all, you will get the exact same SHA1. There are lots of good things that you can do with that -- but the main requirement for this to work is that the output (from restore) is stable. If the files you are hashing are unstable (say, an embedded timestamp every time you save them, or platform-dependent newline issues) then the hash will change even though the content actually hasn't.
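A throwaway PHP illustration of that point (the XML here is made up, only the hashes matter):

```php
<?php
// Illustration only: identical bytes always give the identical SHA1,
// but an embedded export timestamp destroys that stability.
$a = '<moodle_backup><course id="7"/></moodle_backup>';
$b = '<moodle_backup><course id="7"/></moodle_backup>';
var_dump(sha1($a) === sha1($b));  // bool(true)  - unchanged content, same identity

$c = '<moodle_backup exported="' . date('c') . '"><course id="7"/></moodle_backup>';
var_dump(sha1($a) === sha1($c));  // bool(false) - "same" content, different identity
```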

I initially thought we would have to work at the module or resource level -- but once I started fleshing out the plan I realised that if we can guarantee a stable output from backup, we can just work with full backups without changing the format much. The plan's gotten much simpler -- this writeup is long, but the plan is extremely straightforward wink

So the outline looks roughly like this:

  • On the server, provide a basic workflow that goes from the "generate a backup file" to "publish the backup file".
    • Every published backup file is kept for as long as the course is relevant, even if superseded. Internally, it will have a unique name: its SHA1.
    • If the published backup file supersedes older backup files, we generate static "incrementals" against its last N predecessors (or maybe against all). These incrementals can be thought of as "diffs" or "patches" and are identified internally as "from SHA1 to SHA1".
    • Other aspects of publishing also take place - but are out of scope for this.
  • Backup generation MUST have stable output. If content has not changed, then the backup must be identical even if other things may have changed (see the normalisation sketch after this outline) -
    • Ordering of the objects in the XML file must be stable
    • newlines and whitespace in the XML file must be stable
    • any export-time timestamps, or export-operation metadata (user running the export, hostname, tz) must be removed.
    • Might need a stable uid for stable restore purposes (see restore below)
  • Restore must be stable - it must be clear to restore code how to determine if a resource being restored maps to an existing or previously seen resource. This might need a uid scheme similar to Midgard/Repligard. Scenarios:
    • New resource
    • Existing resource - unchanged locally
    • Existing resource - changed locally
    • Seen resource - deleted locally
  • With backup and restore being stable, a simple compact diff/xdelta of the (unpacked) directory tree from the backup files will be efficient. The diff/xdelta does not need to be reversible, and can declare strictly the SHA1 of the file it must be applied to. This means that we can produce extremely efficient deltas by not including context or "content to be replaced". One-way xdeltas can just declare byte ranges where the new content is applied.
  • Review possible xdelta algorithms that we can ship as a small binary for Win32/OSX/Linux or that are simple enough that we can implement in PHP quickly.
    • The xdelta utility on linux/osx is very good for this - and probably has a Win32 port
    • Together with the xdelta generation it is possibly a good idea to have a PHP-based script that can do an XML-aware compare and output the UIDs of objects that seem to have changed. This can make the restore faster / cheaper. It can run on the server or the client, but it makes sense to supply it as accessory metadata with the "patch" file. The client is free to ignore it (and load the whole restore file regardless) or regenerate it locally.
  • The client always keeps the downloaded backup files, and knows their SHA1 identity. So when it "browses" for course updates and finds an update for course X, it reads the "new" SHA1, checks the SHA1 of the last version it downloaded, and fetches the appropriate patch file. If that file does not exist (maybe this client has fallen behind, and we only keep deltas against recent versions) then it will fetch the full backup file.
    • If it has fetched a patchfile, the client makes a copy of the unpacked restore tree, and applies the xdelta.
    • If it has fetched a full zipfile, it will unpack it and run the XML-aware "UIDs changed" script mentioned above.
    • Restore is invoked to execute over the unpacked directory, and can take advantage of the "UIDs changed" hints file to avoid restoring unchanged objects.
    • The latest unpacked directory is kept to serve as the base of the next update. If the operation has been successful, earlier versions can be deleted.
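
To make the "stable output" requirement concrete, here's the kind of normalisation pass I have in mind -- a very rough sketch, not real backup code, and the element names are invented:

```php
<?php
// Sketch only: strip export-time metadata and normalise whitespace so that
// an unchanged course produces byte-identical XML on every run.
// The element names below are hypothetical, not the real backup format.
function stabilise_backup_xml($xml) {
    // drop export-operation metadata that changes on every run
    $xml = preg_replace('#<(export_date|export_host|export_user)>.*?</\1>#s', '', $xml);
    // normalise platform-dependent newlines
    $xml = str_replace(array("\r\n", "\r"), "\n", $xml);
    // strip trailing whitespace on each line
    $xml = preg_replace('/[ \t]+$/m', '', $xml);
    return $xml;
}
```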

One thing to note is that I don't think that zip generation is stable -- IIRC it embeds a timestamp, and perhaps hostname too. If that's the case, we'll want to compute the SHA1 of the whole backup using a more stable approach, and ignore the literal bits in the zipfile itself. Actually - different compression settings will radically change the bits, so yes we should make the backup id something more similar to a git "tree" identifier.

What is a tree? And how do we compute a tree identifier? It is a directory tree (with however many subdirectories and files it has). To build a unique identifier for the tree, we do the following (a short sketch in PHP follows the list):

  • Compute the SHA1 of each file
  • Build a manifest with the internal path and the SHA1, like
    • moodle.xml
    • course_files/Media_examples/Test.ppt
  • And then we compute the sha1 of the manifest. That SHA1 is the identity of the "tree"
  • Note that we only pay attention to files - empty directories aren't interesting.
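
In PHP, that recipe is roughly this (a sketch, not production code):

```php
<?php
// Sketch of the tree identifier described above: hash every file, build a
// sorted manifest of "sha1  relative/path" lines, then hash the manifest.
function tree_identifier($dir) {
    $lines = array();
    $iter = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($dir));
    foreach ($iter as $file) {
        if (!$file->isFile()) {
            continue;                  // empty directories are not interesting
        }
        $relpath = str_replace('\\', '/', substr($file->getPathname(), strlen($dir) + 1));
        $lines[] = sha1_file($file->getPathname()) . '  ' . $relpath;
    }
    sort($lines, SORT_STRING);         // stable ordering, independent of filesystem order
    return sha1(implode("\n", $lines) . "\n");   // the identity of the whole "tree"
}

// tree_identifier('/path/to/unpacked/backup') -> 40 hex chars identifying the tree
```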

This plan uses some hashing tricks, but the core idea is to publish relatively dumb xdelta patches. This will work great but has some limitations - most notably, if files in course files are moved or renamed, it won't do anything particularly smart. For a rename of X to Y there will be an instruction to delete file X, followed by the whole contents of Y. And while we can address that basic case (by doing a "smart compare" of the tree manifests looking for identical hashes), the not-so-uncommon case of move + small change is extremely hard to handle.

We could address that by using the git-apply-patch utility, which handles this gracefully. I need to think more about this angle - there are a few issues with it

  • I want to avoid funny dependencies, like having to ship executables for windows in the client installer. The Win32 and OSX versions of git are excellent, but it's one more thing to support. And while we can probably implement xdelta and related tricks in pure PHP, git-apply-patch is way too complex.
  • AFAIK, the git machinery produces reversible xdeltas -- which means that while we'll save bandwidth on move operations, we will waste it on every other change, as it'll include the "old" data that the patch replaces as well as the new data. Using a plain unidirectional xdelta, a naive move detector on the tree manifest (see the sketch below) and training course content maintainers to avoid moving large resources is probably a better option.

Edit: There's a nice win32 statically compiled xdelta.exe here http://evanjones.ca/software/xdelta-win32.html
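
For the record, the "naive move detector" mentioned above could be as dumb as a compare of the two tree manifests - sketch only, and the path => sha1 array representation is assumed, not an existing API:

```php
<?php
// $old and $new are tree manifests as arrays of path => sha1.
// Returns files whose content is unchanged but whose path has moved.
function detect_moves($old, $new) {
    $moves = array();
    $oldbyhash = array_flip($old);               // sha1 => old path (duplicate hashes collapse)
    foreach ($new as $newpath => $hash) {
        if (isset($old[$newpath]) && $old[$newpath] === $hash) {
            continue;                            // same path, same content - nothing to do
        }
        if (isset($oldbyhash[$hash]) && !isset($new[$oldbyhash[$hash]])) {
            // identical content under a path that has disappeared: call it a move
            $moves[$oldbyhash[$hash]] = $newpath;
        }
    }
    return $moves;                               // old path => new path
}

// detect_moves(array('files/Test.ppt' => 'aaa...'),
//              array('files/Media/Test.ppt' => 'aaa...'))
//   => array('files/Test.ppt' => 'files/Media/Test.ppt')
```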

In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
Yes, I agree it's obvious that the restore has to work but it's also important to specify this to ensure we include it in our thinking. Assuming too many things often leads to big misunderstandings.

Wow, thanks Martin you've clearly put a lot of thought into this. A couple of things aren't clear to me though so just thought I'd ask.

Firstly it sounds like this approach will create its own specific backup file type rather than the standard zip containing XML and user files.

Secondly, if that's correct, you hint at the need for a client tool to read the xdeltas. This implies that the backup file no longer follows the Moodle standard and thus we'll have two types of backup to support, plus a client tool. Is that correct or a misunderstanding?

My hope was to simply include the changed files and create an XML file with the changed data, thus leaving the XML format the same and consequently the restore process the same, or very similar at the least.

Finally I just want to clarify what info we need to generate the correct incremental backup. The Moodle server and client can't actually talk to each other - we're not including that level of sophistication at this point. So the student will need to go to a web page and enter the info. I'm guessing we just enter the SHA1 of the most recent backup they received. We can find this on the client's file system and print it for the user to copy and paste. The user then enters this into the server via a form.

Is my interpretation correct or is there more to it?

Your work is very good, so please don't take these comments as a criticism. I just like to be clear since your solution is very interesting.
In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Martín Langhoff -

This approach keeps the existing backup file format, though we need to make it "more stable" in the sense that no arbitrary bits should change. What we add is a kind of a "patch" file. Imagine you have a course which has seen 2 updates:

  • CourseFoo-xxyyzz.zip # the initial one
  • CourseFoo-yyzzxx.zip # first revision
  • CourseFoo-zzxxyy.zip # second revision, current

At this stage you'll also have 2 "patch" files. For clients that fetched only the initial course, you have:

  • CourseFoo-update-xxyyzz:zzxxyy.zip

This file is a special format -- and the Client Moodle can use it to patch the old file it has, and the result will be the "current" file, CourseFoo-zzxxyy.zip

And you'll also have the patch file for those that have grabbed the first revision:

  • CourseFoo-update-yyzzxx:zzxxyy.zip

My hope was to simply include the changed files and create an xml file with the changed data

It's not that different wink The problem is that creating an XML file "with the changed data" sounds simple, but is extremely complicated as we need to rig Moodle internals to keep track of what's changed. Using a delta algorithm is way simpler and actually gives you far better compression.

And once you are doing that, you may as well also use a delta for the binary files.

A couple years ago this would have been "a nice but unworkable plan" but these days there are several very reliable and portable binary delta/patch implementations. So all the bits we need exist, are well understood and they are looking rather bored waiting for us to use them wink

Finally I just want to clarify (...) So the student will need to go to a web page and enter the info.

To make it easier, we can craft a URL for the student to click; the URL should include the courseid and the SHA1 of what the client has. The server side will then issue a redirect to the appropriate patch file to go from what the client has (that SHA1) to the latest version.

If we have a bit of time... it'll be trivial to do this using curl directly from Moodle, and save the user a lot of clicks wink
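
The server side of that redirect could look roughly like this - parameter names and file layout are invented for the example:

```php
<?php
// Sketch only. The URL the student clicks carries the course id and the SHA1
// of the backup they already have; the server redirects to the matching patch,
// or to the full backup if no patch exists for that SHA1 (e.g. the client has
// fallen too far behind).
$courseid   = isset($_GET['courseid']) ? (int)$_GET['courseid'] : 0;
$clientsha1 = isset($_GET['sha1']) ? strtolower(preg_replace('/[^A-Fa-f0-9]/', '', $_GET['sha1'])) : '';

$publishdir = "/moodledata/published/$courseid";               // hypothetical location
$latest     = trim(file_get_contents("$publishdir/LATEST"));   // SHA1 of the current backup

if ($clientsha1 === $latest) {
    header('HTTP/1.1 304 Not Modified');                       // nothing new to download
} else if (is_file("$publishdir/update-$clientsha1-$latest.zip")) {
    header("Location: /published/$courseid/update-$clientsha1-$latest.zip");
} else {
    header("Location: /published/$courseid/backup-$latest.zip");
}
```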

In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
Thanks for the extra detail Martin. I feel it is becoming clearer. I really appreciate the thought you've put in, and I can see the complexity of creating an XML file that included the differences. We had already investigated this and it does have its own complexities.

So will we have to ship another tool in the Offline Moodle install to allow it to read the xdeltas? This would obviously create another application that we need to support. If so, would the Moodle community be happy to support it? I think you mentioned this concern earlier in this discussion.
In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
Also, given our needs and the complexity of the task, we could consider another approach - just interested in your thoughts. On most of our courses the data that changes the most is that added by students. We could therefore create one backup of the course without student data, then provide updates which do include the student data.

This may add extra complexity but it may also reduce it since we've got less data to deal with. We may also find that the date timestamps we require are recorded for this set of data making the task easier.

This solution doesn't account for items that are hidden until a certain date and so weren't included in the original course backup, but maybe there is a separate solution for this, such as a specific set of methods that retrieves the items that have become 'live' during the period in question.
In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Martín Langhoff -

I think exporting only user data would work to a very limited extent.

As you note, the solution doesn't account for hidden items... or any other content changes. So if you have an update to your materials, you'll have to produce it "manually".

and the complexity of the task

It is not as complex as it seems. My plan above is long but simple. I've explained some of the SHA1-based techniques a bit because they aren't well known in the web development community (some moodlers know them already and yawned through my plan I'm sure), but frankly, this stuff is straightforward.

In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Martín Langhoff -

Colin,

we've done a bit of a review of the plan, and I am fairly happy with the approach I've described. I don't think we managed to poke any holes in the core plan.

The work to make restore stable (implementing a uid scheme) is not trivial, and it will require walking a good part of the current backup code. The file format will only see a trivial change (backwards compatible).

We are curious about the use cases and the audience that the overall Moodle Client project is meant for at this stage. You have mentioned user data, like forum discussions, which we didn't expect from the original description, and which made us think a bit about distribution of sensitive data.

  • You mention distributing forum posts, which necessarily implies entries in the user table - we'll want to filter each of those rows, exporting only some of the fields. How Moodle internals will react to mdl_user rows missing their username and other important fields that are never displayed to the end user is a bit of a mystery.
  • You may be wanting to "feed back" activity logs to the central server -- so that tutors can see what students have been reading. This can be spoofed, but it is low risk as logs aren't directly used for grading (and if the user is knowledgeable enough to fake it, they can hopefully pass the course). We would not use the standard restore facility, however -- any data read from the user should go through a security-conscious importer, and the restore facility is not ideal for this.
  • Would you expect students to make "local" forum posts (or other local activity like answering quizzes or completing SCORM activities) that are then exported to a "backup" file, and uploaded for restore on the server? Again, we would not use the standard restore facility for security reasons.

While it's not strictly "part of the brief", I worry that user data is a can of worms. As long as we stick to non-user-related content, life is a lot easier wink

Thanks to Matt Clarkson and Jonathan Newman for some excellent poking and prodding at my plans...

In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
Good questions. It gets interesting very quickly, doesn't it?
  • The user data is one of our biggest concerns. Therefore we may not include it in the initial released version, to give ourselves time to investigate further implications. My initial vision is to avoid changing the user data unless necessary. Removal of the password is an obvious must, and possibly contact details if policy dictates. Our initial tests show that these changes have no impact on how the forum works, since that data isn't key to displaying the posts; it's needed for logging on and other processes. Similarly, changing the user's displayed name to xxxxxxx doesn't change their id. So the forum posts may become confusing because their owners can't be identified, but they'll still work as normal. Again, this may mean we don't include the forums until we've figured out a better solution.
  • Feedback from the Offline Moodles is expected in time. Thankfully we don't have to provide that at this point. We are simply distributing the content, not consuming anything at the server end. So any solution must allow for this expansion, but it doesn't need to be provided at this point.
  • Again in time we'd like Offline Students to write offline posts which are uploaded to the server but for now content will only go one way. From the server to the client.
I totally agree with you that user data is a can of worms. So my plan is based on prioritising:
  1. Initial course content
  2. Latest content updates
  3. Initial user data
  4. Latest user data updates
The reason that I mention user data is that any solution must be aware of the possible future need to include user data.
In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
Just thought I'd try to make it a bit clearer how we're approaching this work.

We're committed to releasing our first Offline Moodle solution around July. Therefore we're likely to cut down a lot of the requirements to achieve this. Hence removing support for modules with user content, such as forums, is an obvious choice.

So standard content is the minimum requirement for the initial release. Anything else is a bonus. However, in time our plan is to include these extra modules and the data they contain and require, so we have to make sure there is still potential for this.

Now it's not essential that all this extra data is included in these incremental backups. We could use another process and merge it later on. Or have a series of different types of patches we apply to eventually provide the right data. Just like animators on films like Shrek slowly add layer upon layer of different types of information to eventually produce the characters. All this would be part of the maturation of this feature and work would go on as required to keep the process tidy and efficient.

So we don't want to get bogged down in all this extra detail right now. And you're correct to focus on backing up the core content first.
In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
I've read the discussion a few times and it's getting clearer. It's getting quite abstract and I like to have diagrams when it gets to this point, just to make sure we're on the same page. So I've created a few diags, firstly to check my understanding of the process and as a visual tool to refine and aid future discussions.

I bet I'm not clear on a bunch of points so this is very much a bit of show and tell. The diags were made in visio and I can post the source files if needed. Just thought I'd see if I'm understanding it right so far.

There are a bunch of assumptions made so the general concept is what I'm interested in.

The first diag is below. It shows how the incremental backups are generated during the nightly cron job
Attachment Offline_Moodle_-_incremental_backup_generation.gif
In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
This is a diag of how the xdeltas are retrieved by the user and restored to Offline Moodle
Attachment Offline_Moodle_-_restoring_an_incremental_backup.gif
In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
And here is a document that ties the diags together. It just explains a little about them. It's listed separately because, when combined with the diags, it goes over the upload file limit. It's all very rough. I'd just like to know if I'm understanding it correctly.
In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Martín Langhoff -
Hi Colin - great stuff! I'm in a connection-challenged part of the world, and my earlier reply got eaten mid-flight by a grue.

In short, I think the pics are pretty close to what I was planning, but I don't have Visio, so if I manage to find 5 minutes of time I'll scribble some notes over them with GIMP wink
In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Martín Langhoff -

Looking at the graphs - I think they are pretty good, but some things are unclear/ambiguous because it is really confusing to label the xdelta files with just their creation date. It doesn't matter when they were created - what matters is the "from" and "to", together.

In the first stage of the "generation" graph, the single xdelta file listed there would be 01022008:02022008.xdelta (if we read the name as "from:to.xdelta" with the datestamps as DDMMYYYY).

So in the same graph, in the 3rd stage, the xdelta created could be called 02022008:03022008.xdelta . Perhaps it's too much detail for the graph, but in practice, during that same delta generation, I'd generate an extra delta file for users that have skipped a day - 01022008:03022008.xdelta. And I would get rid of the stale 01022008:02022008.xdelta file - no sane client will ask for an xdelta for a day in the past.

So the last box would have 2 backup files + 2 xdeltas but one of the xdelta files would be different from what you have there.

Looking at the "restoring" graph, we won't need to combine or merge xdelta files or anything. There's always a single xdelta file that covers your needs.

(Note here that when we talk about .xdelta files we are simplifying a few things. The "native" xdelta format can handle only a single-file "diff" (unlike a unified diff, where we can list changes to many files). So we'll probably "wrap" the xdelta format a little bit, similar to what a unified diff does. "Our" xdelta format will be a single file that contains "patches" to many files.)
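
As a sketch, that wrapper could be as simple as a zip holding an index plus one xdelta per changed file - all names below are invented, and I'm assuming the xdelta3 command line ("xdelta3 -e -f -s OLD NEW OUT"):

```php
<?php
// Sketch of the "wrapped" patch format: one zip with an index plus one
// xdelta per changed file, built by shelling out to the xdelta3 binary.
function build_patchfile($changedpaths, $olddir, $newdir, $outzip) {
    $zip = new ZipArchive();
    $zip->open($outzip, ZipArchive::CREATE | ZipArchive::OVERWRITE);
    $index = array();
    foreach ($changedpaths as $relpath) {
        $delta = tempnam(sys_get_temp_dir(), 'xd');
        $cmd = sprintf('xdelta3 -e -f -s %s %s %s',
                       escapeshellarg("$olddir/$relpath"),
                       escapeshellarg("$newdir/$relpath"),
                       escapeshellarg($delta));
        exec($cmd, $output, $rc);
        if ($rc !== 0) {
            die("xdelta3 failed for $relpath");
        }
        $zip->addFile($delta, "deltas/$relpath.xdelta");
        // record what each delta applies to and what it must produce
        $index[] = $relpath . "\t" . sha1_file("$olddir/$relpath") . "\t" . sha1_file("$newdir/$relpath");
    }
    $zip->addFromString('index.txt', implode("\n", $index) . "\n");
    $zip->close();
}
```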

In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Dan Poltawski -
Wow. Impressive plan!
Trying to get my head around it, and I have a few questions:

Am I right in thinking you are envisaging something like (simply put) a daily 'published' backup file, like we have now, and then (perhaps) an hourly delta? So:

Monday.zip
Monday-01delta
Monday-02delta
....
Monday-23delta
Tuesday.zip
Tuesday-01delta
...
Tuesday-23delta
Wednesday.zip

But we might only store the deltas for 3 'published backups'? In which case, if we lose connectivity and are in a mid-delta backup state (e.g. Monday-02delta) where we no longer keep deltas, how do we get to the next published backup? Just examine each file's SHA1 and work it out manually along the tree?

How would this actually work with database data? If we change a resource name, do we add to the backup xml:
* Resource name = foo
* Change resource name foo -> bar
* Change resource name bar - > baz
??
(How do we apply these changes to the database?)

Also, is there an advantage to storing the database content separately to the files (which I guess change very little, in comparison to constantly changing DB) so that only the database data has to be unpacked?
In reply to Dan Poltawski

Re: Do we need an incremental backup, if so how best do we do it?

by Martín Langhoff -

I've gotten confused with your example of deltas. To clarify - on the server...

On Monday we have

  • monday.zip

On Tuesday, we have

  • monday.zip
  • tuesday.zip
  • monday-to-tuesday-delta.zip

On Wednesday...

  • monday.zip
  • tuesday.zip
  • wednesday.zip
  • monday-to-wednesday-delta.zip
  • tuesday-to-wednesday-delta.zip

...and we've deleted monday-to-tuesday-delta.zip, as there's no point in upgrading to a stale revision.

You say "if we lose connectivity" -- I guess you mean the client losing connectivity while downloading the delta file; in that case the process aborts before any attempt at applying the delta happens.

After a successful update, the client should not keep deltas. A client that updates on monday, tuesday and wednesday should only keep one "backup" zipfile or unpacked directory where all the "patches" are applied. So it will be identical (when unpacked) to wednesday.zip (zipfiles aren't stable, so the zipfile itself may not match bit-by-bit).
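
In rough PHP, "publishing" wednesday.zip on the server would then do something like this (file layout invented for the example; build_patchfile() is the wrapper sketched earlier in the thread):

```php
<?php
// Sketch: keep every published full backup, build a delta from each older one
// to the new one, and delete stale deltas that no longer end at the latest version.
function publish_backup($dir, $newsha1) {
    foreach (glob("$dir/backup-*.zip") as $oldzip) {
        $oldsha1 = substr(basename($oldzip, '.zip'), strlen('backup-'));
        if ($oldsha1 === $newsha1) {
            continue;
        }
        // unpack both trees and build the "from old to new" patch, e.g.
        // build_patchfile($changed, $oldtree, $newtree, "$dir/update-$oldsha1-$newsha1.zip");
    }
    // stale deltas point at a superseded target - no sane client wants them
    foreach (glob("$dir/update-*.zip") as $patch) {
        if (substr(basename($patch, '.zip'), -40) !== $newsha1) {
            unlink($patch);
        }
    }
    file_put_contents("$dir/LATEST", "$newsha1\n");
}
```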

In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
OK, I think I see how the patching works. Does the sequence continue like this, then?
  • monday.zip
  • tuesday.zip
  • wednesday.zip
  • thursday.zip
  • monday-to-thursday-delta.zip
  • tuesday-to-thursday-delta.zip
  • wednesday-to-thursday-delta.zip
It seems that the set of xdeltas will keep growing. My concern is that if the course is live for, say, 9 months there is a chance that the file storage required for 270 xdeltas (9 x 30 days) could be very large.

Also it seems that the full set of xdeltas must be reproduced every night, and this would take progressively longer the longer the course remains live. If this happens on 1000 courses, surely it's going to break the server.

I'm wondering why it's not possible to provide just one xdelta for each day that contains the diffs from the previous day, e.g.
  • monday.zip
  • monday-to-tuesday-delta.zip
  • tuesday-to-wednesday-delta.zip
  • wednesday-to-thursday-delta.zip
Then the restore process would just have to loop through each xdelta in turn, as though it's stepping through:
  • monday.zip
  • monday-to-tuesday-delta.zip
  • tuesday.zip
  • tuesday-to-wednesday-delta.zip
  • wednesday.zip
  • wednesday-to-thursday-delta.zip
Sure, this would put the load on the client, but then they've only got to restore the one course, and it'll only be a pain if they haven't updated for a long time.

It also means that we may be able to download the xdeltas one at a time so that if the connection breaks we'll still have some xdeltas and can run some updates. So the next time they update they'll have less to download. I know some people have limits on what they can download per session so this would help them get round that.

I know this means using the xdeltas in a different way but I think it provides a more flexible solution and I haven't heard an explanation for why it can't be done.


In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Martín Langhoff -
Your Thursday view is correct. In terms of the delta generation strategy, we can tune this later. I am not too attached to "delta everything to the latest", but it is more efficient network-wise (if you change a large file, and then revert it...), and it is also somewhat safer.

The "safer" part is a somewhat-irrational thing on my part - the xdelta application scheme checks the md5 hash of the resulting file, so we know 100% that we have the expected file.

Still, a chain of xdelta patches *is* more error prone than applying a single xdelta patch - but we know this machinery *will* catch the error if it happens wink
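
For what it's worth, applying a chain of patches on the client could look roughly like this - untested sketch with invented names; the per-step hash check plays the role of the check mentioned above, just using SHA1 to stay consistent with the rest of the plan:

```php
<?php
// Apply a chain of deltas, verifying the identity of each intermediate result
// so the machinery catches any error in the chain and stops.
function apply_patch_chain($file, $steps) {
    foreach ($steps as $step) {
        // $step = array('delta' => 'monday-to-tuesday.xdelta', 'expect' => '<sha1 of the result>')
        $out = $file . '.patched';
        $cmd = sprintf('xdelta3 -d -f -s %s %s %s',
                       escapeshellarg($file),
                       escapeshellarg($step['delta']),
                       escapeshellarg($out));
        exec($cmd, $output, $rc);
        if ($rc !== 0 || sha1_file($out) !== $step['expect']) {
            @unlink($out);
            die("patch {$step['delta']} did not produce the expected file - aborting");
        }
        rename($out, $file);   // this result becomes the base for the next step
    }
}
```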
In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
Ok, just wanted to put those concerns out there. Obviously server load when creating these deltas along with download size is as important as network efficiency. Just wanted to ensure we had scope to refine this and give you time to consider the points.
In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Martín Langhoff -

Cool. One nitpick though wink

server load when creating these deltas along with download size is as important as network efficiency

They are not equal - network efficiency is more important.

In this case, server load for the delta creation is low -- but even if it wasn't... it is still done on a schedule, and we can throw hardware at it. And if it takes 30s instead of 3s it only needs one user to be patient: the editor that is pushing the "publish update" button. And if it was really slow we'd schedule it to run on a separate machine.

Network efficiency (as in: "having smaller files") affects every end user and every download, and we can't work around it if we end up with huge files.

The scheme I outlined is good in both aspects (I think!), and there are always tradeoffs. But having a central server do a bit more work to benefit thousands of clients is a naturally good tradeoff in my book wide eyes

In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
Now I see your point. I misunderstood your meaning of network efficiency. If it means generating smaller xdeltas then I'm all for that. I mentioned download size at the same time for this reason. Thanks for clearing that up.
In reply to Martín Langhoff

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
Just been looking into xdeltas and one question seems important: is xdelta and its tooling going to be around for a long time, and will support for it also be high quality for a long time? Just asking this since our entire concept is based on xdelta.

The other way of looking at it is either:
  1. There are other diff algorithms we can switch to with little impact on the overall solution
  2. Once we get a working solution we won't need any upgrades or further support
  3. or upgrades aren't that difficult - all you need is a half-decent developer
I'm hoping the answer is option 1 and that we build our code with that in mind.
In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Colin Chambers -
Okay, so I just checked xdelta.org and a bit of Wikipedia - not perfect for research, but they'll have to do because I couldn't find much else. I found out that xdelta has been around since 1997, so almost 11 years, and follows RFC 3284, the specification for the VCDIFF generic delta encoding format.

I also found out that it's a diff algorithm that "efficiently supports files as large as 2^64 bytes on all platforms, making xdelta3 suitable for use with large backup files". So I can see why it's the best fit.

It also runs on Unix, Linux, BSD, and Microsoft Windows-based systems, so it fits the Moodle ethos.

Sorry to be such a beginner on this but I just felt the need for a little background research and thought it doesn't hurt to post it here for others who aren't so familiar with the tool.


In reply to Colin Chambers

Re: Do we need an incremental backup, if so how best do we do it?

by Martín Langhoff -
I think you've explained why xdelta is the best choice far better than I would have wink

I knew a bit about xdelta and other delta algorithms before starting to look at this project, and did a bit of extra research -- xdelta is by far the best thing we can base this on, IMHO wink

git (that oddly named SCM) uses xdelta internally, and when I mentioned that I was planning on writing a wrapper around xdelta, they immediately offered to help. They propose a cut-down version of the git "diff" and "apply" tools. I'm not 100% convinced at the moment, but it's an alternative.