Lessons learnt from the crash

by Visvanath Ratnaweera -
Number of replies: 8
What have we learnt from the memory lapse moodle.org had?
In reply to Visvanath Ratnaweera

Re: Lessons learnt from the crash

by Matt Bury -

"memory lapse"

There was a memory lapse?!!

In reply to Matt Bury

Re: Lessons learnt from the crash

by Visvanath Ratnaweera -
Of course it is a shock for the person (or creature) involved. But what the poor thing often fails to realize is that it has been given a new lease of life, to the tune of 24 days in this case!
In reply to Visvanath Ratnaweera

Re: Lessons learnt from the crash

by Derek Chirnside -

I think we need a better communications system to manage things around an event like this, before, during and afterwards.  

I originally thought Twitter could help with keeping things up to date.  I think the length of time that Moodle.org was down in "read only" mode shows we do need a plan B.  Maybe setting up a short term forum for immediate questions and requests for help, even if it was not part of the main forums and disappeared afterwards.  There are ways of harvesting data and information that has a life longer than just the day it was posted.

-Derek

PS, as of now I have stopped receiving Moodle.org notifications again, for at least the second time in a few days.  Does anyone know if there is a fix for this?  

And in Chrome, the new editor is truncated on my screen as I type unless I reduce the font size in my browser.  I guess with 2.6 out it's time to do some more crash testing.

 

In reply to Visvanath Ratnaweera

Re: Lessons learnt from the crash

by Matt Spurrier -

Hi All,

As you're no doubt aware, moodle.org suffered an unfortunate outage and subsequent data loss last week. I take some responsibility for the incident and would like to apologise and reassure the community of our absolute commitment to providing a stable, reliable discussion and resource platform for Moodlers everywhere. To that end, I’d like to explain briefly what happened and detail the measures we have taken to prevent similar issues from occurring in the future.

Cause of the incident

The cause of the outage itself has been traced to an operating system package upgrade that had a detrimental effect on the database running moodle.org. This resulted in the database being stopped unsafely and in subsequent corruption of the underlying database structure.

Attempts to resolve the issue using vendor-provided tools failed, so we instead tried to restore from our hourly backups, which would have minimised data loss.

Unfortunately, the hourly backups were unavailable because our off-site backup storage had run out of capacity. This went unnoticed because no notifications were in place to prompt us to grow the capacity as required. Moodle.org was then restored from the latest complete backup available (19th October), and the site was placed into read-only mode while we worked on restoring the lost data. We've experienced further delays in doing so; as soon as we manage to retrieve the data successfully, we will reintegrate it into the live site in an archived format.
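
As an illustration only, the two failure signals described above (backup storage filling up, and hourly backups silently no longer arriving) could be checked by something like the following minimal Python sketch. It is not the script used for moodle.org; the function name `backup_volume_warnings` and the thresholds are hypothetical, and it assumes the backups are plain files in a single directory.

```python
import os
import shutil
import time


def backup_volume_warnings(path, min_free_fraction=0.2, max_age_hours=2.0, now=None):
    """Return a list of warning strings for the backup directory at `path`.

    Hypothetical helper: the thresholds are illustrative defaults, not
    values taken from the moodle.org setup.
    """
    warnings = []
    now = time.time() if now is None else now

    # Capacity check: warn when the free share of the volume drops too low.
    usage = shutil.disk_usage(path)
    if usage.total > 0:
        free_fraction = usage.free / usage.total
        if free_fraction < min_free_fraction:
            warnings.append(f"low space: only {free_fraction:.0%} free on {path}")

    # Freshness check: warn when the newest backup file is older than expected.
    with os.scandir(path) as entries:
        mtimes = [entry.stat().st_mtime for entry in entries if entry.is_file()]
    if not mtimes:
        warnings.append(f"no backup files found in {path}")
    else:
        age_hours = (now - max(mtimes)) / 3600
        if age_hours > max_age_hours:
            warnings.append(f"stale backup: newest file is {age_hours:.1f} hours old")

    return warnings
```

Run from a scheduler and wired to email or another alert channel, a non-empty result would be the notification that was missing here. Checking the age of the newest backup as well as the free space matters because a backup job can stop producing files for reasons other than a full disk.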

Moving forward

In response to the recent issues, considerable changes to our infrastructure have been made.

These changes include modifying our backup scripts, implementing a new database backup script, migrating to a simpler infrastructure model that no longer uses a cluster environment, re-tuning the web and database servers, upgrading to PHP 5.5 and Apache 2.4, and migrating to an in-house backup solution.

What we've learnt

Keep things simple, control as much of your environment as possible, never allow automatic software upgrades, and there is such a thing as too much coffee.

As part of continually improving and enhancing our infrastructure, moodle.org has been upgraded to the latest 2.6 beta release, which comes with a host of awesome new features, bug fixes, and significant performance enhancements that I am sure you will all enjoy.

Lastly, I’d like to thank the Moodle community for your patience and understanding as we work to continually improve our infrastructure to support the community needs.

Although our hosting provider has apologised for the complications caused by their system issues, this does not go any way towards restoring the data lost as a result of these issues.
We will be asking them to investigate the problems with their snapshot system in greater detail.

Best Wishes.

Matt

Moodle Systems Administrator

Average of ratings: Cool (7)
In reply to Matt Spurrier

Re: Lessons learnt from the crash

by Guillermo Madero -

Hi Matthew,

Thanks for the description of how everything happened.

I definitely agree with "Keep things simple, control as much of your environment as possible, never allow automatic software upgrades", but not so sure about the "too much coffee" :)

Cheers and keep up the good work!

In reply to Matt Spurrier

Re: Lessons learnt from the crash

by Matt Bury -

Thanks for sharing some of the details and what you've learned. Very helpful indeed.

In reply to Matt Spurrier

Re: Lessons learnt from the crash

by Frankie Kam -

Hi Matthew

Glad to hear a detailed report on what happened. Meltdown, Defcon2. Server crash. Many thanks for getting things back on track. I suppose I had something to do with Moodle.org "running out of capacity". What with all my posts on the quiz and theme forums these past few years, and more intensely, over the last three months. I've also learnt something. I've learnt to let go *sniff* of those posts now gone to Moodle heaven. I've learnt also "not to sweat the small stuff". Finally, great that Moodle.org's been upgraded to Moodle 2.6 and that even better things lie now and in the future. To infinity and beyond, eh?

Regards
Frankie Kam

In reply to Visvanath Ratnaweera

Re: Lessons learnt from the crash

by Derek Chirnside -

I thought it was about time we were due for another Moodle.org oops.  We have some great new processes in place.  But when/if it goes down, what is the plan to tell us what is going on?  Email Helen?  Twitter?  Chinese whispers?  LinkedIn?

Is there a specific plan?

-Derek