We're running Moodle on AWS using the following setup:
- Auto Scaling group of t3.medium EC2 instances running both web and cron (Apache + PHP-FPM 7.3.x)
- t3.small Redis cache
- Aurora Serverless database
- EFS for uploads
We moved to this configuration in July 2020 and performance has been fine, but over the last week we've seen some odd behavior. For no discernible reason (as yet), and at random times (no more than once every 24-30 hours), one of the EC2 instances will show as unhealthy, with long response times. We can't correlate it with a memory or CPU spike. The node always recovers on its own within a few minutes; normally users don't even notice, though we've had some reports.
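For what it's worth, the spike check is just pulling CloudWatch metrics around each incident window, roughly along the lines below; the instance ID and times are placeholders.

```
# CPU around the incident window for the affected instance
# (memory isn't a stock EC2 metric, so that side needs the CloudWatch agent or host-level tools)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=<instance-id> \
  --start-time <incident-start> --end-time <incident-end> \
  --period 300 --statistics Average Maximum
```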
One thing that does tend to happen is that the Annotate PDF job fails uncleanly, leaving its lock in place. We're not sure whether this is a symptom or a cause: whether the job triggers the unresponsive behavior as it fails, or the job fails because the node has become unhealthy. Each time, we verify that the scheduled task is not actually running and then manually clear the lock.
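For reference, the manual cleanup each time is roughly the sketch below; the lock query assumes the default DB lock factory and an mdl_ table prefix, so treat it as illustrative rather than exact.

```
# On each instance, confirm no cron worker is still executing the task
ps aux | grep '[c]ron.php'
ps aux | grep '[s]chedule_task.php'

# Then, from a client connected to the Moodle database, inspect and release the
# stale cron lock (endpoint, user, and database name are placeholders)
mysql -h <aurora-endpoint> -u <dbuser> -p <dbname> -e "
  SELECT id, resourcekey, owner, expires FROM mdl_lock_db
   WHERE resourcekey LIKE '%assignfeedback_editpdf%';
  UPDATE mdl_lock_db SET owner = NULL
   WHERE resourcekey LIKE '%assignfeedback_editpdf%';"
```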
We're running Moodle 3.7.7 and using the Google Drive converter for the PDF conversions. We tried scaling up the EC2 instances and the Redis cache, but the failure recurred, so it doesn't appear to be a resource-sizing issue. The failures can happen at any time of day and aren't tied to heavy usage. We've set up logging for cron to try to catch the next failure, although we're not even sure cron is the problem.
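The cron logging is just redirecting the CLI cron output to a file, roughly along these lines (path, PHP binary, and cron user are illustrative):

```
# /etc/cron.d/moodle on the instances that run cron: keep all output so the next
# unclean failure of a scheduled task leaves something we can read afterwards
* * * * * www-data /usr/bin/php /var/www/moodle/admin/cli/cron.php >> /var/log/moodle-cron.log 2>&1
```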
Thank you in advance for any insight.