Periodic slowdowns on AWS

by Chris Fryer -

Hello

We are running Moodle on AWS. We have recently noticed high Time To First Byte (TTFB) measurements when loading courses and activities. The problem is not always reproducible: sometimes TTFB is under 2 seconds.

We are running all our exams on Moodle because we can't hold in-person exams (for obvious reasons). We use AWS Auto Scaling to scale out in the morning to cope with the load, and scale in again in the evening.

Right now (the evening in the UK), we have instances that have been running for a while and TTFB is consistently under 2 seconds.

This makes us suspect that the instances we start in the morning have cold local caches, which makes the initial compilation of courses, activities, etc. very slow.

We use ElastiCache (Redis) for the shared MUC, and APCu locally on each instance to cache MUC entries that don't need to be shared by all instances. We have also defined localcachedir to be local to the instance. localcachedir contains the usual stuff: JavaScript, theme files, and PHP generated by Mustache.
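For context, the split looks roughly like this in config.php (the hostname and path below are placeholders, and the Redis/APCu store mappings themselves are set up in the Caching admin UI rather than config.php):

<?php
// Excerpt from config.php -- hostname and path are made up.

// localcachedir on instance-local storage: JS, theme files,
// Mustache-compiled PHP. Each new instance rebuilds this from scratch.
$CFG->localcachedir = '/var/cache/moodle-local';

// Sessions in ElastiCache so any instance can serve any user.
$CFG->session_handler_class = '\core\session\redis';
$CFG->session_redis_host = 'sessions.example.cache.amazonaws.com';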

We plan to run some tests to see whether undefining localcachedir and/or using Redis for all MUC (including stuff that could be local to each instance) helps. But I'm wondering if anyone already has some experience with this and can say: yes, it's better not to have things cached locally on instances that don't live very long.

In reply to Chris Fryer

Re: Periodic slowdowns on AWS

by Alex Rowe -
I remember reading a bit about how Amazon disks behave when EBS volumes are shared across instances, and seeing reports of possible issues when lots of small files are accessed and you hit IOPS limits.

If you have some monitoring to show that disk (or something else) is the issue, then that might help you narrow it down.

From that last bit of reading, I think some users were running their own NFS shares instead of relying on a shared AWS volume.

This is all conjecture though as I don't have anything to back it up without getting back into it.

In our load-balanced environments (non-AWS), we have three Redis instances on a single host: one each for the application MUC, the session MUC and PHP sessions. The only things local would be localcachedir and the PHP opcache.

You could also try to pre-warm the PHP opcache before adding the instance to the LB pool, and maybe copying (rsync) the localcachedir over is a kind of pre-warming too (sketch below).
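The warm-up itself doesn't need to be clever: something like this (untested sketch; the path is a placeholder) compiles every file into opcache without executing anything. It has to run under the web server's SAPI (e.g. via php-fpm), since the CLI opcache is separate.

<?php
// warm-opcache.php -- rough sketch: compile every Moodle file into
// opcache without executing it. The path is a placeholder.
$dirroot = '/var/www/moodle';

$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($dirroot, FilesystemIterator::SKIP_DOTS)
);
$compiled = 0;
foreach ($files as $file) {
    if ($file->getExtension() !== 'php') {
        continue;
    }
    // Parses and caches the file in opcache, but doesn't run it.
    if (@opcache_compile_file($file->getPathname())) {
        $compiled++;
    }
}
echo "Compiled {$compiled} files into opcache\n";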

The other part to look at would maybe be caching more assets at the front-end load balancer (if possible, but I don't think the default Amazon LB will let you) so that all CSS/JS/theme assets etc. are served from there instead of from the instances.

With all that said though, I would be taking a look into your shared disks and IOPS issues first.
In reply to Alex Rowe

Re: Periodic slowdowns on AWS

by Chris Fryer -

Thanks Alex. We are using a CDN (CloudFront) for static assets, and EFS (Amazon's NFS) for moodledata. The instances don't share EBS volumes.

> You could also try to pre warm PHP opcache before adding the instance to the LB pool

Great tip. I'd forgotten that the opcache would also be cold. We're not yet using PHP 7.4, which would make opcache preloading more straightforward, but this gist might do the trick.

In reply to Chris Fryer

Re: Periodic slowdowns on AWS

by Alex Rowe -
I believe it was EFS now that I have done another search of the forums. Check out posts like this, https://moodle.org/mod/forum/discuss.php?d=384641, or search for some others.

Lots of other users see slow performance in some areas after moving to EFS, but for some it seems to be good enough.

Specifically, it's the part where performance is relative to the size of your EFS file system (or something like that) in how IOPS get provisioned. When using Amazon's EFS you're at the whim of their shared storage, and if someone else needs it more (or pays more) they get priority.

Users have got around this by running their own NFS (or GlusterFS etc.) servers on normal Amazon disks and sharing those between instances.
In reply to Alex Rowe

Re: Periodic slowdowns on AWS

by Chris Fryer -

Hi Alex

Thanks again for your advice. I thought it might be useful to show some of the telemetry we observe in our environment. This is a typical pattern of requests during our exam season:

The spike in traffic at 11:00 UTC coincides with the release of exam papers. Candidates camp on the assignment or quiz page and refresh until the paper is released. We only get anomalous traffic from 10:58 UTC, which isn't enough time for AWS Auto Scaling to respond (even though we've built AMIs that take less than 30 seconds to go into service).

We scale at 07:50 UTC, which should give the instances enough time to get warm. But the following graph shows the 95th percentile of response times for the instances:

Note that, until late in the day, the p95 doesn't even approach the performance of the instances that have been running overnight.

This makes me think it must be something local to the instances themselves. I've had a look at the opcache stats on each instance, and they do show a ~70% hit rate compared to their overnight brethren, but once a file is cached, that's it. We are more and more convinced it's a MUC issue.
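(For the record, we're pulling those numbers from opcache_get_status() on each instance, something like:)

<?php
// Quick hit-rate check via the built-in opcache stats.
$status = opcache_get_status(false); // false = omit per-script details
printf("opcache hit rate: %.1f%%\n",
    $status['opcache_statistics']['opcache_hit_rate']);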

We could abandon APCu and keep everything in the shared cache, but we're concerned we'd overload the Redis instances. Here's another (insane) graph showing traffic out of Redis:

ElastiCache is shipping more than 3GB to the instances at the time the exams are released.

So I'd really like to know if there's a way to prime the APCu caches.
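Since APCu lives in the PHP-FPM pool's shared memory, the only way I can see to prime it is to have Moodle populate it itself, i.e. replay a few representative requests through the web server before the instance joins the pool. A rough sketch (the hostname and URLs are placeholders, and authenticated pages would also need a session cookie):

<?php
// warm-requests.php -- hypothetical sketch: hit a few representative
// pages on the local web server so Moodle fills APCu/localcachedir
// before the instance enters the LB pool. Hostname/URLs are made up.
$host = 'moodle.example.ac.uk';
$paths = [
    '/course/view.php?id=101',
    '/mod/quiz/view.php?id=2345',
    '/mod/assign/view.php?id=6789',
];

foreach ($paths as $path) {
    $ch = curl_init('http://127.0.0.1' . $path);
    curl_setopt_array($ch, [
        CURLOPT_HTTPHEADER     => ['Host: ' . $host],
        CURLOPT_RETURNTRANSFER => true, // discard the body, we only want the side effects
        CURLOPT_TIMEOUT        => 30,
    ]);
    curl_exec($ch);
    printf("%s -> HTTP %d\n", $path, curl_getinfo($ch, CURLINFO_HTTP_CODE));
    curl_close($ch);
}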

In reply to Chris Fryer

Re: Periodic slowdowns on AWS

by Alex Rowe -
The opcache hit rate is going to be a bit out for the first part due to all the initial misses vs hits. Over time there are so many more hits than misses, so that would be why it shows much higher as time goes on. The only way I can think of to keep the opcache hits high (even from the beginning) is to pre-warm them.

Can you see any per-instance stats like disk, IO, system wait times or anything like that which may show other areas that could be affected?

In our load balanced environments, we only have the central Redis instances. There aren't any local caches on the individual app servers apart from the localcachedir.

Not having much experience with AWS either, I don't know how their older implementation of Redis works and how it interacts with their internal networking (e.g. prioritisation over their other traffic or how it works on their internal/external networks).

Being in the same circumstances around exams, it's hard to just try something and see, as you don't know if it's going to affect the traffic in a negative direction.

Local caching should, in theory, be faster, but first the instance has to go back to disk and then populate the cache, which is where the performance hit comes. From then on it should be faster, but you may want to look at storing all items in Redis and swap back if it doesn't work out.