Advice : AWS Capacity Planning for 20,000 concurrent users

by Sid Mad -
Number of replies: 13

Hi,

We are developing a learning platform on Moodle. One of the client requirements is that it must be able to support 20,000 concurrent users, which is huge. We have been load testing for AWS capacity planning and running into performance issues. I have read through many posts in this community but still do not have a clear direction. Below are the details of the setup I currently have configured:

Setup:

A 100 GB network file share served from an m4.xlarge instance is mounted on multiple m4.xlarge web server instances (4 vCPU, 16 GiB memory), each with a 50 GB volume.

These instances are connected to a load balancer and the traffic is distributed equally. 

The instances are connected to an m4.xlarge MySQL RDS instance.

All storage volumes are Provisioned IOPS with 1,000 IOPS, and everything runs in a single Availability Zone.

Moodle is installed on the instances, with moodledata residing on the file share. Sessions are also maintained on the file system.

Testing:

We are running JMeter scripts that simulate users logging in, searching for courses, enrolling, and reading through the course material (SCORM packages of around 40 MB each).

Performance is poor: it hits a low at 256 concurrent users with a ramp-up time of 60 seconds, where latency reaches around 54 seconds.

One thing we have observed is that CPU utilization is around 70% and memory utilization is around 55%, which seems to indicate that the instances are still not being used to their full potential. What could be the reason for the sub-optimal performance?

These are the queries that I have: 

  1.  What would be a back-of-the-napkin baseline configuration (AWS instance types and counts) to support this architecture and 20,000 concurrent users?
  2.  Is there an alternative architectural approach that is more resilient and scalable?
  3.  Are there examples of real-world Moodle systems that handle this kind of traffic, and what is their architectural approach?

Any guidance or advice would be much appreciated!

Thanks

Sid


In reply to Sid Mad

Re: Advice : AWS Capacity Planning for 20,000 concurrent users

by Michael Spall -

Sid,

We don't have the same numbers that you reference. We have max users of around 400/second. You might need to talk to a Moodle Partner if you really mean 20,000 users/second.

What type of network file system are you using and how is it mounted? We are using Gluster with the native client. The shared file system needs to be fast enough for the shared file cache. I wouldn't use NFS. Is the shared file system on one machine?

How is your MUC and session cache set up? Double-check that you have that configured correctly. We have two different memcached instances, one for sessions and one for the application cache. We have three different file system caches: one local to each web server and two on the shared file system. This is why the shared file system needs to be fast.
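
For reference, a minimal config.php sketch of that directory split (the paths are hypothetical; the memcached stores themselves are added as MUC stores under Site administration > Plugins > Caching > Configuration rather than in config.php):

    // Per-web-server cache on fast local disk; this directory must NOT be shared.
    $CFG->localcachedir = '/var/local/moodle_localcache';  // hypothetical path

    // Shared directories live on the shared file system next to moodledata,
    // which is why that file system has to be fast.
    $CFG->dataroot = '/mnt/moodledata';                     // hypothetical mount
    $CFG->cachedir = '/mnt/moodledata/cache';
    $CFG->tempdir  = '/mnt/moodledata/temp';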

Try this benchmark test: Moodle BenchMark

How many M4.XLarge web servers do you have?

We found that increasing the machine type by one level doubled capacity. See if that holds true on your setup.

How big is your DB? Does it fit into memory, or are you having to do a lot of disk reads? Look at Aurora for your DB.

In reply to Michael Spall

How should the number of concurrent users in Moodle be defined (and measured)?

by Visvanath Ratnaweera -
Hi Michael

What is users/sec exactly? I would imagine at 100 users/sec the site has 6,000 more users logged in after a minute. That is not what you have in mind, right?
In reply to Visvanath Ratnaweera

Re: How should the number of concurrent users in Moodle be defined (and measured)?

by Michael Spall -

Visvanath,

"Concurrent users" isn't a well-defined term. I use Moodle logs and count distinct users per second; we find that translates well to server load. For us, distinct users per minute isn't as directly related to load.

In reply to Michael Spall

Re: How should the number of concurrent users in Moodle be defined (and measured)?

by Visvanath Ratnaweera -
Hi Michael

When you say "I use Moodle logs and count distinct users per second", it can't be the number of (distinct) users who are logged in during that second; isn't it rather the number of HTTP(S) requests sent by (distinct?) users during that one second?
In reply to Visvanath Ratnaweera

Re: How should the number of concurrent users in Moodle be defined (and measured)?

by Michael Spall -

Visvanath,

Correct, and that is why the term "concurrent users" is problematic to begin with and why I use requests by distinct users per second. Concurrency should really be measured by the number of concurrent threads/processes, HTTP connections, DB connections, and disk IO; requests by distinct users per second is an easy-to-gather proxy for those stats.

You can get at concurrent requests from Moodle logs if you know the average time for a request to complete.
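
As a rough sketch of that calculation against the standard log store, using Little's law (concurrent requests ≈ requests per second × average request duration); the 0.5 second average response time below is an assumed figure for illustration:

    // Log entries (a proxy for requests) per second over a window, busiest seconds first.
    $sql = "SELECT timecreated, COUNT(id) AS requests
              FROM {logstore_standard_log}
             WHERE timecreated BETWEEN :start AND :end
          GROUP BY timecreated
          ORDER BY requests DESC";
    $persecond = $DB->get_records_sql($sql, ['start' => $start, 'end' => $end], 0, 10);

    // Little's law: concurrency = arrival rate x average time in the system.
    $busiest = reset($persecond);
    $peakrate = $busiest ? $busiest->requests : 0; // requests in the busiest second
    $avgresponsetime = 0.5;                        // seconds, assumed for illustration
    $concurrentrequests = $peakrate * $avgresponsetime;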

Average of ratings: Useful (1)
In reply to Michael Spall

Re: Advice : AWS Capacity Planning for 20,000 concurrent users

by Sid Mad -

We have reduced our testing environment to a single m4.xlarge instance (4 vCPU, 16 GiB memory) and one m4.xlarge MySQL RDS instance. We removed our NFS server to eliminate as many potential bottlenecks as possible. We have not yet implemented any custom caching mechanism and are very open to caching suggestions.

In reply to Sid Mad

Re: Advice : AWS Capacity Planning for 20,000 concurrent users

by Howard Miller -

Simple question... 20,000 concurrent users. Prove it!!

I am not being facetious. The first and vital step is to fully justify that figure. Where did it come from, and what exactly do you mean by it?

I see this all the time. A man in a suit plucks a massive number out of the air with no basis in reality or understanding of what it means. If you get this wrong you will have big problems, or big costs, or both. You need to understand the difference between "must be able to handle 20,000 concurrent users" and "we will pay for 20,000 concurrent users". Especially if the norm is more like 500 (which is still a very big site, by the way).

In reply to Howard Miller

Re: Advice : AWS Capacity Planning for 20,000 concurrent users

by Sid Mad -

@Visvanath, @Howard
I understand the concern about 20,000 concurrent users, and we are re-estimating the numbers with input from the client to get more realistic figures. That said, the system should be able to support at least 20,000 logged-in users, and I will share the re-estimated concurrent user figure for your input soon. Our current test script only involves logging in, and we are still seeing latencies of up to 40,000 ms with just 256 users. I hope that partially answers your questions.

In reply to Howard Miller

Re: Advice : AWS Capacity Planning for 20,000 concurrent users

by Albert Ramsbottom -

I agree

I went to Spain to set up a Moodle cluster for a potential 2.5 million users across 26 member states of the EU. The most concurrent users (users per minute) we ever got was 886.


Men in suits!!

Albert

In reply to Sid Mad

Re: Advice : AWS Capacity Planning for 20,000 concurrent users

by Paul Stevens -

Hi Sid,

We (Catalyst IT) have implemented a site significantly larger than that (4M users, 320,000 SCORM course completions per hour). Note that a significant amount of work went into the architecture, and some seriously large servers were involved. We called this 80,000 concurrent. Page loads were under 1 second at load.

We have some large AWS implementations in place which perform significantly better than what you're describing. It would be good to confirm what 'concurrent' means, and the key is understanding what the learners and other users are doing when you test your peak load; that would largely determine what we would recommend in terms of design. The database is almost always the bottleneck on the larger sites once you have your architecture sorted. Note that you also need to take into account what your SCORMs are doing, as there can be a lot of data written back to the database, which is significant if you have 20,000 completions within 5 minutes. So if the database is getting hammered, check that your SCORMs are not doing too many DB writes (one way to check is sketched below).
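
One hedged way to eyeball that, assuming the Moodle 3.x SCORM tracking schema where write-backs land in the scorm_scoes_track table:

    // Rough check: how many tracking rows each SCORM attempt writes back.
    $sql = "SELECT scormid, userid, attempt, COUNT(*) AS trackedelements
              FROM {scorm_scoes_track}
          GROUP BY scormid, userid, attempt
          ORDER BY trackedelements DESC";
    $rs = $DB->get_recordset_sql($sql, null, 0, 20); // 20 heaviest attempts
    foreach ($rs as $row) {
        mtrace("scorm {$row->scormid} user {$row->userid} attempt {$row->attempt}: " .
               "{$row->trackedelements} tracked elements");
    }
    $rs->close();

Attempts with thousands of tracked elements per learner are usually the ones hammering the database.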

Cheers, Paul.



Average of ratings: Useful (1)
In reply to Paul Stevens

Re: Advice : AWS Capacity Planning for 20,000 concurrent users

by Sid Mad -

We would like to set up a system that can handle traffic similar to what you described, although we do not need blogs, forums or assignments. The SCORM packages we plan to host will have quizzes folded into them, with grading also handled inside the packages, so we do not need the quiz or grading plugins. However, we are currently only testing logging in, yet we are still observing unacceptable latencies. Would you be able to share your present AWS architecture and any performance tweaks you have made to Moodle to handle this kind of traffic? Specifically, we are interested in what instance type/size would serve as a good starting point for benchmarking, a ballpark estimate of the number of instances we would need for production, and whether we should be using any services other than EC2 or RDS before adding autoscaling.

Your help/guidance is much appreciated! 


In reply to Sid Mad

Re: Advice : AWS Capacity Planning for 20,000 concurrent users

by Tomasz Muras -

It seems like the performance you are getting is far too low. My first suspect would be the file system, i.e. Moodle sessions and the Moodle cache. Try memcached for both; a sketch is below.
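
If it helps, a minimal sketch of moving sessions off the file system and into memcached via config.php (the host name is hypothetical; the settings mirror the examples shipped in config-dist.php). The MUC application cache can then be pointed at a second memcached store under Site administration > Plugins > Caching > Configuration.

    // Store PHP sessions in memcached instead of on the (shared) file system.
    $CFG->session_handler_class = '\core\session\memcached';
    $CFG->session_memcached_save_path = 'sessioncache.internal:11211'; // hypothetical host
    $CFG->session_memcached_prefix = 'memc.sess.key.';
    $CFG->session_memcached_acquire_lock_timeout = 120;
    $CFG->session_memcached_lock_expire = 7200;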

If you publish your test data (jmeter tests, moodle code, DB and data used for testing) then I can run the same tests on the same data on my local setup - to give you some comparison.

cheers,
Tomek