Load balancer problem with 100 simultaneous users taking a quiz

by Don Hinkelman -
Number of replies: 28

With 100 simultaneous users taking a 60 question quiz with audio and images, our school's Moodle slowed down to a crawl, giving everyone spinning balls on the browser for 2-3 minutes to refresh a page.

Our LAMP server monitoring software showed no problems anywhere: loads were minor. There were no client, Moodle, or network problems. It was a mystery until we turned off the load balancer (thus running Moodle and its database from one server alone). Then we had no problems administering the quiz, even with up to 200 simultaneous users on the same quiz the next day.

Why is the load balancer causing the problem?  That we do not know.  We do suspect that our settings and design could be the problem because we built the load balancer ourselves to save money, using open source Pacemaker, Ultramonkey, and Pound.

If anyone has experience with building an open source load balancer, we would like to know what worked for your situation and how we can avoid slowdowns.

Average of ratings: Useful (1)
In reply to Don Hinkelman

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Howard Miller -

Of course, an even cheaper solution is to throw the LB away and just use DNS round-robin. 

As long as you are monitoring your web servers properly it works fine and avoids LB oddness. Just a thought...

In reply to Don Hinkelman

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Marcus Green -

I have seen a situation where chucking out the load balancing led to a dramatic improvement in performance. Not sure why.

In reply to Don Hinkelman

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Andrew Lyons -

Hi Don,

I assume that this is a Linux server, so check the relevant logs (the software in question, and also the kernel logs).

I wonder whether you may be hitting a max open files limit, or something along those lines. You'll see this in standard `dmesg` output, or in your kernel logs. I forget the exact output you get, but it will be something about either max open sockets, or max open files per process, or max open files. Typically it's relatively low on a new system and will need to be tuned up according to the capability of your system. Whenever your load balancer receives a connection it uses a socket for the inbound, and then an additional socket for the outbound connection. Others may also be used if you have intermediary steps.

Have you got any monitoring on your system (e.g. Icinga)? This can be invaluable in helping you to track down such issues.

I've not used Ultramonkey with Pound before -- I've always used haproxy extensively for Moodle loads with no ill effect. It's a powerful, free, open source load balancing tool with a low TCO. It's relatively easy to set up, with a good manual. I've run it for pretty large installations (>= 20,000 current users) with no ill effect. The latest versions have SSL capabilities built in (previously you could not terminate SSL with haproxy), and it supports quite a number of features. At Lancaster, we ran a pair of haproxy servers (2vCPU, 1GB RAM) with Keepalived for IP failover between the boxes. We didn't bother with Pacemaker or heartbeat - we deemed it unnecessary for a pair of dumb boxes which just passed stateless traffic.

Hope this helps,

Andrew


Average of ratings: Useful (1)
In reply to Andrew Lyons

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Matteo Scaramuccia -

Hi Don,
I used Pound 6 years ago in a Moodle cluster and gave it up for performance reasons after a few months of performance testing: it was much better to use mod_proxy_balancer, which is still a nice option if you don't want to use nginx or haproxy.

HTH,
Matteo

In reply to Don Hinkelman

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Jeff White -

How is your load balancer set up to handle HTTP traffic? Can you have it balance HTTP traffic with cookie persistence? I think a few things start acting up for users when they bounce between Apache nodes within the same session. I only noticed errors with our load balancer for items using localcache and localtempdir set on each node.

Since there is little documentation on which parts of Moodle use localcache and localtempdir, it's a little hard to say, but suppose a user is taking a quiz through Apache node 1 and, when they click next, they jump to Apache node 2. The work done on node 1 has stored files and caches in tempdir and localcachedir that Apache node 2 then looks for within its own local folders. So your browser is stuck spinning, waiting for files to load.

In reply to Jeff White

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Tim Hunt -

You should not need cookie persistence for any part of Moodle (unless you have set up caching wrong).

localcache and localtempdir should only be used for things within a single HTTP request. Anything else should be in MUC / session / database, which is shared between all web-server nodes.

Average of ratings: Useful (1)
In reply to Tim Hunt

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Jeff White -

I would just recommend for him to try it out. It would be interesting to see if the user gets the 3 minute load times if they stay on the same Apache node the entire session without having to disable the load balancer. 

There have been encounters where certain things should use the shared caches/temp directories and they do not. MDL-48642.

In reply to Jeff White

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Andrew Lyons -

Hi Jeff,

Thanks for pointing out that issue - I've now closed it as it was related to mis-documentation which has been updated.

As that issue proves, when files are missing from their expected locations, we do display errors and warnings. We do not hang silently for three minutes.

Andrew

Average of ratings: Useful (1)
In reply to Andrew Lyons

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Jeff White -

Oh boy. That means I need to change some settings on the config.php for my environment. Kudos Andrew. This will solve my issues I have been encountering and is much better than using persistent or affinity load balancing to keep the users on 1 node during their session. 

I see you updated the tempdir section, but there is a note in the performance recommendations that also needs to be changed.

"1. Set $CFG->tempdir to fast local filesystem on each node."
In reply to Jeff White

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Andrew Lyons -

Hi Jeff,

As Tim says, there should be no need for session stickiness within Moodle (cookie persistence is one type of session stickiness).

In any case, Moodle is pretty good about the way in which files are stored so long as the underlying architecture is correct - that is to say that the tempdir and various other shared caches are indeed shared.

If Moodle is unable to serve files (e.g. because they are indeed missing), then it will not just sit and spin - it will return an appropriate error code, such as a 404 (Not found).

Clearly this solution is working in some aspects for Don, but there are issues when the system load is higher.

Andrew

In reply to Don Hinkelman

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Andrew Lyons -

Hi Don,

Another thought occurs: Where are you storing your sessions?

I would recommend not storing them in moodledata, especially if it is on a shared filesystem. The reason for this is that you may encounter file-locking issues.

I would recommend that sessions are stored either in DB, or if you prefer in memcached.

These could be a cause of your slowness, but I would expect to see these generally all the time, not just when the server is busy.
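
As a rough sketch of those two options, assuming the session settings listed in config-dist.php of Moodle 2.6 and later (please check the names against your own copy): the memcached address and the timeout values below are placeholder assumptions, not recommendations.

```php
// Fragment for config.php (not a standalone script): $CFG is already defined
// there, and these lines belong before lib/setup.php is required.

// Option 1: database sessions
// (config-dist.php notes this handler is not compatible with MyISAM tables).
$CFG->session_handler_class = '\core\session\database';
$CFG->session_database_acquire_lock_timeout = 120; // seconds; assumed value

// Option 2: memcached sessions. Use this instead of option 1, not alongside it.
// $CFG->session_handler_class = '\core\session\memcached';
// $CFG->session_memcached_save_path = '127.0.0.1:11211'; // hypothetical server address
// $CFG->session_memcached_prefix = 'memc.sess.key.';
// $CFG->session_memcached_acquire_lock_timeout = 120;    // seconds; assumed value
```

Either handler keeps session locking off the shared filesystem, which is the point of the recommendation above.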

Best wishes,

Andrew

In reply to Andrew Lyons

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Christian Stehle -

Hi there,

We have the same problems here with a fresh Moodle 2.8.

On RHEL7 (CentOS 7) we're running 2 load balancers (HAProxy in roundrobin mode, Keepalived for the virtual IP, SSL termination), 2 Apache servers, 1 NFS server and 1 DB cluster (MySQL Galera in master/slave mode).

Everything looks fine, but when testing under some load with JMeter, one Apache slows down after 25-30 user logins within 20 seconds. If I use 1 Apache or run HAProxy in master/failover mode, everything runs fine.

I think performance/resources is not the issue:

The LBs have 4 cores/4GB Ram

The Apaches have 4 cores/8GB Ram

The NFS has 20GB Ram

The Galera Cluster has 8 GB Ram on each node

The CPU Usage is fine.

Moodledata is shared on NFS, as is the wwwroot. The localcache is local on each machine. We have no errors, and CPU usage and bandwidth are low. I have no clue what's slowing things down.

Sessions are stored in the DB, but storing them on NFS (v4 with locking) makes no difference.

It's really strange. Any help would be great!

Christian

In reply to Christian Stehle

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Howard Miller -

How is your shared cache arranged?

In reply to Howard Miller

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Christian Stehle -

dataroot ->  shared on nfs

Localcachedir -> local on disk

tempdir -> shared on nfs

cachedir -> shared on nfs

(all as described here https://docs.moodle.org/28/en/Server_cluster)

We're using NFS V4 with file locking.

Sessions are stored in DB.
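
A minimal sketch of how this layout might be expressed in config.php on each web node. The four setting names are standard Moodle ones, but every path below is a hypothetical placeholder and not taken from the actual setup.

```php
// Fragment for config.php (not a standalone script); all paths are hypothetical.

// Shared between all web nodes (here: an NFS mount).
$CFG->dataroot = '/mnt/nfs/moodledata';
$CFG->tempdir  = '/mnt/nfs/moodledata/temp';   // must be shared by all nodes
$CFG->cachedir = '/mnt/nfs/moodledata/cache';  // must be shared; relies on file locking

// Local to each node, on fast local disk.
$CFG->localcachedir = '/var/local/moodle/localcache';
```

Because cachedir relies on file locking, working locks on the shared mount matter for this layout.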

In reply to Christian Stehle

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Howard Miller -

There's your problem right there. You can't use NFS for caching - it's far too slow. Set up a memcached server and point MUC towards it. I expect you will see a dramatic improvement. 

In reply to Christian Stehle

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Galin Vassilev -

Hello,

A couple of points to consider. First, make sure you use OPcache with PHP (https://docs.moodle.org/29/en/OPcache). Second, if I understand you correctly, you have both the "wwwroot" and "dataroot" directories shared over NFS. I would recommend moving the wwwroot directory to a location on the local disk of each web server and seeing what that does performance-wise.

Cheers

Galin

In reply to Galin Vassilev

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Christian Stehle -

Hi,

OK, I moved the wwwroot to the local filesystem: no change in speed.
OPcache is installed correctly and is running.

I have this test case in jMeter:

- Login with LDAP Credentials
- Logout

30 Threads within 10 seconds.

I'm using registered LDAP test accounts. If I run both Apaches with haproxy, the maximum response time of login/index.php is 30930ms. If I shut down one server (it doesn't matter whether ap1 or ap2), the maximum is ~950ms (the minimum is about 350ms).

LDAP is not the problem; it also slows down when using a single local account.

But it's only the login page which slows down; all other pages run without problems.

I switched to locally stored sessions instead of DB: no change.

Christian




In reply to Christian Stehle

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Jeff White -

Does your organization have access to an F5 load balancer? Perhaps just try switching to a different load balancing method or product?

In reply to Christian Stehle

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Galin Vassilev -

I would agree with Jeff. Based on your tests, the next possible test is to replace HAProxy with something else. May I suggest just simple LVS with keepalived? We use that exclusively in my organization to great success. If you look at your numbers, the difference between 30930ms and 950ms is almost exactly 30 seconds (29980ms). Without knowing enough about your exact configuration parameters, it just feels like a timeout interval of some sort.


Cheers.

In reply to Galin Vassilev

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Sergio Rabellino -

I've been using haproxy with Moodle since 2009 without problems; it's strange to think that such a stable product could work so badly. Timeouts like these are typically caused by failed DNS resolution on the system where haproxy lives: if DNS resolution is not correctly configured or the DNS server cannot be contacted, and you are asking haproxy to do logging with reverse resolution, then the bad performance is explained...


hope this helps.

Average of ratings: Useful (1)
In reply to Sergio Rabellino

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Howard Miller -

I find it really surprising that HAProxy would be (in itself) the issue. In the great scheme of things, it's not doing a huge amount. I too suspect that problems are to be found elsewhere.... e.g. DNS resolution or some network oddness. 

In reply to Galin Vassilev

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Christian Stehle -

I also thought about a timeout, but it's not a real timeout because I get back correct pages with no error.

I tested haproxy in every balancing mode, with no change.

The JMeter test runs this way: both Apaches get about 15 requests. One of them delivers quick replies, so that it has at most 2-3 open connections. The other Apache sits waiting with 12-15 open connections, and the database also shows 12-15 open sessions. When the quick Apache has finished, the slow one delivers its pages in a snap. And in the next test, the roles of slow and fast Apache may be swapped.

It seems that something is locking one Apache so that it does not send answers back to haproxy. It could be haproxy (but I don't think so), the database or NFS.

Well, everything runs on a big ESX host, so there's no network or resource problem.

Christian

In reply to Christian Stehle

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Sergio Rabellino -

And no errors logged in the Apache logs? Did you check whether the nfslock service is running on the NFS server?

In reply to Sergio Rabellino

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Christian Stehle -

No errors in the Apache logs.

On the NFS server and client, nfs-lock, nfs-idmap, nfs-server and rpcbind are running.

Everything is as it should be. There are known bugs with CentOS 7.1 in enabling the nfs-lock and nfs-idmap services automatically, but starting them manually works fine.

Is there a way to check whether locking is working as it should?

-christian

In reply to Christian Stehle

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Christian Stehle -

got it!

NFS v4 was not working properly: file locks did not work.

After changing /etc/exports from

/nfs-share/     MY_IP(rw,sync,no_root_squash,no_subtree_check,)

to

/nfs-share/     MY_IP(rw,sync,no_root_squash,no_subtree_check,fsid=0) 

it works as expected!

Thanks all for your suggestions!

Christian


Test script to check whether locking works:

<?php
touch("testfile");
function mylock() {
  $F1 = fopen("testfile","r");
  if (flock($F1,LOCK_EX|LOCK_NB)) echo "First lock OK\n"; else echo "First lock$
  $F2 = fopen("testfile","r");
  if (flock($F2,LOCK_EX|LOCK_NB)) echo "Second lock OK\n"; else echo "Second lo$
}
mylock();
echo "Function returned.\n";
mylock();
unlink("testfile");
?>
Average of ratings: Useful (2)
In reply to Christian Stehle

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Sergio Rabellino -

Interesting check script, but the cut and paste is broken!


In reply to Sergio Rabellino

Re: Load balancer problem with 100 simultaneous users taking a quiz

by Christian Stehle -

yes.... copied from console clown

Original from http://php.net/manual/de/function.flock.php

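<?php
// Run this from a directory on the NFS mount you want to check.
touch("testfile");

function mylock() {
  // Open the same file twice and try a non-blocking exclusive lock on each handle.
  $F1 = fopen("testfile","r");
  if (flock($F1,LOCK_EX|LOCK_NB)) echo "First lock OK\n"; else echo "First lock FAIL\n";
  // With working locks, this second attempt is expected to be refused
  // while the first lock is still held.
  $F2 = fopen("testfile","r");
  if (flock($F2,LOCK_EX|LOCK_NB)) echo "Second lock OK\n"; else echo "Second lock FAIL\n";
}

mylock();
echo "Function returned.\n";
// Second run: checks how the locks behave once the first call's handles are gone.
mylock();

unlink("testfile");
?>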
Average of ratings: Useful (1)