1 Netscaler doing round robin load balancing to the 5 app servers
5 App servers (redhat, apache) (1 of these servers is running the cron)
1 NFS server that hosts the shared sitedata
1 MySQL DB server
We do not have a caching server (redis) setup yet.
We are experiencing a major issue that once we seem to get close to 500 or so db connections, the site performance really begins to slow down.. anything over 1000 and it's very very slow.. 2000 and it's basically unusable.
Apache is using default prefork.c MPM (i've heard we should switch to worker or FastCGI and PHP-FPM (anyone have information on doing this?)
We have gone through all of the settings we can think of but could really use some help in figuring out where our bottleneck is.. our servers are very overpowered.. as far as ram and CPU.. they are not maxxed.. the only correlation we have is that it correlates to spikes in the dbconnections.
Where are your sessions being stored now?
Also, have you configured PHP/Apache/MySQL to use all your RAM/CPU?
How many users are online at a time when you're seeing these levels of DB connections? I normally see about 1/4 or less DB connections to the number of online students.
The DB connections may only be a symptom for example if you aren't caching and relying on NFS to serve the MUC and the database to do sessions then everything will slow down once more users access the site. The more users that access, the slower it gets so the higher the DB connection count will be.
At first we had the sessions and db locks being stored on the NFS server. And just since we've been trying all kinds of stuff, they are currently being stored in the database.. we "think" we've seen a slight improvement by taking that load off the NFS server. Which we "think" is really pointing to us needing to get the REDIS server setup so that the cache is not being served off the NFS server (assuming that's where it's at since that's where our entire sitedata directory is). I'm also pretty sure keeping the sessions / locks in the database is not exactly ideal either, but we were thinking that our NFS server just wasn't able to handle doing it all.
When you ask if we've configured php/apache/mysql to use "all" of our ram/cpu, can you elaborate on that a little more? I know we have a lot of head room as far as how much CPU and Ram utilization we see.. even when our system is ground to a halt at that magic 1500/2000 db connection number.
When things go south, we were seeing user logins at about 600-700 /per 10minutes.
For the PHP-FPM + Apache Event MPM .. is there a step by step guide somewhere for how to install and configure that?
Also, when standing up a REDIS server, is it best to be on it's own box or should/could it share with an app server / nfs or db server? Are any of those preferable ?
Would the REDIS server just host the Sitedata/cache and Sitedata/session directories? We ALSO have a whoopsie (from what i have read now) in that our Sitedata/localcache is also on the NFS server.. from what i've read recently it looks like that folder should live and be local on each of the individual app servers.
Thanks again for your help!!!
In my experience, NFS works well if it's just for the bigger file storage and all other types should be moved to another form of storage. Lots of requests for lots of small files doesn't scale well on NFS.
Sessions on NFS and DB could be just as bad as each other, we saw good improvement when Memcache(d) first became available but have since moved MUC and PHP Sessions to Redis. There is also a redis lock cache plugin but I haven't looked into that much.
I guess Redis COULD live on the same server as NFS or DB, but I prefer to keep them separate as Redis needs some OS changes to work best. We also run two instances of Redis, one for MUC and one for Sessions so we can use different Redis configs. As Redis is only single threaded, it doesn't need to have multiple cores etc, just as much RAM as you think is required. MUC about 1GB max with LRU policy and Sessions only needs hundreds of MB. Maybe start with higher values, monitor then reduce as needed.
To properly configure PHP and MySQL the basics are working out how much RAM you have available, how much each process uses then editing the config to work out how many maximum processes your server can handle. With just the default configs, they can't handle very much but you probably would have seen issues before now.
PHP-FPM is working out each PHP process size, then dividing that by RAM available and setting max processes in the config. MySQL is working out your total buffers + largest InnoDB buffer pool that can all fit into RAM.
I don't have any specific guides for Redhat, but search for a few articles on how to install Apache Event MPM, PHP FPM and there should be heaps. PHP FPM uses a shared cache to be able to run more processes on the same amount of RAM (which I think makes Opcache better too?) where prefork creates individual instances and consumes more memory (but it's easier to configure).
Your eventual set up should look like:
App Server: Moodle code + localcachedir
Redis: MUC and PHP Sessions
Redis doesn't "host" those directories, they will still be in the sitedata folder but won't be accessed anymore as all the requests for that type of content will now be in Redis.
Also agreeing with the assessment that a high number of DB connections might only be a symptom. What is the load average on your app servers displaying? When we had lots of external wait times on NFS, we had load averages going through the roof.
Some additions regarding...
... Apache / MPM-Event & PHP-FPM:
- Remember to ensure that the Zend Opcache is enabled.
- In conclusion, remember to take shared memory into account when calculating the average size of an apache process.
- Save up some memory in your calculations for the pool of php processes from the FPM. We found the "static" setting to be most performant.
... Sessions / Caching / Locks & NFS
- Sessions on NFS are an absolute nightmare regarding speed. DB sessions are an improvement but won't really help when number of parallel DB connections is your issue. Go for Redis.
- MUC should also be set up in Redis, yes. This can significantly speed up page load times. Localcache should reside on app server locally.
- Temp-Directory should also be shared by all app servers.
- Locks in DB might bring a little speed, but we've found some issues with DB locks never being released (instead waiting for expiry date) and thus parallel cronjobs being made impossible.
... Database optimization options:
- Single largest performance factor we had when optimizing our database was disabling the query cache.
pm.max_children = 1024
query_cache_size = 0