We are running Moodle 1.9.2 sites for about 25 school districts here in Oakland County, MI. We recently moved from a setup with one web server (Debian with Apache) and one SQL box (CentOS with MySQL) to a clustered setup with two load balancers running LVS, two web servers, and the same SQL box. The web servers use NFS to store the Moodle data folder, which also holds the session information. The load balancers use weighted least connections. The one nagging issue is that after about two days of uptime, users can no longer upload files to the site. The fix we found was to fail over the load balancers, after which uploads work fine for about another two days, and then we have to repeat the process. When an upload fails, the user doesn't get an error or anything, and we're not seeing anything in the Apache or system logs on the boxes. One other note: the web servers are direct routed instead of NATed. Also, here is the documentation I used to set up our system, if anyone is interested:
Any help is appreciated.
Without getting hands directly on the boxes and network, it is a little hard to say exactly; however, here are some things to try:
1. Sessions - try storing these in the database.
2. LDirectord - what is your configuration setting for persistence, the period of time that the load balancer keeps a user on the current web server? There is a user comment in the load balancing tutorial (page 2) about this. You may want to try setting it a lot higher. With the number of users you have, your load balancing will not be compromised. In demonstrating this same architecture at the 2005 Moot in Adelaide, I set the persistence very low and was able to show one page fetching its content (images etc.) from up to eight different web servers. It made for great showmanship, but in reality did not achieve much else.
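To make the persistence idea in point 2 concrete, here is a minimal Python sketch of a "sticky" balancer. It is only a toy model under stated assumptions, not how LVS or LDirectord is actually implemented: the class name `StickyBalancer`, the server names `web1`/`web2`, and the client IP are all made up for illustration, and new clients are simply handed out round-robin rather than by weighted least connections.

```python
import itertools


class StickyBalancer:
    """Toy model: a client IP stays on one server until it has been
    idle for `timeout` seconds (hypothetical, not the LVS code)."""

    def __init__(self, servers, timeout):
        self.timeout = timeout
        self._next = itertools.cycle(servers)  # round-robin for new clients
        self._table = {}  # client_ip -> (server, last_seen)

    def route(self, client_ip, now):
        entry = self._table.get(client_ip)
        if entry is not None and now - entry[1] < self.timeout:
            server = entry[0]          # still inside the persistence window
        else:
            server = next(self._next)  # window expired or new client
        self._table[client_ip] = (server, now)
        return server


lb = StickyBalancer(["web1", "web2"], timeout=300)
first = lb.route("10.0.0.5", now=0)    # new client
same = lb.route("10.0.0.5", now=200)   # 200 s idle, inside the window
later = lb.route("10.0.0.5", now=600)  # 400 s idle, window has expired
print(first, same, later)
```

With a very low timeout, consecutive requests from one user would be spread across servers, which is the effect described in the Moot demonstration.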
I thought about doing the sessions in the database, especially since our database does not seem to be working very hard at all.
I should probably have mentioned that I have turned on persistence; it's set to 300 seconds. I will probably bump that up so it matches the timeout inside Moodle. We also have firewall marks set for the forwarding instead of the normal TCP connections, which I think I will change back to TCP, because there is something wonky between it and the kernel when you just try to restart the service. Anyway, yes, we have persistence turned on and are still getting the issue. I'll try turning it up though.
I'm trying to work through the mechanism of what might be happening. If resetting the load balancers corrects the issue, then I think it's a reasonably safe bet to focus there for the moment.
The next step is logging and analysis. Look here
for some help in setting up logging for heartbeat and ldirectord.
Is there anything already in the Apache logs to indicate the problem?
Once the system goes into the state where uploads fail, does every file upload fail, or just some? Are the failing ones large files? Can small files still be uploaded? Some parameters here may help point to the cause of the problem.
Once uploads stop working, I thought they stopped for all of our sites, but I'm getting different answers from people now. Some say their uploads still work; others say theirs don't. On the affected sites, no files of any kind can be uploaded.
I presume that when you say "timeout", you are referring to the LDirectord persistence value.
If so, remember that LDirectord persistence is in seconds, so 120 minutes is (120*60) = 7200.
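As a quick sanity check on the units, a small Python sketch like the following does the conversion and compares a persistence value against a session timeout. The helper names are made up for illustration, and the 120-minute session timeout is just the figure used in the arithmetic above, not a value read from any real configuration.

```python
def minutes_to_seconds(minutes):
    """Convert a timeout given in minutes to whole seconds."""
    return int(minutes * 60)


def persistence_covers_session(persistence_seconds, session_timeout_minutes):
    """True if the persistence window is at least as long as the session timeout."""
    return persistence_seconds >= minutes_to_seconds(session_timeout_minutes)


print(minutes_to_seconds(120))               # 7200
print(persistence_covers_session(300, 120))  # False: 300 s is far below 7200 s
print(persistence_covers_session(7200, 120)) # True
```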
File size could well be the issue. Small files go up quickly; larger ones take longer and perhaps hit the persistence limit.
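One rough way to test that hypothesis on paper is to estimate how long an upload takes and compare it with the persistence window. The Python sketch below does only that arithmetic. The file sizes and the 1 Mbit/s uplink are hypothetical numbers, and whether the persistence timer can really expire in the middle of a transfer depends on how the load balancer handles active connections, which this sketch does not model.

```python
def upload_seconds(file_size_mb, uplink_mbit_per_s):
    """Estimated transfer time: megabytes * 8 bits per byte / link speed."""
    return file_size_mb * 8 / uplink_mbit_per_s


def outlasts_persistence(file_size_mb, uplink_mbit_per_s, persistence_seconds):
    """True if the estimated upload time exceeds the persistence window."""
    return upload_seconds(file_size_mb, uplink_mbit_per_s) > persistence_seconds


# Hypothetical figures: a 1 Mbit/s uplink and a 300 second persistence window.
print(upload_seconds(50, 1))             # 400.0
print(outlasts_persistence(50, 1, 300))  # True: 400 s is longer than 300 s
print(outlasts_persistence(1, 1, 300))   # False: 8 s fits easily
```

If the failing uploads turn out to be small files as well, this estimate would argue against the file-size explanation.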
Great. Those docs look really good. Tying the load balancing to the IP address of the user seems to be good logic.
You can then keep the session management with the web servers and not have to hit the Db server!
All the best.