Diagnosing issue with Moodle caching

Diagnosing issue with Moodle caching

by Alistair Spark -
Number of replies: 25
Picture of Core developers Picture of Particularly helpful Moodlers Picture of Plugin developers

Hi all,

We've recently upgraded to Moodle 3.7 with Adaptable V2 on rebuilt servers, since we've gone live we have been seeing two really problematic issues:

1) Need to login twice to be able to get in (just loops on the login page) -> this seems linked to the logintoken (https://docs.moodle.org/dev/Login_token) being cached somewhere but it doesn't matter if user has visited site before or not.

2) Session Timeouts occurring randomly

We're only seeing the issue on production and not on pre-prod (server setup identical, code nearly identical -> includes a few minor bug fixes we are deploying to prod next week).


We've managed to confirm / reproduce the issue reliably by using this Firefox plugin - https://addons.mozilla.org/en-GB/firefox/addon/auto-reload-tab/ 

On Prod:

The issue does not occur when using Adaptable & disabling browser cache

The issue does not occur when using Boost (with or without browser cache enabled)


The issue occurs with Adaptable and browser cache enabled.


On Pre-Prod:
The issue never occurs regardless of themes & caching.


While I am staring at the theme as a potential culprit, we don't see the issue in any other environments so it's clearly not that simple.

We use Redis for Session caching (on the NFS server) and file caches for everything else. Have moved to using localcache on each of the app servers. While we do need to look at using Redis for more caching types, it seems like a potential distraction from the problem at hand right now.


Would really appreciate any advice in how to pinpoint this issue any further. We essentially have 2-3 weeks to resolve this before term starts. sad


TIA


N.B. We have had performance issues post-upgrade with the NFS mounts being NFS4 and have downgraded these to NFS3. 

Average of ratings: -
In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Howard Miller -
Picture of Core developers Picture of Documentation writers Picture of Particularly helpful Moodlers Picture of Peer reviewers Picture of Plugin developers
I'd be *really* interested to hear about your NFS4 performance issues, as we have been suffering some (intermittent) oddness since we upgraded to NFS4 and and 3.7.

Just as an aside, I'd use Redis for everything you can use it for. Caching (as I assume you must be) to NFS is an enormous performance hit as NFS is never fast enough.

I assume you have got the site in full developer debugging and are trawling the error logs for any clues? Also, running with the Javascript Console open (and friends!) open. I know... but I have to check big grin

Again, I imagine you have, but what happens if you turn off the login-token protection?
In reply to Howard Miller

Re: Diagnosing issue with Moodle caching

by Alistair Spark -
Picture of Core developers Picture of Particularly helpful Moodlers Picture of Plugin developers
Re NFS 4 downgrade to NFS3, we've seen this as being the top of the iceberg - https://gitlab.com/gitlab-org/gitlab-ce/issues/52017 & https://docs.gitlab.com/ee/administration/high_availability/nfs.html#known-issues and got pre-prod stable for a 2000 concurrent users load test on pre-prod by:
-disabled leases / locking 
echo 0 > /proc/sys/fs/leases-enable
sysctl -w fs.leases-enable=0

-setting NFSDCOUNT to 64 instead of 8 ( /etc/sysconfig/nfs )

-brought rsize/wsize down to prior rhel6 settings (32K)

-downgrade app/admin servers to mount as NFS3 (NFS server still supports NFS4) & reboot NFS server -> this is the change that made it pass the Jmeter load test and only after a reboot of NFS


That is now applied to prod. Issues manifest themselves with D state processes and very high load averages  on app servers as the threads wait to do things on the NFS server but very limited actual CPU usage. I'd like to say this is fully solved on prod, but we are still seeing some sporadic spikes in D states on the app servers which are in a different datacenter to the NFS server. Digging into that separately as potentially caused by inter-DC latency or network bottlenecks.

I had looked at using the database for locking, which did seem to help on Pre-Prod (PP) but have left that aside for now as it was really suggested for mitigating a flaky NFS - https://moodle.org/mod/forum/discuss.php?d=328300#p1320902. I looked at setting the adaptable cache to a redis store last night. We should indeed be doing more redis, but with our change management processes and extra scrutiny after those NFS issues there is a bit of an expectation to focus on the issue at hand. Would need to think through RAM requirement for that though but seeing the performance stats difference on a 5000 performance test, Redis is definetely the way to go - good to see that Redis clustering support might be on the way too.

Yep looking through Chrome Console and there's no errors that I can spot but then I have a harder time replicating this than others. That is how I spotted the pattern for the logintoken. But have yet to capture the premature timeout issue which seems to be triggered by doing a refresh of Dashboard or doing an action (enrolling user, submitting something,...).

Debugging yes but only into apache logs - not displayed to end users.

I've kept turning the login-token protection off as a last resort as our prod site is a target.


Average of ratings: Useful (1)
In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Howard Miller -
Picture of Core developers Picture of Documentation writers Picture of Particularly helpful Moodlers Picture of Peer reviewers Picture of Plugin developers
Sorry to keep grilling you about the NFS thing. Was there any particular trigger for this? We've been seeing something that sounds really similar. The site goes for several days and then we start to see increasing loads and it falls over for no apparent reason. As you say, unmounting the file system seems to be the only cure. It then runs fine for several more days.  We didn't see issues running jmeter tests - but maybe we didn't try hard enough or run the "right sort" of tests. 

On the other matter... I assume you've monitored Redis to try to see what's happening to sessions? Also the 'session cleanup' task might be worth a look. Just thinking aloud - I don't really know. You can probably tell big grin
In reply to Howard Miller

Re: Diagnosing issue with Moodle caching

by Alistair Spark -
Picture of Core developers Picture of Particularly helpful Moodlers Picture of Plugin developers
No worries. Nice to know we aren't the only ones dealing with this - would've thought NFS4 on RHEL7 would be stable by now - though we don't have kerberos setup for NFS4 so there was a theory going that there was no point going with NFS4 if we hadn't sorted that out & anyone we'd ask for help would tell us to go sort that out first.

The most we had ever load tested before my time was 500 concurrent. We passed a 600 concurrent before go-live but knew we failed at 1200 - but link between GA stats and that JMeter stats seemed to be satisfactory for prod peaks. Had a bit of a post-upgrade hickup a day after doing a chown apache:apache on moodledata to try and cure some fopen failures in the cron logs. That hickup seemed very tied to the NFS4 Lock reclaimed failed errors in /var/log /messages.

There are no evicted keys in redis but do feel the misses and expired is quite high. 
Redis Stats

In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Howard Miller -
Picture of Core developers Picture of Documentation writers Picture of Particularly helpful Moodlers Picture of Peer reviewers Picture of Plugin developers
You might want to speak Mike Ahern at Strathclyde Uni. He said the other day that they had some problems that led them to keep using file sessions. I wasn't especially interested at the time and didn't follow it up. However, if you find out... let us know big grin

If the only thing you are using Redis for is sessions then misses do seem quite high. Not that I have anything to compare it with. Most users, most of the time, will be hitting their own sessions that exist. If you asked me to guess I would have thought that hits would be considerably higher than misses. But, as I say, it's just a guess.
In reply to Howard Miller

Re: Diagnosing issue with Moodle caching

by Alistair Spark -
Picture of Core developers Picture of Particularly helpful Moodlers Picture of Plugin developers
File sessions would probably be worse seeing these stats from pre-prod but will send Mike an email:
cache store performance

I did wonder why Session cache got better results than Redis  seeing as Redis is the Session cache though. (but would probably connect to a different port of the same redis server than the one defined in config.php I'm guessing)

I think pre-prod from the load test we were seeing a ratio of 75% for the session cache hit/miss in redis so considerably higher than prod but then that is not the same usage.
In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Alistair Spark -
Picture of Core developers Picture of Particularly helpful Moodlers Picture of Plugin developers
I should add that the issue really flares up after doing a full purge caches (Moodle GUI or CLI or full delete of all cache folders) and does seem to settle a bit over time.

which made me consider this issue in Adaptable as being potentially related - https://gitlab.com/3bits/moodle-theme_adaptable2/issues/190#note_183687299

seeing as we were seeing a lot of these errors in the httpd logs
In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Alistair Spark -
Picture of Core developers Picture of Particularly helpful Moodlers Picture of Plugin developers
We do see a lot of these in the Apache logs though
[cache_disk:error] [pid xxxxx] [client x.x.x.x] AH00717: Premature end of cache headers.
[cache_disk:error] [pid xxxxx] [client x.x.x.x] AH00717: Premature end of cache headers.
In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Howard Miller -
Picture of Core developers Picture of Documentation writers Picture of Particularly helpful Moodlers Picture of Peer reviewers Picture of Plugin developers
You're probably aware... there's an outstanding bug for that, https://bz.apache.org/bugzilla/show_bug.cgi?id=59744
In reply to Howard Miller

Re: Diagnosing issue with Moodle caching

by Alistair Spark -
Picture of Core developers Picture of Particularly helpful Moodlers Picture of Plugin developers

I couldn't really find anything about it but thought it was worth mentioning since it is cache related.

In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Ken Task -
Picture of Particularly helpful Moodlers

In https://redis.io/topics/rediscli

There is a redis-cli that helps with troubleshooting/monitoring in realtime.

redis-cli --stat

There are other redis-cli switches that might help.

'SoS', Ken

In reply to Ken Task

Re: Diagnosing issue with Moodle caching

by Alistair Spark -
Picture of Core developers Picture of Particularly helpful Moodlers Picture of Plugin developers
Yes, have been looking at these https://scalegrid.io/blog/6-crucial-redis-monitoring-metrics/

It all looks ok - at least to someone with limited experience of Redis.

Only thing I have noted is the prefix we use is 7 characters - more than the recommended 5. Simple change but seems unlikely to be the problem seeing as we have the same prefix length on pre-prod.
In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Alain Raap -
Picture of Particularly helpful Moodlers
Your hits percentage is very low, should be at least 97%, 98%. Did you check the Redis log for errors like 'THP enabled', 'vm.commit_memory' and other possible problems? This could make a difference in your perfomance.

Other question, you are using Redhat 7, is Selinux enabled in your environment? If so, I'd check the 'denied' errors in your audit.log (cat /var/log/audit/audit.log | grep denied). If you see a lot of errors, you'll probably have to fix this in a Selinux policy file.
In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Ken Task -
Picture of Particularly helpful Moodlers

At risk of being told, if you have to ask that question, then you shouldn't attempt to participate (or offer advice) ... but ...

'.. each of the app servers ... - suggest load balancing setup? or you have one dedicated server per? ... dedicated DB, dedicated 'Files' (NSF), dedicated web server.

And, assuming NSF is a Linux thang, some details concerning OS/specs (including networking) would also give a clearer picture ... me thinks! ;)

'SoS', Ken

In reply to Ken Task

Re: Diagnosing issue with Moodle caching

by Alistair Spark -
Picture of Core developers Picture of Particularly helpful Moodlers Picture of Plugin developers
yes servers are load balanced. No caching on the load balancers.
4 apps (6CPU 16G RAM)
1 master/ 1 slave DB (was hoping to look at MySQL Galera Cluster next year) (6 CPU 48GB RAM)
single NFS server (was hoping to look at GlusterFS next year with one in each DC) (4 CPU 8GB RAM)
admin server to run cron. (4CPU 16GB)
evenly split between two datacenters
In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Alain Raap -
Picture of Particularly helpful Moodlers
Alistair we've also had issues with NFS4 and performance problems on Redhat 7, feel free to send me a DM to see if I can help you.

Here's a post of our problems with NFS and our solution, also take a look at caching of the temp files and the cache and localcache, don't store it on NFS!!

https://moodle.org/mod/forum/discuss.php?d=362709
In reply to Alain Raap

Re: Diagnosing issue with Moodle caching

by Alistair Spark -
Picture of Core developers Picture of Particularly helpful Moodlers Picture of Plugin developers
Hi Alain,

How do you not store temp and cache on the NFS?

they have to be in a shared location for all of the application servers.
In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Alain Raap -
Picture of Particularly helpful Moodlers
In our Moodle environment we don't have more than one application server. We store the cache, localcache and temp files on the application server and that is much faster, also for the temporary backup files that aren't saved over the network anymore. It works fine for us.
Average of ratings: Useful (1)
In reply to Alain Raap

Re: Diagnosing issue with Moodle caching

by Howard Miller -
Picture of Core developers Picture of Documentation writers Picture of Particularly helpful Moodlers Picture of Peer reviewers Picture of Plugin developers
Ok - yes it will. With multiple web servers you don't have that luxury.
In reply to Alain Raap

Re: Diagnosing issue with Moodle caching

by Howard Miller -
Picture of Core developers Picture of Documentation writers Picture of Particularly helpful Moodlers Picture of Peer reviewers Picture of Plugin developers
+1

It's hard to completely avoid caching to the filestore (NFS). Even if you use memcached/Redis for caching, some of it still wants to cache to files (NFS). I must confess that the MUC is largely witchcraft to me. You may have noticed me moaning in the past about the lack of useful documentation and support for what is such a critical feature so I'm going to take the opportunity to do that again big grin

Also, yeh, how do you not have temp files on NFS?
In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Alistair Spark -
Picture of Core developers Picture of Particularly helpful Moodlers Picture of Plugin developers
We seem to have uncovered a really odd behaviour of the Moodlesession cookie token changing multiple times when reloading the pages where the timeouts occur.
Seem to see some profile pictures or web service calls returning Set-Cookie.
In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Alistair Spark -
Picture of Core developers Picture of Particularly helpful Moodlers Picture of Plugin developers
For reference, tracked this down as being an issue with the Apache 2.4 which enabled caching of Moodle resources with Cache-Control public directive such as profile pictures, Site logo, etc but by default Apache caches the Set-Cookie Header with those, hence users being kicked out of their session, can be disabled via:
https://httpd.apache.org/docs/2.4/mod/mod_cache.html#cacheignoreheaders

In the end we just fully disabled apache caching as we should properly be thinking that through if we need it, it became enabled as it's a default / poor scoping internally of new Apache config as a combination of app servers being load balanced and reverse proxy setup leading to mod_proxy enabling this type of caching.
Average of ratings: Useful (2)
In reply to Alistair Spark

Re: Diagnosing issue with Moodle caching

by Sarra Lev -

I'm having both those same problems on moodle cloud. I don't know how to change the settings you are speaking about (or if it is possible on moodle cloud, but if you have suggestions, I'd be appreciative.