Performance issues on clustered environment

Red Nano -
Number of replies: 13

We have performance issues on certain tasks. Overall browsing and navigation on the platform seem fine, but if you try, for instance, a course backup or restore, you can find yourself waiting over a minute for a 30 MB backup.

Since moodledata is stored on EFS, we thought having the temp folder on that file system was to blame. But after testing on one frontend with the temp folder on that machine's local disk and seeing exactly the same times, we thought otherwise.

We tested the backup tasks by running the script from the console, but when doing it from the web UI you can watch the progress bar filling up, whereas before (when the platform was not clustered) the process was seamless and instantaneous.

There's something weird going on when accessing files, and it does not seem to have anything to do with EFS: throughput stats, traffic volume and all other metrics seem OK, with no cap reached.
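
One thing the throughput metrics would not show is per-file latency: a backup touches many small files, and each create, sync and delete is a round trip on a network file system. The following is only a rough sketch of how the two locations could be compared. The function names and the probe file naming are made up for illustration, and the EFS path in the comment is a placeholder, not a path taken from this setup.

```python
import os
import statistics
import tempfile
import time


def time_small_file_ops(directory, count=200, size=4096):
    """Create, fsync, stat and delete `count` small files; return seconds per file."""
    payload = b"x" * size
    timings = []
    for i in range(count):
        # Hypothetical probe file name; anything unique per process works.
        path = os.path.join(directory, f"latency_probe_{os.getpid()}_{i}.tmp")
        start = time.perf_counter()
        with open(path, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # force the write through to storage
        os.stat(path)
        os.remove(path)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return count, median, 95th percentile and maximum in milliseconds."""
    ordered = sorted(timings)
    p95_index = min(len(ordered) - 1, int(len(ordered) * 0.95))
    return {
        "count": len(ordered),
        "median_ms": statistics.median(ordered) * 1000,
        "p95_ms": ordered[p95_index] * 1000,
        "max_ms": ordered[-1] * 1000,
    }


if __name__ == "__main__":
    # Only the local temp dir is probed here. To compare, add an entry such as
    # ("efs", "/path/to/moodledata/temp") -- a placeholder, adjust to your mount.
    for label, directory in [("local", tempfile.gettempdir())]:
        print(label, summarize(time_small_file_ops(directory, count=50)))
```

If the median per-file time on the shared mount comes out far higher than on local disk while throughput stays well under the cap, that would point at latency rather than bandwidth.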

Another clue, not related to course backup/restore, is the huge wait times when pluginfile.php is invoked. Those requests take anywhere from practically no time to 5-7 minutes! I understand this script is used for a wide variety of cases, which explains some of the variation, but we have no clue how to trace this.
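
A possible starting point for tracing this is the web server access log, provided it records how long each request took. The sketch below assumes a combined-format log with the request duration in seconds appended as the last field; that layout, the threshold and the sample lines are all assumptions for illustration, not taken from this platform. It lists the slowest pluginfile.php requests and counts them per hour, so they can be lined up against backups or scheduled tasks.

```python
import re
from collections import Counter

# Assumed layout (not from the thread): combined access log with the request
# duration in seconds appended as the last field.
LOG_PATTERN = re.compile(
    r'\[(?P<day>[^:]+):(?P<hour>\d{2}):[^\]]*\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) .*? (?P<seconds>\d+\.\d+)$'
)


def slow_requests(lines, needle="pluginfile.php", threshold=5.0):
    """Return (seconds, hour bucket, status, path) for slow matching requests, slowest first."""
    hits = []
    for line in lines:
        match = LOG_PATTERN.search(line.strip())
        if not match or needle not in match["path"]:
            continue
        seconds = float(match["seconds"])
        if seconds >= threshold:
            bucket = f'{match["day"]} {match["hour"]}h'
            hits.append((seconds, bucket, match["status"], match["path"]))
    return sorted(hits, reverse=True)


def slow_by_hour(hits):
    """Count slow requests per hour bucket."""
    return Counter(bucket for _, bucket, _, _ in hits)


# Synthetic sample lines, invented purely to exercise the parser.
SAMPLE = [
    '10.0.0.1 - - [12/Oct/2021:10:15:32 +0000] "GET /pluginfile.php/123/mod_resource/content/1/a.pdf HTTP/1.1" 200 1048576 "-" "Mozilla/5.0" 312.450',
    '10.0.0.2 - - [12/Oct/2021:10:16:01 +0000] "GET /course/view.php?id=5 HTTP/1.1" 200 20480 "-" "Mozilla/5.0" 0.120',
    '10.0.0.3 - - [12/Oct/2021:11:02:10 +0000] "GET /pluginfile.php/77/user/icon/boost/f1 HTTP/1.1" 200 2048 "-" "Mozilla/5.0" 0.004',
    '10.0.0.4 - - [12/Oct/2021:10:40:00 +0000] "GET /pluginfile.php/9/course/summary/b.png HTTP/1.1" 503 0 "-" "Mozilla/5.0" 420.000',
]

if __name__ == "__main__":
    hits = slow_requests(SAMPLE)
    for seconds, bucket, status, path in hits:
        print(f"{seconds:8.3f}s  {bucket}  {status}  {path}")
    print(dict(slow_by_hour(hits)))
```

The second path segment after pluginfile.php is the component (for example mod_resource), so grouping the slow hits by that segment may also show whether one plugin or file area is responsible.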


The other performance issue we see is when looking up users in the chat module (the bubble icon in the upper right): sometimes searching for a user takes ages and even ends in 503 errors. I guess this has nothing to do with the file performance issues, or does it?


Any help to troubleshoot or ideas to try are welcome.

In reply to Red Nano

Re: Performance issues on clustered environment

Howard Miller -
"perform a course backup or restore and you can see yourself waiting for 1+ minute for a 30Mb backup"

That's about what I would expect. Backup and restore are not quick.
In reply to Howard Miller

Re: Performance issues on clustered environment

Red Nano -
For some reason, this process was much faster on a standalone setup, though.
In reply to Red Nano

Re: Performance issues on clustered environment

Howard Miller -
The first thing I ask about is caching. If you are caching to NFS (which would be the default), then everything is going to be slow. If in doubt, get yourself Redis. Move the sessions and the caches to Redis and the locking to the database.
In reply to Howard Miller

Re: Performance issues on clustered environment

Red Nano -
Thanks for the reply!
Caching is being done to NFS (well, Amazon's EFS, but it amounts to the same thing).
Local caches are kept on ramdisks on each frontend.
What is still not clear to me is how to move the locking (I guess you mean file locking) to the database. Can you point me to the procedure to configure this?
Regards!
In reply to Howard Miller

Re: Performance issues on clustered environment

Red Nano -
OK, so I set both the Application and Session modes to Redis.
I made sure locking is set to database, but I still have the same issues.
I hope I did it right: on the cache configuration page, I mapped both the Application and Session stores to a Redis store that I had previously defined, and that's all, right?
In reply to Red Nano

Re: Performance issues on clustered environment

Visvanath Ratnaweera -
A random guess: Do you have https://moodle.org/plugins/tool_asynccourseimport in operation?

Are you saying the performance varies wildly for the same task? Either you need a detailed debug trace, through Moodle's https://docs.moodle.org/en/Debugging or through the browser's developer tools, or you need to correlate the incidents with other events, such as busy hours, people or Moodle taking backups, or other scheduled tasks in the background.
In reply to Visvanath Ratnaweera

Re: Performance issues on clustered environment

Red Nano -
Hi and thanks for the reply.
No, the task always takes the same time to complete. The difference shows when compared to a non-distributed environment, where the same task on the same course would take significantly less time.
I point to the backup and restore operations because they might have something to do with what's wrong with the whole infrastructure.
We have reports of slowness, but they are always related to tasks other than normal browsing. That is, while browsing, the infrastructure seems quite fast; where we noticed the most impact was during backup and restore operations.
In reply to Red Nano

Re: Performance issues on clustered environment

Visvanath Ratnaweera -
Is this a continuation of your problem from July, "How do I improvbe file read / write on NFS (EFS)?", or of the one from April, "Clustered Moodle disk access speed (NFS)"? Or to put it another way, have you solved those problems? It is always good to post a report once a problem is solved.

So you are saying the clustered version performs worse than the non-clustered version? To start from the beginning, what problem did the non-clustered version have that made you go for a cluster?


In reply to Visvanath Ratnaweera

Re: Performance issues on clustered environment

Red Nano -
Seems like those issues were fixed.
After a lot of troubleshooting, we saw we were not reaching any data access speed caps on AWS and decided that the disk access speeds were ok.
The platform's overall performance seems fine, but some processes look like they are hitting timeouts or running very slowly.
These processes are, for example: course backup and restore, user lookup via the IM bubble icon (the user session would lock up for 5 minutes!) and overall slowness on requests involving "pluginfile.php". These are the issues we are trying to fix today.

We went from on-premise to AWS because last year we experienced a sudden spike in concurrent users and had to tune the whole platform on the go (thanks, COVID!).
We ended up migrating to the cloud after a couple of outages in which our platform lost connectivity, and saw it as the perfect moment to take advantage of the elasticity and availability of the cloud platform.
In reply to Red Nano

Re: Re: Performance issues on clustered environment

Sergio Rabellino -
IMHO all of your problems are about the lock factory: check which one you are using and test the effectiveness of your locking semantics.
In reply to Sergio Rabellino

Re: Re: Re: Performance issues on clustered environment

Red Nano -
Ok, can you guide me to an article or point me in the right direction?
In reply to Red Nano

Re: Re: Re: Re: Performance issues on clustered environment

Sergio Rabellino -
These are the dev pages on locking: https://docs.moodle.org/dev/Lock_API
If you didn't change config.php, the lock factory should be the database one. In a clustered environment I usually prefer a specially crafted Redis instance (Sentinel-based or not) used only for locking, together with this plugin: https://github.com/open-lms-open-source/moodle-local_redislock

(cf. https://moodle.org/mod/forum/discuss.php?d=398444 )
In reply to Red Nano

Re: Performance issues on clustered environment

Visvanath Ratnaweera -
> Seems like those issues were fixed.

You don't know? Then they can come back, without you knowing! Start there, draw a line by posting the reports and then come back.

> The platform overall performance seems fine but for some process, it looks like it's reaching some timeouts or some very slow process.

This reminds me of trouble I had with an over-treated FreeBSD 12.1. I thought I knew FreeBSD from some work a long time ago; that was a mistake. Luckily there was somebody on the team who knew FreeBSD. The moral is: you have to know your operating system. The things you are talking about, AWS, Elastic and so on, are Greek to me. So I can't contribute productively, I'm sorry.