We have performance issues on certain tasks. Overall browsing and navigation on the platform seem fine, but if you try to, for instance, perform a course backup or restore, you can find yourself waiting over a minute for a 30 MB backup.
Since moodledata is stored on EFS, we thought having the temp folder on that filesystem was to blame. However, after testing on one frontend with the temp folder set on that machine's local disk and seeing exactly the same times, we thought otherwise.
We tested the backup task by running the script from the console, but when doing it from the web UI you can see the progress bar filling up, whereas before (when the platform was not clustered) the process was seamless and practically instantaneous.
There's something weird going on when accessing files, and it does not seem to have anything to do with EFS: throughput stats, traffic volume and all other metrics seem to be OK, with no cap reached.
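One thing the aggregate throughput metrics may not show is per-operation latency on many small files. A minimal Python sketch for comparing that between a local directory and the EFS mount could look like the following. The function name and both paths are hypothetical placeholders, and the read timings may be flattered by the page cache, so treat the numbers as a rough comparison only.

```python
import os
import statistics
import tempfile
import time


def time_small_file_ops(directory, count=200, size=4096):
    """Write, stat, read and delete `count` small files inside `directory`.

    Returns the median latency in milliseconds for each operation.
    """
    payload = os.urandom(size)
    timings = {"write": [], "stat": [], "read": [], "delete": []}

    def timed(op, func):
        start = time.perf_counter()
        func()
        timings[op].append((time.perf_counter() - start) * 1000.0)

    # Work in a throwaway subdirectory so nothing is left behind.
    with tempfile.TemporaryDirectory(dir=directory) as workdir:
        for i in range(count):
            path = os.path.join(workdir, f"probe_{i}.bin")

            def write():
                with open(path, "wb") as fh:
                    fh.write(payload)
                    fh.flush()
                    os.fsync(fh.fileno())  # force the data out to storage

            def read():
                with open(path, "rb") as fh:
                    fh.read()

            timed("write", write)
            timed("stat", lambda: os.stat(path))
            timed("read", read)
            timed("delete", lambda: os.remove(path))

    return {op: statistics.median(values) for op, values in timings.items()}


if __name__ == "__main__":
    # Hypothetical paths: replace with a local directory and the real EFS mount.
    targets = [("local", "/tmp"), ("efs", "/mnt/efs/moodledata/temp")]
    for label, directory in targets:
        if os.path.isdir(directory):
            print(label, time_small_file_ops(directory))
        else:
            print(label, "skipped, directory not found:", directory)
```

If the medians on the EFS path come out much higher than on the local disk while throughput stays well under the limits, that would point at latency per file operation rather than bandwidth.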
Another clue, not related to course backups/restores, is the huge wait times when pluginfile.php is invoked. Those requests take anywhere from almost no time to 5-7 minutes! I understand this script is used for a wide variety of cases, which would explain the time differences, but we have no clue how to trace this.
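One possible starting point for tracing this is to pull the slowest pluginfile.php requests out of the web server access log and look at which files they point to. The sketch below is in Python and assumes an Apache "combined" log format with `%D` (time taken to serve the request, in microseconds) appended as the last field. That log format, the function name and the sample lines are all assumptions for illustration, so the pattern would need adjusting to the real LogFormat.

```python
import re

# Assumed format: Apache "combined" plus %D (microseconds) as the final field.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ .* (?P<micros>\d+)$'
)


def slow_requests(lines, needle="/pluginfile.php", threshold_s=5.0):
    """Return (seconds, status, url) tuples for matching requests that took
    at least `threshold_s` seconds, slowest first."""
    results = []
    for line in lines:
        match = LINE_RE.search(line.strip())
        if not match or needle not in match["url"]:
            continue
        seconds = int(match["micros"]) / 1_000_000
        if seconds >= threshold_s:
            results.append((seconds, int(match["status"]), match["url"]))
    return sorted(results, reverse=True)


if __name__ == "__main__":
    # Made-up sample lines, only to show the expected shape of the input.
    sample = [
        '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] '
        '"GET /pluginfile.php/123/mod_resource/content/1/file.pdf HTTP/1.1" '
        '200 52341 "-" "Mozilla/5.0" 412345678',
        '10.0.0.2 - - [10/Oct/2023:13:55:37 +0000] '
        '"GET /pluginfile.php/5/user/icon/boost/f1 HTTP/1.1" '
        '200 2048 "-" "Mozilla/5.0" 1200',
    ]
    for seconds, status, url in slow_requests(sample):
        print(f"{seconds:8.1f}s  {status}  {url}")
```

With a real log, the same function can be fed an open file object, for example `slow_requests(open(path))`. Grouping the slow URLs by component or file area might show whether the delays cluster on particular files or are spread evenly.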
The other performance issue we see is when looking up users in the chat module (upper right bubble icon): sometimes it takes ages to find the user and even ends in 503 errors. I guess this should have nothing to do with the file performance issues, or does it?
Any help to troubleshoot or ideas to try are welcome.