As the start of our semester nears, we're seeing repeated crashes of our Moodle server that appear to be related to scheduled tasks and the admin/cli/cron.php script. We're on Moodle 3.8. I can't tell with certainty which tasks are causing the problem, but I'm starting to look through the logs to identify the ones that take a long time to complete. We have a cron job on the server that runs the cron script every minute (this is the recommended practice, no?), but the tasks can take over a minute to complete. That leads to some odd entries in the logs, where the initial "Server Time" message is recorded but nothing else:
Cron completed at 02:02:07. Memory used 32MB.
Execution took 5.644233 seconds
Server Time: Fri, 22 Jan 2021 02:03:01 -0800
Server Time: Fri, 22 Jan 2021 02:04:01 -0800
Server Time: Fri, 22 Jan 2021 02:05:01 -0800
Skipping processing of scheduled tasks. Concurrency limit reached.
Cron script completed correctly
Cron completed at 02:05:33. Memory used 13.6MB.
Execution took 91.814146
That's from a 2am crash, when traffic on the server is low and we're not running our most intensive tasks either (we run a user synchronization task later in the night). You can see that the 02:03 and 02:04 cron runs left nothing in the logs beyond their start time. Does this mean that all the scheduled task runners were busy that whole time? I can't figure out what is running out of control, but on a hunch I disabled assignment annotation using ghostscript and document conversion using unoconv, so calls to these external binaries should no longer be able to overload the server. I'm not convinced that has solved all the problems, though. What's odd is that we didn't see these problems as often last semester, and nothing substantial has changed in between (same server, same Moodle version, no new plugins, etc.).
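A quick way to check which runs logged a start and nothing else is to scan the cron log for "Server Time" lines that are immediately followed by another "Server Time" line. Here is a minimal Python sketch, assuming the log has the same shape as the excerpt above and is passed in as a list of lines. The function name find_silent_runs is made up for illustration, and output interleaved from overlapping runs could hide a silent start, so treat the result as a hint rather than proof.

```python
import re

# Matches the first line each cron run writes, as seen in the log excerpt.
SERVER_TIME = re.compile(r"^Server Time: (.+)$")


def find_silent_runs(lines):
    """Return the 'Server Time' stamps that are directly followed by another
    'Server Time' line, i.e. runs that logged a start and nothing else."""
    silent = []
    pending_start = None  # last start stamp that has had no output after it yet
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = SERVER_TIME.match(line)
        if match:
            if pending_start is not None:
                silent.append(pending_start)
            pending_start = match.group(1)
        else:
            # Any other output means the most recent start was not silent.
            pending_start = None
    return silent
```

A start line at the very end of the log is not reported, since that run may simply still be in progress.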
Assuming we cannot provision any more resources for the server, what are the ways to prevent repeated runs of the cron script from ballooning out of control? The descriptions on /admin/settings.php?section=taskprocessing make it sound like increasing the concurrency limits (we use all the defaults on this page) would make things worse. Would decreasing the concurrency limits to 2 perhaps improve things? Would using something like cpulimit on the php process help or make things worse? I'm really running out of ideas.
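For reference, one generic way to keep cron invocations from piling up, independent of Moodle's own settings, is to wrap the command in a non-blocking file lock so that a new run is skipped while the previous one is still going. A minimal Python sketch follows, assuming a Linux server; run_exclusively, the lock path, and the command in the closing comment are hypothetical placeholders, not anything Moodle provides. Note that this serializes whole cron runs, which is stricter than Moodle's concurrency limit and may delay tasks.

```python
import fcntl
import subprocess


def run_exclusively(command, lock_path):
    """Run `command` only if nobody else holds the lock on `lock_path`.

    Returns the command's exit code, or None if the run was skipped
    because another run still holds the lock.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock is released when lock_file is closed at the end of the block.
        return subprocess.run(command).returncode


# Example call (hypothetical paths, not executed here):
# run_exclusively(["php", "admin/cli/cron.php"], "/tmp/moodle-cron.lock")
```

The system crontab would then call this wrapper every minute instead of calling php directly; a skipped run returns None, which the caller can log or ignore.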