Question: Does the server freeze after churning for, say, 15 hours, or is it unusable for all of those 15 hours?
Depending on the answer, it is possible that we are talking about two different things: a) automated course backups taking very long (half a day or more), b) the server freezes. I don't think both happen at the same time. You wrote "it was seemingly toward the end of the backup judging from the logs".
Assuming there are indeed two things, a) could be normal for those ridiculously large (>10 GB) courses when the tuning tool proposes 9 GB spool but you have 4 GB RAM.
Question: You noticed "live" (top command) 100 MB of swap? That doesn't mean anything. A true swap is comparable to the amount of RAM, order of GB. How much swap space is there?
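To put numbers on the swap question, here is a minimal sketch that reads the standard `MemTotal`, `SwapTotal` and `SwapFree` fields of `/proc/meminfo`. The function names are made up for illustration, and it assumes a Linux host.

```python
def parse_meminfo(text):
    """Return the fields of /proc/meminfo as a dict of integers (mostly kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def swap_summary(text):
    """Compare configured swap with physical RAM and report swap in use."""
    info = parse_meminfo(text)
    ram_kb = info["MemTotal"]
    swap_kb = info["SwapTotal"]
    used_kb = swap_kb - info["SwapFree"]
    return {
        "ram_gb": ram_kb / 1024 ** 2,
        "swap_total_gb": swap_kb / 1024 ** 2,
        "swap_used_mb": used_kb / 1024,
        # Ratio of configured swap to RAM; 0.0 if RAM could not be read.
        "swap_to_ram": swap_kb / ram_kb if ram_kb else 0.0,
    }
```

On the server itself you would call `swap_summary(open("/proc/meminfo").read())`. A `swap_total_gb` far below `ram_gb` suggests the swap space is undersized.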
You must have a monitoring tool running, which _records_ all the critical parameters - including the "steal" Howard mentioned. Munin is the best known one in this forum. (See the attached graph.)
To follow up b) you need developer level logs from https://docs.moodle.org/en/Debugging.
Are you the (Linux) system administrator or the Moodle administrator trying to link the helpers here with your institution's sys-admin? A Linux sys-admin won't call his system just "a Linux" and stop there.
;-(
The graphs you provided already have a lot of information.
- This is regular Moodle usage, no automated course backup happened? If so, the site is under heavy load - day and night(?)
- No "steal". So we can drop that topic.
- The two "hills" in CPU usage coincide with high iowait, which usually means db queries waiting for results.
- In your zoomed graph, iowait almost touches the user cpu usage. That is prohibitive.
- The server is swapping all the time. It shouldn't at all! See https://moodle.org/mod/forum/discuss.php?d=395746#p1595879.
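If you want to cross-check the iowait and steal figures without Munin, a rough sketch is to take two samples of the aggregate `cpu` line from `/proc/stat` a few seconds apart and compare them. The function names are illustrative only, and it assumes a kernel that reports at least the first eight counters (user through steal).

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) < len(FIELDS):
        raise ValueError("kernel reports fewer counters than assumed")
    return dict(zip(FIELDS, values))


def cpu_shares(before, after):
    """Return the percentage of time spent in each state between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {k: b[k] - a[k] for k in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {k: 100.0 * v / total for k, v in delta.items()}
```

An `iowait` share close to the `user` share over the sampling window is the same pattern as in the zoomed graph.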
Compare with the attached memory graph of a Moodle server under light load. (Physical memory 16 GB). The peaks during early hours are automated course backups.
You should invest time on the last point first. How much is the physical RAM now? What happens if you double it, for comparison? If no immediate reaction, start with the database. See Ken's posts on this subject.
Visit https://lo.rcsd.ca/munin (admin, moodle2020) to view my munin graphs.
There are no automated backups currently going. Backups are all disabled. I tried to run 1 backup of a >5 GB course manually via cli (sudo -u apache php admin/cli/backup.php --courseid=x) at about 7:35AM server time. You can see the CPU spike on the graph.
Moodle was so slow that it was unusable and I had to cancel the backup. Do any of these graphs give an indication of what is causing this? I see that swap spiked during this time but CPU and RAM seem fine..?
I'm going off-line, so just what I saw at a quick glance: Why does the network traffic rise with the backup? Is your target volume mounted through the network?
For me ...
Link to Munin produces:
Unauthorized
This server could not verify that you are authorized to access the document requested. Either you supplied the wrong credentials (e.g., bad password), or your browser doesn't understand how to supply the credentials required.
Besides that ... IMHO ... of all the pieces that affect performance of a Moodle, database server config ... as well as how moodle code interacts with DB server *IS* the most important.
So have you tried installing and running MySQLTuner? And in your case, might have to run it from web server as well as DB server.
Ok, spike was present but what caused the spike? Was it mysqld?
Destination of a CLI backup was missing from command you issued ... which means moodle will use moodledata/temp/backup/ as a build area. Where is moodledata? On the same server as the web service?
One of the checks MySQLTuner makes is overall memory ... memory for DB server ... also memory for other processes server needs to function.
Here's a clip from a Moodle server that has 16Gig under VMWare virtual for a K12 entity with over 9000 users ... typically used in a blended fashion.
[--] Physical Memory : 15.6G
[--] Max MySQL memory : 4.4G
[--] Other process memory: 1.5G
[OK] Maximum reached memory usage: 3.3G (20.96% of installed RAM)
[OK] Maximum possible memory usage: 4.4G (28.44% of installed RAM)
[OK] Overall possible memory usage with other process is compatible with memory available
[OK] Slow queries: 0% (0/52M)
and of InnoDB:
[OK] InnoDB buffer pool / data size: 3.0G/2.2G
buffer pool must be larger than actual data size or DB server will suffer performance.
From same server ... top set to show what's using SWAP:
Mem: 16333820k total, 15987068k used, 346752k free, 1079616k buffers
Swap: 8208380k total, 102808k used, 8105572k free, 10316708k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP COMMAND
7706 mysql 20 0 5157m 3.5g 5168 S 0.3 22.4 201:05.56 60m mysqld
Yep, it's using SWAP ... which isn't good, but considering it's an all in one box not much wiggle room exists for tweaks. Backups cannot be run during prime time usage ... and some courses require CLI backup on weekends when site isn't used as much as 'prime time' during week.
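If your `top` doesn't offer a SWAP column, the same per-process figure can be read from the `VmSwap` line of `/proc/<pid>/status`. A small sketch, assuming a Linux host; the function names are hypothetical, and you need root to see other users' processes.

```python
import os


def vmswap_kb(status_text):
    """Return the VmSwap value (kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            return int(line.split()[1])
    return 0  # kernel threads have no VmSwap line


def top_swappers(proc_root="/proc", limit=5):
    """List (swap_kb, pid, name) for the processes holding the most swap."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process exited or permission denied
        # The first line of the status file looks like "Name:<tab>mysqld".
        name = text.split("\n", 1)[0].split(":", 1)[-1].strip()
        rows.append((vmswap_kb(text), int(entry), name))
    rows.sort(reverse=True)
    return rows[:limit]
```

If `mysqld` tops the list returned by `top_swappers()`, that points at the database as the process being pushed out of RAM.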
'SoS', Ken
Hesitant to share a my.cnf file from another server ... tempting to copy and paste, and doing so with no real understanding could result in more issues for another reader.
Will comment on what you have shared, however ...
[!!] InnoDB buffer pool / data size: 128.0M/9.0G
not good ...
Recommendations Tuner made:
innodb_buffer_pool_size (>= 9.0G) if possible.
I think you can set it equal to that value. Percona folks (I think) recommend increasing InnoDB buffer pool instances 1 per 1 Gig ... so in your case 9 ... but have had servers with fewer instances perform just fine. Most tweaking one does gradually ... run a day ... check again. Could push into diminishing returns ... ie, opposite of desired outcome.
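The arithmetic behind that recommendation can be sketched as below. The helper names are made up for illustration. It assumes the Tuner line keeps the `pool/data` format shown above, and the one-instance-per-Gig figure is only the rule of thumb from this thread. Whether the instances setting applies at all depends on your MySQL/MariaDB version.

```python
import math
import re

UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}
SIZE = r"([\d.]+)([BKMGT])"


def buffer_pool_check(line):
    """Compare InnoDB buffer pool and data size from one MySQLTuner line."""
    m = re.search(r"InnoDB buffer pool / data size:\s*" + SIZE + r"\s*/\s*" + SIZE, line)
    if not m:
        raise ValueError("not an InnoDB buffer pool line")
    pool = float(m.group(1)) * UNITS[m.group(2)]
    data = float(m.group(3)) * UNITS[m.group(4)]
    data_gib = math.ceil(data / UNITS["G"])
    return {
        "pool_covers_data": pool >= data,
        # Smallest whole number of GiB that holds the current data set.
        "suggested_pool_gib": data_gib,
        # Thread rule of thumb (an assumption, not a hard rule): 1 instance per GiB.
        "suggested_instances": max(1, data_gib),
    }
```

Treat the result as a starting point only. Any pool size has to fit in physical RAM alongside Apache/PHP, which on a 4 GB box it plainly will not.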
Now will this one tweak solve overall ... no ... like you said, teachers desiring to backup course during prime time you can't control. Can you ID which teachers courses and do some research on those courses?
Could you determine when is prime time and then request of those teachers NOT to backup at those time(s)?
From this:
[OK] Highest usage of available connections: 34% (52/151)
Appears prime time peak ... at least with this run of tuner ... shows only 52 connections and you have plenty of room to spare there with the default max of 151.
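For anyone who wants to track that headroom across several Tuner runs, here is a tiny sketch that pulls the `(used/max)` pair out of the line. The function name is hypothetical and it assumes the line format shown above.

```python
import re


def connection_headroom(line):
    """Extract used/max connections from a MySQLTuner line and compute headroom."""
    m = re.search(r"\((\d+)/(\d+)\)", line)
    if not m:
        raise ValueError("no (used/max) pair found")
    used, limit = int(m.group(1)), int(m.group(2))
    return {
        "used": used,
        "max": limit,
        "percent": 100.0 * used / limit,
        "free": limit - used,
    }
```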
One of the other things I install is Logwatch ... runs once a day and sends summary of server activity to designated user. Reason ... one might be surprised how many machines are out there whacking away at your server ... scans for research ... or so they say ... could be hitting your server more frequently than desired and during prime time.
'SoS', Ken
I would do a very careful audit of the machine configuration to check for any external shares etc. that might be lurking.