Moodle in English: Sudden slowdown, seemingly related to get_user_access_sitewide()

Sudden slowdown, seemingly related to get_user_access_sitewide()

by David Siebörger - Tuesday, 13 October 2009, 12:12 AM

Our Moodle site had been running quite happily for a number of months, but suddenly, for the past few days, it's running into performance problems. No-one here is aware of any changes in usage that could explain the increased load.

The first sign was that users started seeing the "Error: Database connection failed". That was because MySQL had reached its connection limit. top showed that mysqld was using all available CPU time, and running "show processlist" in MySQL shows a large number of threads executing the last query in get_user_access_sitewide() stuck in the "Copying to tmp table" state -- like this:

| 325 | moodleuser | localhost | moodle | Query | 0 | Copying to tmp table | SELECT sctx.path, ra.roleid,
 ctx.path AS parentpath,
 rco.capability, rco.permission
 FROM mdl_role_assignments ra
 JOIN mdl_context ctx
 ON ra.contextid=ctx.id
 JOIN mdl_context sctx
 ON (sctx.path LIKE CONCAT(ctx.path,'/%') )
 JOIN mdl_role_capabilities rco
 ON (rco.roleid=ra.roleid AND rco.contextid=sctx.id)
 WHERE ra.userid = 11095
 AND ctx.contextlevel <= 40
 AND sctx.contextlevel <= 50
 ORDER BY sctx.depth, sctx.path, ra.roleid |

While I do see some other queries in the processlist, there's usually only one or two of them, with up to 20 of those get_user_access_sitewide() queries. Sometimes, there are up to 5 _identical_ (i.e. with userid set to the same value) queries running simultaneously -- which seems weird?

The three tables involved (mdl_role_assignments, mdl_context, mdl_role_capabilities) contain 50516, 64108 and 7681 records respectively. There are 7097 user accounts in Moodle.

I've checked all sorts of things to try to make that query execute faster: running myisamchk, optimize table, analyze table, converting those three tables to InnoDB, increasing the size of the query cache, checking that all the right indexes exist (by comparing against a clean install). Nothing's really helped: I'm coming to the conclusion that the query is just plain slow, but that it's being run too often. Explain and MySQL profiling show that the query spends all its time "Copying to tmp table" because of the "order by".

What has helped a little has been adding more VCPUs to the virtual machine, but that's just buying a bit more headroom, not actually solving the problem. I'm still seeing those queries queuing up in the processlist but less often.

Does anyone have any ideas? Would converting to PostgreSQL fare better in this circumstance? I've been hesitant to try, because I'm more familiar with MySQL and the conversion seems quite complex -- and if PostgreSQL also struggles with the same query, then I've achieved nothing.

Environment: Moodle 1.9.5, PHP 5.2, MySQL 5.0.86, FreeBSD/i386 7.1-RELEASE
Hardware: 2/4 VCPUs, 3 GB RAM, VMware ESX 3.5 (the host is not heavily loaded)
Usage: 50-100 simulatenous users in peak times

Average of ratings: -

Re: Sudden slowdown, seemingly related to get_user_access_sitewide()

by Juan Segarra Montesinos - Wednesday, 14 October 2009, 1:52 AM

Hi David

I'm facing the same problem here... with no luck at the moment. I have not much time now sad

Can you send an EXPLAIN of the query?

EXPLAIN SELECT sctx.path, ra.roleid,
 ctx.path AS parentpath,
 rco.capability, rco.permission
 FROM mdl_role_assignments ra
 JOIN mdl_context ctx
 ON ra.contextid=ctx.id
 JOIN mdl_context sctx
 ON (sctx.path LIKE CONCAT(ctx.path,'/%') )
 JOIN mdl_role_capabilities rco
 ON (rco.roleid=ra.roleid AND rco.contextid=sctx.id)
 WHERE ra.userid = 11095
 AND ctx.contextlevel <= 40
 AND sctx.contextlevel <= 50
 ORDER BY sctx.depth, sctx.path, ra.roleid

And maybe some server setting like join_buffer_size, etc.

Maybe someone here can help us. I'll try to provide feedback too ;)

Regards

Average of ratings: -

Re: Sudden slowdown, seemingly related to get_user_access_sitewide()

by David Siebörger - Wednesday, 14 October 2009, 5:29 AM

Here's the output from EXPLAIN:

*************************** 1. row ***************************
 id: 1
 select_type: SIMPLE
 table: ra
 type: ref
possible_keys: contextid-roleid-userid,roleid,contextid,userid
 key: userid
 key_len: 4
 ref: const
 rows: 11
 Extra: Using temporary; Using filesort
*************************** 2. row ***************************
 id: 1
 select_type: SIMPLE
 table: rco
 type: ref
possible_keys: roleid-contextid-capability,roleid,contextid
 key: roleid-contextid-capability
 key_len: 4
 ref: moodle.ra.roleid
 rows: 76
 Extra:
*************************** 3. row ***************************
 id: 1
 select_type: SIMPLE
 table: ctx
 type: eq_ref
possible_keys: PRIMARY,mdl_cont_conins_uix
 key: PRIMARY
 key_len: 4
 ref: moodle.ra.contextid
 rows: 1
 Extra: Using where
*************************** 4. row ***************************
 id: 1
 select_type: SIMPLE
 table: sctx
 type: eq_ref
possible_keys: PRIMARY,mdl_cont_conins_uix
 key: PRIMARY
 key_len: 4
 ref: moodle.rco.contextid
 rows: 1
 Extra: Using where

The "Using temporary; Using filesort" in the first row is what really hurts. If I remove the ORDER BY from the query, "Using temporary; Using filesort" disappears from the EXPLAIN output, and the query is almost instantaneous. I trust that the ORDER BY is significant to the get_user_access_sitewide() function, so I haven't tried removing it.

These are the tuning options from my.cnf:

key_buffer_size=128M
table_cache=768
thread_cache_size=8
max_heap_table_size=256M
tmp_table_size=192M
query_cache_size=60M
query_cache_limit=5M
join_buffer_size=3M

I've experimented with increasing join_buffer_size beyond that, but found no improvement.

Average of ratings: -

Re: Sudden slowdown, seemingly related to get_user_access_sitewide()

by David Siebörger - Wednesday, 21 October 2009, 4:25 PM

I seem to have fixed -- or worked around -- the problem, and our Moodle's running happily again now.

I tried a whole lot of things that made no perceptible difference, but two things did help:

switching to a memcached record cache made a slight (but noticeable) improvement. I'd previously tried internal and eAccelerator record caches (I've been using eAccelerator for PHP caching for ages) record caches, but they didn't seem to help at all. memcached was actually quite easy to install.
switching from the MyISAM storage engine to InnoDB made a huge difference. I reckoned I'd try that before trying PostgreSQL because the migration is easy (it's just a case of setting up InnoDB storage then "ALTER TABLE x ENGINE=InnoDB" for each table), and I knew there's a migration path back to MyISAM if it turned out to make things worse.

Average of ratings: Useful (1)

Re: Sudden slowdown, seemingly related to get_user_access_sitewide()

by Juan Segarra Montesinos - Wednesday, 21 October 2009, 7:02 PM

Nice news ;)

I've not seen the problem for some weeks now. I have always had registry cache and Innodb.

Maybe problem had to do with some weird loking problem or some intensive I/O operations... don't know. I'm waiting to see this again ;)

Congratulations

Average of ratings: -

Hardware and performance

Sudden slowdown, seemingly related to get_user_access_sitewide()

Sudden slowdown, seemingly related to get_user_access_sitewide()

Re: Sudden slowdown, seemingly related to get_user_access_sitewide()

Re: Sudden slowdown, seemingly related to get_user_access_sitewide()

Re: Sudden slowdown, seemingly related to get_user_access_sitewide()

Re: Sudden slowdown, seemingly related to get_user_access_sitewide()

Empowering educators to improve our world

Moodle

Support

Get Involved

Contributions

Downloads

Tracker

Development

Empowering educators to improve our world