tricky sql problem in development of quiz reports for Moodle 2.0

tricky sql problem in development of quiz reports for Moodle 2.0

by Jamie Pratt -
Number of replies: 5
In the new quiz statistics report we have a somewhat tricky sql problem. We would like to be able to do the following calculation in the report :




$$C_{max}(x_p, X_p)$$ is defined as follows:

When you compute $$C(x_p, X_p)$$, you do the sum

$$\displaystyle C(x_p, X_p) = \frac{1}{S - 1} \sum_{s \in S} (x_p(s) - \bar{x}_p)(X_p(s) - \bar{X}_p)$$

which involves a term for each student combining their question score and rest of test score. That is, you start with an array of $$x_p(s)$$ with an array of corresponding $$X_p(s)$$, one for each $$s$$. To compute $$C_{max}(x_p, X_p)$$, you just sort these two arrays before applying the above formula. That is, for the purpose of computing $$C_{max}$$, you pretend that the first student scored the lowest $$x_p$$ and the lowest $$X_p$$, the second student scored the second lowest $$x_p$$ and the second lowest $$X_p$$, and so on to the last student, who scored the highest $$x_p$$ and $$X_p$$.




I have managed to do the bulk of the work calculating skewness, kurtosis and standard deviation of sets of grades in sql and avoided having to load the whole set of grades into memory. See here for some background of the calculations we are doing and here for the how we are calculating these in practise using sql that will work cross-db.

But both me and Tim Hunt are stumped as to how to do the $$C_{max}(x_p, X_p)$$ calc in sql.

$$\displaystyle C(x_p, X_p) = \frac{1}{S - 1} \sum_{s \in S} (x_p(s) - \bar{x}_p)(X_p(s) - \bar{X}_p)$$

Could be calculated with the following sql :

SUM((qs.grade - ?)*((qa.sumgrades - qs.grade) - ?))

Since qs.grade is selecting the grade for a question from the question state table and qa.sumgrades selects the whole attempt grade from the question attempt table.

We would fetch the averages in another query and replace the ? marks with the appropriate average. We would then divide the result by S-1 in php.

The problem - a general sql question


The problem with the $$C_{max}(x_p, X_p)$$ calculation is how to take two columns and seperately sort them. If we could do this we could feed the columns into our SUM calculation.

We thought it might be possible to use an extra table and do an INSERT ... SELECT ..... ORDER BY ... and then maybe do a JOIN to get the data out of that table. But there is no way number the rows we are inserting into the db and so to get them out in the order we are putting them in and join the values extracted and sorted seperately back into one table.


Average of ratings: -
In reply to Jamie Pratt

Re: tricky sql problem in development of quiz reports for Moodle 2.0

by Jamie Pratt -
Oh good! The Latex filter displays the equations correctly in Using Moodle.

But strangely indentation has stopped working mixed and I had to use those horizontal rulers above to mark quoted text.
In reply to Jamie Pratt

Re: tricky sql problem in development of quiz reports for Moodle 2.0

by Howard Miller -
Picture of Core developers Picture of Documentation writers Picture of Particularly helpful Moodlers Picture of Peer reviewers Picture of Plugin developers
As a matter of interest what actually *is* $$ C_{max}(x_p,X_p) $$ ? I have a maths degree, but it was 20-something years ago and I have next to no idea what you are talking about big grin

Plus... it's only a report, why do you need to do this all at once? There's no big performance issue is there?
In reply to Howard Miller

Re: tricky sql problem in development of quiz reports for Moodle 2.0

by Tim Hunt -
Picture of Core developers Picture of Documentation writers Picture of Particularly helpful Moodlers Picture of Peer reviewers Picture of Plugin developers
Well, you first need to understand that $$C(x_p, X_p)$$ is the correlation between the score the student got on this question, and the score they got on the rest of the test. That is a measure of how reliable this question is. For example, if the 'clever' students (that is, the ones that got a high mark on the test) got a low score for this question, and the 'stupid' students got a high mark, or even if the scores on this question are just random, and seem to be independent of the student's ability, then that is probably an indication that there is something wrong with this question.

Now the question is, what should you compare $$C(x_p, X_p)$$ with? what is good? Becuase of the way it is defined, it will always be between 1 and -1, so it is tempting to say that close to 1 is good, and 0 or less is bad. However it is not that simple. Depending on the distribution of scores that students got for this question, for example if it is an easy question that most people got right, then actially it is not possible for $$C(x_p, X_p)$$ to be 1.

So $$C_{max}(x_p, X_p)$$ is a way to calculate the biggest possible value for $$C(x_p, X_p)$$, given the distribution of scores that you have. The calculation is described at the link Jamie gave. The a value of $$C$$ close to $$C_{max}$$ is good, and one less than that is bad.
In reply to Howard Miller

Re: tricky sql problem in development of quiz reports for Moodle 2.0

by Jamie Pratt -
Plus... it's only a report, why do you need to do this all at once? There's no big performance issue is there?

I want to avoid having to load all the grades from the newest graded states into memory to do the calculation. There are a lot of calculations on the same page, I managed to do most of the calculations in sql without loading grades into memory but I can see no way of doing the sum for this calculation in sql because it is not possible to sort two columns independently in sql.

I don't think it is that important to do this in sql. Instead we may, IF the calculation takes a long time for a large data set, cache the results of the calculations, and have an indication of when the stats where last calculated and a recalculate now button on the report.