Computer adaptive test with item response analysis

by John Benn -
Number of replies: 25

Hi

So far I have just been a user and would like to say that I am really appreciative of all the work people have put into Moodle.

Now to my question - I want to create a series of placement tests for language learners in a number of different languages. The usual thing is to write 50 multichoice questions and let them get on with it. However, I'd like to do better than that. I'd like to reduce the number of questions a student needs to answer before being slotted into a level - answering 50 questions is boring and many people drop out before completion.

The obvious answer is a CAT, which chooses the next question presented to a candidate based on their answer to the current question, thus cutting down considerably on the number of questions put to them. Ideally there should be a module to analyse responses so questions can be improved. Is anyone working on this? Does it exist already?

Thanks

In reply to John Benn

Re: Computer adaptive test with item response analysis

by Tim Hunt -

No, it does not already exist.

It would be really nice if it did exist, and it has been talked about before in this forum. Try a search for Computer adaptive test or similar phrases.

In reply to Tim Hunt

Re: Computer adaptive test with item response analysis

by John Benn -

Thanks - the search didn't seem to be working the day I put this up. I found a document from 2007 where CAT/IRT was proposed for Moodle, although not specifically for language learning. I assume it got put on the back burner. I remember seeing an implementation written in BASIC a couple of years back, although it was lacking item analysis. Maybe I could see what could be done with that.

In reply to John Benn

Re: Computer adaptive test with item response analysis

by Gordon Bateson -

Hi John,

The QuizPort module for Moodle 1.x, which is the successor to the HotPot module, may be useful for you. You can add pre- and post-conditions to the quizzes. Pre-conditions prevent access to a quiz until certain conditions on other quizzes have been satisfied. Post-conditions control what is shown next depending on the results of the quiz that was just finished.

You can download the QuizPort module from here:

Docs are here:

Comparison of QuizPort and Lesson module here:

The QuizPort module provides detailed statistics reports on attempts at the quizzes, which include averages and a discrimination index for each question.
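
In case the term is new to anyone, the classic upper/lower-group version of a discrimination index only takes a few lines to compute. Here is a generic Python sketch for illustration only; it is not the code the reports actually use, and the 27% group split is just the conventional choice:

```python
# Generic upper/lower-group discrimination index (classical test theory).
# Shown for illustration only; the exact formula in any given report may differ.

def discrimination_index(total_scores, item_correct, fraction=0.27):
    """total_scores: {student: total test score}
    item_correct: {student: True/False for one particular question}"""
    ranked = sorted(total_scores, key=total_scores.get, reverse=True)
    k = max(1, round(len(ranked) * fraction))   # size of the upper/lower groups
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(item_correct[s] for s in upper) / k
    p_lower = sum(item_correct[s] for s in lower) / k
    return p_upper - p_lower                    # +1 = perfect discrimination

# Example: high scorers get this question right, low scorers get it wrong.
scores = {"ann": 48, "bob": 41, "cat": 30, "dee": 22, "eve": 11}
answers = {"ann": True, "bob": True, "cat": True, "dee": False, "eve": False}
print(discrimination_index(scores, answers))    # 1.0 for this toy data
```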

regards
Gordon

In reply to Gordon Bateson

Re: Computer adaptive test with item response analysis

by John Benn -

Hi Gordon

Thanks for the heads up. Yes, this looks like a solution to me. I'll see what I can do with it. Thanks again!

All the best

John

In reply to John Benn

Re: Computer adaptive test with item response analysis

by Joseph Rézeau -

Hi John,

Moodle's Lesson activity can do that, up to a point. It features a "branching" system, which means the jump to the following page can be based on the answer to, for example, an MCQ on the current page.

One problem I foresee with your "adaptive" method is that, if a student selects the correct answer to a question simply by chance, they will then be asked a question which does not correspond to their "level". I am not aware of any existing language placement tests based on the method you describe. Can you give some (online) references?

Joseph

In reply to Joseph Rézeau

Re: Computer adaptive test with item response analysis

by John Benn -

Hi, thanks for the responses.

Unfortunately the Lesson module does not provide for the analysis of responses, and it doesn't seem sufficiently flexible even in conjunction with a question bank.

The test doesn't stop at the point where a student gives a wrong answer; it continues until the standard deviation stays within predetermined boundaries. If a student guesses an answer correctly, the next question will be more difficult, so they are likely to get it wrong and be presented with another question. This is why item response analysis is so important - it helps to grade the questions accurately and to weed out bad questions.
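
To make those mechanics a bit more concrete, here is a very rough sketch of such a loop in Python. Everything in it is invented for illustration (the step size, the five-question window, the stopping threshold); a real CAT would maintain a proper IRT ability estimate and stop on its standard error rather than watching the difficulty track settle:

```python
# Very rough sketch of the adaptive loop described above. All the numbers here
# (step size, window, threshold, maximum items) are invented for illustration;
# a real CAT would maintain an IRT ability estimate and stop when its standard
# error, not the spread of the difficulty track, falls below a threshold.
import statistics

def run_adaptive_test(ask, start=0.0, step=0.5, window=5, threshold=0.3, max_items=30):
    """ask(difficulty) -> True/False: poses a question of roughly that difficulty."""
    difficulty = start
    track = []
    for _ in range(max_items):
        correct = ask(difficulty)
        difficulty += step if correct else -step      # harder after a right answer
        track.append(difficulty)
        recent = track[-window:]
        if len(recent) == window and statistics.pstdev(recent) < threshold:
            break                                     # the estimate has settled
    return statistics.mean(track[-window:])           # crude placement score
```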

The GMAT uses a CAT for its Mathematics and English language components. This is not a placement test, of course, but the topic has been discussed and there is no real reason why the approach couldn't be applied to such tests.

Here is a link to the University of Minnesota site where there is some general information about CAT for language testing.

 http://www.carla.umn.edu/assessment/CATfaq.html

The second link is a rather elderly tutorial on CAT and IRT - things have moved on a long way since then, but the theory is broadly the same - you might want to skip the math.

http://echo.edres.org:8080/scripts/cat/catdemo.htm

In reply to John Benn

Re: Computer adaptive test with item response analysis

by Glenys Hanson -

Hi Joseph,

Where I used to work, we had the OUP Quick Placement Test for some time. It's an adaptive test which relates the scores to ALTE levels. The test itself seemed fine (attractively presented, with pictures), but we had a lot of problems installing it and it was quite expensive too, so we went back to using our in-house test: not adaptive, but no technical problems. It's the usual kind of 50-question grammar MCQ test. It takes about 30 minutes to complete, but we don't get many complaints about the time it takes.

Cheers,

Glenys

In reply to Glenys Hanson

Re: Computer adaptive test with item response analysis

by John Benn -

Hi Glenys

I know your message wasn't addressed to me, but as it relates to my original post I thought I would respond to it. I too looked at the OUP test, but since they only allow it on workstations and not on a server it is of no use to us. We have some 2000 test takers a month on our current test. The vast majority, of course, don't come to us - we regard it as a kind of community service to the many independent teachers who get their students to take it. This is good for us too, as we provide teacher training services as well. Anyway, this is why a third-party test that we can't host ourselves is not what we need. I was interested to hear that you had problems installing it, though. I have suggested it to some friends in the past - perhaps I shouldn't in the future. Thanks

In reply to John Benn

Re: Computer adaptive test with item response analysis

by Mark Datz -
In reply to Mark Datz

Re: Computer adaptive test with item response analysis

by Tim Hunt -

Is that the right URL? It seems to go to a page of search results.

In reply to Tim Hunt

Re: Computer adaptive test with item response analysis

by Mark Datz -

Oops, sorry about that, I'll try again

http://tinyurl.com/2canpv2

In reply to Mark Datz

Re: Computer adaptive test with item response analysis

by Tim Hunt -

Thanks, that works. Also, thank you for posting a link to a paper here. I feel it would raise the level of debate here if more people provided links to research to back up their arguments. (Of course, it would be nice if more research were open-access.)

I have very mixed feelings about CAT.

On the one hand, as a mathematician I think that the statistical modelling behind it is interesting - it bears some relation to the modelling behind ELO chess ratings and the equivalent for Go. And as a software developer and educational technologist, I think it would clearly be an interesting thing to implement and use.
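
For anyone curious about that parallel: in the Rasch model (the simplest IRT model) the probability of a correct answer depends only on the gap between the person's ability θ and the item's difficulty b, and it has the same logistic shape as the ELO expected-score formula, just with a different scaling constant:

```latex
% Rasch model: person of ability \theta attempting an item of difficulty b
P(\mathrm{correct} \mid \theta, b) = \frac{1}{1 + e^{-(\theta - b)}}

% ELO expected score for a player rated R_A against an opponent rated R_B
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}
```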

However, the other part of me wonders whether the whole thing can actually work. Basically, are the underlying assumptions valid? The assumptions seem to be that

  • some important aspect of people can usefully be measured on a one-dimensional scale;
  • assessment items can be measured on the same scale; and
  • we can assemble a large-ish bank of assessment items whose position on that scale is known.

For instance, the paper you cite first says "An essential concept underlying almost all ability or attitude testing is that the abilities or attitudes can be ranked along one dimension." but then later, when talking about stopping conditions, says "The CAT test cannot stop before: ... every test topic area has been covered." How can there be more than one topic area in a unidimensional test? Oops.

I was very taken with the book http://www.dorsethouse.com/books/mmpo.html when I read it. It is basically the book version of Robert D. Austin's PhD thesis, in which he models what happens when you apply low-dimensional measurement systems to people. His conclusion is that the situation is inherently dysfunctional.

To translate that back into teachers' language, what that book says is that if all you care about is exam scores, then teachers will teach to the test, and students will study to the test, and learning suffers. I guess we all knew that, but the book goes a bit further to demonstrate that you cannot fix the problem by trying to write more sophisticated exams.

Of course, almost all testing is one-dimensional, and so we have to be very careful how we apply it and interpret the results, but CAT seems to rely on the one-dimensional assumption far more, and so be more vulnerable to breaking in the real world.

A lot of the advantages given in the paper for CAT would apply equally well to a plain Moodle quiz. That is, they are noting the advantages of the C, and not necessarily of the A bit.

Still, even with all my concerns, I think it is a cool idea, and I would like to see a CAT module for Moodle. Also, it clearly merits more research. Like all tools, there are probably times when it is the best tool for the job, and plenty of other times when it is inappropriate.

P.S. I LOLed at this sentence: "Fortunately, the description Schoonman gives of his complicated Bayesian algorithm is sufficiently opaque to inhibit others from trying to copy this part of his work."

In reply to Tim Hunt

Re: Computer adaptive test with item response analysis

by John Benn -

What you say is interesting and suggests that perhaps I should consider a different approach. If I explain in a little more detail our aims for the test, perhaps you could comment on the appropriateness of the CAT test for this purpose?

First of all, about the test dictating what is taught - this problem is well known to us and is definitely a huge problem in the state system in the country in which I work. Unfortunately, I am not in a position to change this. Our aims are somewhat different from the state system's, as we are involved in teaching language for real use. Having said that, we do, on occasion, teach for international examinations (TOEFL, IELTS, CPE, etc.); however, the vast majority of our students have little or no interest in taking such examinations.

The test I am interested in developing is not an achievement test - that is, it does not set out to test what has been taught. Rather, its sole purpose is to identify what a candidate is capable of doing in the target language and compare that against criteria laid down in the Common European Framework of Reference for Languages (CEFR). These are essentially a series of 'can do' statements which have been grouped into bands. Here is a link to a document where you will see an example of these.

http://www.coe.int/T/DG4/Portfolio/?L=E&M=/main_pages/levels.html

Our purpose is not to compare one student with another but to assign students to the CEFR bands. While we recognise that language learning is a continuum and not really a series of discrete 'levels', the reality is that language teachers have to sort students into rough levels in order to teach them effectively. Indeed, not doing this is the main reason for failure in the state system. So, since our aim is a rough sort, and since we do have clear communication-based objectives and several thousand tried and tested questions available, do you think a CAT is viable? I should mention that all students, apart from absolute beginners, are also given a separate face-to-face speaking test as part of the placement procedure.

I really am grateful for your input on this. Thanks

In reply to John Benn

Re: Computer adaptive test with item response analysis

by Mark Datz -

I would suggest starting with Tim McNamara's book "Measuring Second Language Performance".

 

Next, Gabor Szabo's "Applying Item Response Theory to Language Test Item Bank Building"

 

Plus Bond and Fox's "Applying the Rasch Model" for an introduction to IRT.

 

Peter Tymms of Durham University has developed CAT software that should suit your needs

http://www.dur.ac.uk/education/staff/?id=640

 

Assessment Systems have commercial software that might be worth checking out

http://www.assess.com/xcart/product.php?productid=273

In reply to John Benn

Re: Computer adaptive test with item response analysis

by Tim Hunt -

I suppose this might be a situation where CAT would work: you have students whose initial ability you know very little about, and you want to place them very roughly along a very long ability scale. In that case, you probably do need questions that test at a very wide range of levels, and great accuracy is not important.

On the other hand, in your case, you are trying to group students into one of only six levels. You could handle that with, say, three standard Moodle quizzes: an Easy, a Medium and a Hard one. Start with a welcome screen that asks the student to guess their own ability and then click a link to the corresponding test. Let us suppose they guess they are Medium ability. Then they do that test. If they do quite well, rate them B2; if they do less well, rate them B1; if they do really badly, give them a link to the Easy test and ask them to do that, which will try to rate them as A1 or A2 (and if they get a really high mark on that, you might give them a link back to the Medium test!). Similarly, if a student does really well on the Medium test, you give them a link to the Hard test.
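
Just to make the routing concrete, here is a rough sketch of that decision logic in Python. The quiz names, cut-off percentages and band boundaries are all invented for illustration; you would tune them to your own questions:

```python
# Sketch of the three-quiz routing described above. The quiz names, cut-off
# percentages and CEFR bands are invented; tune them to your own item bank.

def route(quiz, score_percent):
    """Return ('band', level) or ('retake', next_quiz) for a finished quiz."""
    if quiz == "easy":
        if score_percent >= 80:
            return ("retake", "medium")       # suspiciously good: try Medium
        return ("band", "A2" if score_percent >= 50 else "A1")
    if quiz == "medium":
        if score_percent >= 80:
            return ("retake", "hard")
        if score_percent >= 50:
            return ("band", "B2")
        if score_percent >= 25:
            return ("band", "B1")
        return ("retake", "easy")             # did really badly: drop to Easy
    if quiz == "hard":
        return ("band", "C2" if score_percent >= 50 else "C1")

print(route("medium", 62))   # ('band', 'B2')
print(route("medium", 15))   # ('retake', 'easy')
```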

You can set all this up in Moodle using the 'Overall feedback' fields on the quiz settings form. Although these look like plain text boxes, you can actually type any HTML (like links) in there.

In reply to Tim Hunt

Re: Computer adaptive test with item response analysis

by Mark Datz -

Tim, you raise some very important points which should be thought through carefully before attempting to create a CAT module.

 

Mike Linacre, the author of the paper linked to above, is the programmer behind the Rasch analysis packages "Winsteps" and "Facets", details at www.winsteps.com. Rasch analysis is related to a family of psychometric models called item response theory, or IRT for short. These are very widely used, see for example this paper by Edward Wolfe

http://tinyurl.com/2c2mogr

 

Unidimensionality has been argued about for decades in the psychometric literature. Basically, in order to give a meaningful summed score from a test, all items must have a shared underlying unidimensional trait. Even in a very carefully developed test, there will be other small dimensions within the data. Ideally every question will contribute to the main factor plus a very small unique factor, so technically there will be as many dimensions as there are questions. The question is thus not whether a test is perfectly unidimensional (only a test with a single question will meet that standard), but whether the main trait is sufficiently large that the dataset is "usefully" unidimensional or "essentially" unidimensional. If it is not, then you either need to report the results as two or more separate scores (which you cannot meaningfully add together) or use a multidimensional model such as Wu and Adams discuss:

http://tinyurl.com/2co57e9
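
As a crude illustration only, and not a substitute for the dimensionality analyses just mentioned: one quick way to get a feel for whether a single factor dominates a response matrix is to compare the first and second eigenvalues of the inter-item correlation matrix. The sketch below is just that heuristic, nothing more:

```python
# Crude eigenvalue heuristic for "does one factor dominate?". This is NOT a
# substitute for proper dimensionality analysis (Rasch residual analysis,
# factor analysis, etc.); it is only meant to illustrate the idea.
import numpy as np

def first_to_second_eigenvalue_ratio(responses):
    """responses: students x items array of 0/1 scores (no zero-variance items)."""
    corr = np.corrcoef(np.asarray(responses, dtype=float), rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]   # largest first
    return eigvals[0] / eigvals[1]

# A large ratio suggests a single dominant trait; a ratio near 1 suggests
# several competing dimensions. Where to draw the line is a matter of judgement.
```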

 

Multidimensional models are complex, controversial, and need extremely large datasets to give any advantage over unidimensional models, so these are utterly impractical for Moodle. Basically, if you want to implement CAT, an assumption that you can't avoid is that all your test questions must largely measure the same unidimensional trait. Many classroom assessments will not meet this standard, so CAT will only be of benefit for users with some grounding in psychometrics and test development. An analogy is comparing a rocket with a bicycle: rockets are faster, but they require users to have a bit more technical know-how before they are actually beneficial.

 

Another fundamental assumption of the IRT model Linacre uses is that persons and items can be mapped onto a single invariant scale. This is essentially built into the definition of measurement underlying that model, so if you don't accept that assumption, you need a different model. Unfortunately, rejecting that assumption implicitly rejects CAT, which requires that persons and test items can be mapped onto a common invariant scale of measurement - otherwise, what basis can you use for matching persons and items?

 

Those are theoretical issues that have been argued about for decades in the psychometric literature. Personally, I'm satisfied that the theoretical objections to CAT have been adequately addressed, but I'm unconvinced that a Moodle CAT module is a practical undertaking. CAT is of questionable value for classroom-level testing: you really need a minimum of 1000 students to make it work, ideally considerably more, so it's not clear to me that there would be much genuine demand for it.

 

A related problem is that you need very large item banks for CAT to be effective. Test security is critical, so you can't recycle questions too frequently if you want to minimise the chance of students getting the same questions that their friends got in an earlier administration. Although it is apparently possible to pilot CAT with as few as 200 questions, 2000 or more questions seems to be a more realistic minimum, plus continual development of new questions to replace older ones as they become compromised. In other words, if you were to use it for a high-stakes test, you are probably going to need to write 2000 high-quality questions in the first year, and then another 100 per month for as long as you use the test. Classroom teachers rarely have training in writing test questions, so the majority of teacher-written items will probably not perform consistently enough to be suitable for CAT.

 

Finally, there are some basic technical issues with trying to administer CAT over a network rather than as a local installation. Security is an obvious one: high-stakes CAT just doesn't work if the item bank isn't secure, and I don't see the point of trying to use CAT for low-stakes testing. Another is that every time a student responds to a CAT question, an algorithm must be run to decide whether to administer another question and, if so, which question is most suitable to administer next. If all that processing is done by a central server, then performance will suffer during periods of high demand, or it may crash the server completely. Students who take the test during busy times may thus be penalised unless some thought is given to these issues.

 

On the whole, CAT is a technically fascinating development, but for most testing purposes it strikes me as a solution looking for a problem. It's not something I would rush into without a great deal of consideration.

In reply to Mark Datz

Re: Computer adaptive test with item response analysis

by Tim Hunt -

That is a very good summary of why to be cautious about CAT. Thank you. I can't actually think of anything to add!

In reply to Tim Hunt

Re: Computer adaptive test with item response analysis

by Tim Hunt -

Actually, I did think of something I wanted to add to this thread.

So far, most of the discussion above has focussed on the 'classic' CAT activity: "Determine this student's ability, measured on this scale, by asking as few questions as possible."

However, that is not the only thing you can do with CAT. Remember, the general scheme is:

  1. Based on the questions the student has already answered, and their responses, decide which question to ask next.
  2. Pose them that question, record their response, and grade it.
  3. Decide whether to stop now, or go back to step 1.

That is the general scheme, but the decisions at steps 1 and 3 can be made in many different ways depending on what you would like to achieve. Here are some activities you could create using it:

  • A revision activity, where the student gets shown questions completely at random from the course question bank (that they have not done already) until they choose to stop themselves.
  • A vocabulary learning tool, that keeps asking the questions that test vocabulary knowledge until the student has got them all right, focussing the re-testing on the questions that student has got wrong most frequently.

It is applications like these that make me interested in seeing a CAT module for Moodle. From a software point of view, it should be implemented as an activity module that follows the basic scheme, with sub-plugins for the different ways you can implement the decisions at steps 1 and 3.
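
In code, that separation might look something like the sketch below. The names and the example strategies are invented purely to illustrate the split between the fixed loop and the pluggable decision points; it is not an existing Moodle API:

```python
# Sketch of the general scheme as a fixed loop with swappable decision functions:
# choose_next implements step 1 and should_stop implements step 3. Everything
# here is invented for illustration; it is not an existing Moodle API.
import random

def run_activity(bank, ask, grade, choose_next, should_stop):
    """bank: list of question dicts; ask(question) -> the student's response."""
    history = []                                    # (question, correct) pairs
    while not should_stop(bank, history):           # step 3, checked up front
        question = choose_next(bank, history)       # step 1
        if question is None:
            break                                   # nothing left to ask
        correct = grade(question, ask(question))    # step 2: pose, record, grade
        history.append((question, correct))
    return history

# Example strategy pair for the "revision at random" activity above:
def random_unseen(bank, history):
    seen = {q["id"] for q, _ in history}
    remaining = [q for q in bank if q["id"] not in seen]
    return random.choice(remaining) if remaining else None

def never_stop_automatically(bank, history):
    return False   # in a real module the student would click a "finish" button
```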

In reply to Tim Hunt

Re: Computer adaptive test with item response analysis

by Mark Datz -

Maybe best not to call it CAT (Computer Adaptive Test). This is a technical term.  Your proposals have merit in my opinion, but they're not what testing people mean by CAT (they're not really adaptive) so maybe it's better to call them something different.

In reply to Mark Datz

Re: Computer adaptive test with item response analysis

by Tim Hunt -

Good point. Any suggestions for a good name?

  • Question practice
  • Adaptive questioning
  • ...

Surely someone can do better than that.

In reply to Tim Hunt

Re: Computer adaptive test with item response analysis

by Joseph Rézeau -

Tim:

[...] A vocabulary learning tool, that keeps asking the questions that test vocabulary knowledge until the student has got them all right, focussing the re-testing on the questions that student has got wrong most frequently.

There is a 3rd-party module which does just that, based on matching cards: Flashcard by Valery Fremaux.

Joseph

In reply to Joseph Rézeau

Re: Computer adaptive test with item response analysis

by Tim Hunt -

Yes, but flashcard does not use questions from the question bank. I had in mind a module that did. I think there is a place for both options.

In reply to Tim Hunt

Re: Computer adaptive test with item response analysis

by Joseph Rézeau -

Tim:

flashcard does not use questions from the question bank

Yes it does. It uses matching questions from the question bank. At least it did last time I tried it. Not sure if Flashcard is actively maintained, though.

In reply to Tim Hunt

Re: Computer adaptive test with item response analysis

by ben reynolds -

One obvious use of this strategy is a math diagnostic placement test.

Start with adding, then subtracting, multiplying, dividing, then get to the uglier stuff, etc. In fact, it would be most fun to design a diagnostic placement test that worked in reverse order of difficulty.

4 digit addition, can you do it? No. Test again. No.
3 digit addition, can you do it? Yes. Test again. Yes.

3 digit subtraction, can you do it? Yes. Test again. Yes.
4 digit subtraction, can you do it? Yes. Test again. No.

Good enough, next subject.

You can tell I'm not a math guy because I'm pretty sure there's no difference between 2 digit, 3 digit and 4 digit addition/subtraction except the greater possibility of error.
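
Here is a rough sketch of that "step down until they can do it" idea in code. The two-attempts-per-level rule and the topic/level lists are taken loosely from the example above; a real test would obviously need a proper question bank for each topic and level:

```python
# Rough sketch of the "start hard and step down" diagnostic described above.
# The two-attempts rule and the topic/level lists are taken loosely from the
# example; a real test would draw questions from a bank for each level.

def place(topics, attempt):
    """topics: {topic: [levels from hardest to easiest]}
    attempt(topic, level) -> True/False. Returns the best level passed twice."""
    placements = {}
    for topic, levels in topics.items():
        placements[topic] = None
        for level in levels:                       # try the hardest level first
            if attempt(topic, level) and attempt(topic, level):
                placements[topic] = level          # passed twice: good enough
                break                              # next subject
    return placements

# e.g. place({"addition": [4, 3, 2], "subtraction": [4, 3, 2]}, attempt)
# where the levels are the number of digits in the problems.
```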