Hi Lael -
Look at the documentation. The change of scoring in 2.6 was just a matter of scaling: certainty expressed as C=1,2,3 now gives marks 1,2,3 if correct and 0,-2,-6 if incorrect. In earlier versions the default marks in the unpatched code were 0.33,0.67,1 or 0,-.67,-2 unless you altered the weight of every question to x3. Integer marks make CBM simpler and more intuitive. More fundamental though is the inclusion of proper feedback in 2.6 (or with the patches), and presentation of the scores in a more sensible way than just as a % of maximum possible marks (see the long discussion here).
The way to deal with students who don't feel confident enough to be sure they are giving a correct answer is to encourage them to start with C=1 (1/0 marking) and change to C=2 when they are a bit confident (2/-2 marking) without too big a risk and with the prospect of doubling their marks if correct. When they come across Qs where they are really sure (which of course they will do), then they can use C=3 (3/-6 marking). This will encourage them to check answers and think more about them. -6 means you've claimed to be sure of something incorrect - a reality check and wake-up call. Immediate feedback is very important when starting to use CBM. Feedback at the end shows how well a student is using the options, and final scores include a bonus for effective use, added to the conventional accuracy. Typically, university students give about 40% of their answers at C=1, 20% at C=2, 40% at C=3, depending of course on how difficult the Qs are. There's nothing special about learning medical science - just lots to learn, and there will always be some bits you know and some you don't. It's important to acknowledge and get right the difference.
CBM actually increases the statistical reliability of final scores by up to a factor of 1.5 or more. This means that you don't need so many Qs to get scores that correlate well with student ability. It's important however, before students use CBM in summative (or otherwise important) tests that they get plenty of practice in self-tests for their own benefit, so they get the hang of it. 20 would seem a reasonable test length, though luck can obviously still play a significant part. I'm more interested myself in using tests to encourage constructive thinking and self-assessment than for pass/fail issues - though I acknowledge these are necessary. Lots of issues and discussion are covered in my publications here .