word count mismatch MS word, moodle - online assignments

by steve miley -
Number of replies: 7

Hi - I'm wondering if I can get any other thoughts on whether this might be a bug that I should submit to the moodle issue tracker.  Thank you ahead of time for your thoughts.

The Moodle function count_words is counting the HTML entity "&nbsp;" as a word, so the word count differs between MS Word and Moodle.

In the online assignment, Moodle shows the student and the instructor a word count. Instructors often impose word limits on an assignment, and many students copy and paste from Word. Unfortunately, this can produce "noisy/busy" HTML in the paste. There can be many "&nbsp;" entities, which are just white space in an HTML document. The current Moodle word count in 1.9 and 2.x counts these as words. Generally there might be 10 or 20 of these, but we've seen up to 1000 (bumping a 500-word document to a 1500-word document).

The fix is easy, just 1 line in moodlelib.php.

$string = preg_replace("/&nbsp;/", " ", $string);
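
To make the effect concrete, here is a minimal sketch. The function naive_count_words is a hypothetical stand-in that uses the \w\b splitting approach mentioned later in this thread; it is an assumption for illustration and not necessarily the exact code in moodlelib.php. patched_count_words applies the one-line replacement before counting.

```php
<?php
// Hypothetical stand-in (an assumption, not copied from moodlelib.php):
// counts every position where a word character is followed by a word boundary.
function naive_count_words(string $string): int {
    $string = strip_tags($string);
    return count(preg_split('/\w\b/', $string)) - 1;
}

// The same counter with the proposed one-line fix applied first.
function patched_count_words(string $string): int {
    // Turn each literal "&nbsp;" entity into a plain space before counting.
    $string = preg_replace('/&nbsp;/', ' ', $string);
    return naive_count_words($string);
}

// A small sample of the kind of markup a paste from Word can leave behind.
$pasted = '<p>Hello&nbsp;&nbsp;world,&nbsp;this is a test.</p>';

echo naive_count_words($pasted), "\n";   // each "nbsp" is counted as a word
echo patched_count_words($pasted), "\n"; // only the visible words are counted
```

In the sample, the stand-in counts each "nbsp" as a word, so three entities inflate a six-word sentence to nine.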

 

Thoughts on submitting this as a bug?

 

Steve

In reply to steve miley

Re: word count mismatch MS word, moodle - online assignments

by Richard Oelmann -

Oh I so want to suggest filing this as a bug report - with Microsoft!!! Their 'dirty' HTML causes issues when people want to copy and paste in all sorts of situations ;-)

Richard

In reply to steve miley

Re: word count mismatch MS word, moodle - online assignments

by Ratana Lim -

The problem with splitting on just a word character and word boundary (\w\b) is that it also splits contractions like "can't" and "don't", which results in more words than there should be. The count_words function is too simplistic and possibly only works with Western languages; I'm not sure how to improve the implementation for different languages. Here's my hack, which may work a little better.

/**
 * Count words in a string.
 *
 * Words are defined as things between whitespace.
 *
 * @param string $string The text to be searched for words.
 * @return int The count of words in the specified string
 */
function count_words($string) {
    // Decode HTML entities, then strip the tags.
    $string = strip_tags(html_entity_decode($string, ENT_QUOTES, 'UTF-8'));
    // Split on runs of whitespace; a limit of -1 means "no limit".
    $splitted = preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);
    $counted = count($splitted);
    foreach ($splitted as $v) {
        // Catch a missing space after punctuation, e.g. "word,word".
        $joined = preg_split('#[,;\.?!\/]#', $v, -1, PREG_SPLIT_NO_EMPTY);
        if (count($joined) < 2) {
            continue;
        }
        if (empty($joined[0]) || empty($joined[1])) {
            continue;
        }
        // Count one extra word when the second fragment contains a word character.
        if (preg_match('/\w/', trim($joined[1]))) {
            $counted++;
        }
    }
    return $counted;
}
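
For anyone who wants to see the contraction problem in isolation, here is a small sketch comparing the two splitting strategies. The helper names count_by_boundary and count_by_whitespace are made up for this example and are not Moodle functions.

```php
<?php
// Hypothetical helper: the \w\b approach, which counts every word
// character that sits immediately before a word boundary.
function count_by_boundary(string $text): int {
    return count(preg_split('/\w\b/', $text)) - 1;
}

// Hypothetical helper: count the chunks between runs of whitespace.
function count_by_whitespace(string $text): int {
    return count(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}

$text = "I can't say they don't count.";

// The apostrophe creates an extra boundary inside each contraction,
// so "can't" and "don't" are each counted twice by the first helper.
echo count_by_boundary($text), "\n";
echo count_by_whitespace($text), "\n";
```

The whitespace split keeps each contraction as one word, which is why the hack above starts from it and only then looks for punctuation inside each chunk.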
In reply to Ratana Lim

Re: word count mismatch MS word, moodle - online assignments

by Frankie Kam -

Hi Ratana

I remember spending hours trying to solve this. How difficult it is to write code to count words accurately! It's like trying to retain water that falls off a duck's back! Anyway, here's my effort, which I tested on sample paragraphs with word counts running into the tens of thousands. I compared the results with MS Word's (in)famous word count on the status bar, and the results weren't too far off.

To find out how I did it, surf to: http://moodurian.blogspot.com/2012/04/spicing-up-forum-activity-for-moodle.html which then leads you on to: http://www.moodlenews.com/2012/spicing-up-the-forum-module-for-moodle-1-9/

Regards
Frankie Kam

In reply to Frankie Kam

Re: word count mismatch MS word, moodle - online assignments

by Nicholas Walker -

Hi Frankie,

Your forum word-count hack is very useful. Is it possible to display a total word count for each student? Or even better, a word count plus a ranking to see who is the most verbose? It would be a nice addition to the profile page.

Best wishes,

Nick 

In reply to Ratana Lim

Re: word count mismatch MS word, moodle - online assignments

by Frankie Kam -

Hi Ratana!

Your 'hack' didn't work a "little better". It worked a whole lot better!
I copied and pasted the entire Top Gun movie (Tony Scott, may his soul rest in peace...) transcript from: http://www.script-o-rama.com/movie_scripts/t/top-gun-script-transcript-cruise.html, and removed the double quotes in the transcript. Then I tested the whole thing with a PHP file (see attached try.php).

Guess what? MS Word reports a word count of 6,601.

Your wonderful hack reports 6,603.

That's an increase of just 1 word for every 3,000 words or so. Not bad, not bad at all. I'm impressed.

And then I tested it on the Gone With The Wind film transcript (15,942 words in MS Word!), and the function returns a count of 16,000 words. Hmmm...

In reply to Frankie Kam

Re: word count mismatch MS word, moodle - online assignments

by Ratana Lim -

Hi Frankie,

Thank you for testing out the hack.  I've only done a few tests.  Something to consider in the hack is the word split pattern.  The pattern  

\s+\b

seems to work a little better than just \s+. Also, the pattern in the line

 $joined = preg_split('#[,;\.?!\/]#', $v, null, PREG_SPLIT_NO_EMPTY);

which detects spacing errors such as a missing space after a comma, for example, could be improved. In some situations it may be necessary to catch dashed words that are really two or more words, i.e. "up-to-date". Another case is the long dash (em dash) used in place of commas to emphasize a phrase.
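
A rough sketch of these ideas, for comparison only. The helper count_tokens is hypothetical, and treating long dashes as word separators while leaving hyphenated words whole is an assumption of this example, not a statement about how MS Word counts.

```php
<?php
// Hypothetical helper, not Moodle code: count the chunks produced by a
// caller-supplied split pattern.
function count_tokens(string $text, string $pattern): int {
    // Assumption: en (U+2013) and em (U+2014) dashes separate words,
    // while ordinary hyphens inside a word are left alone.
    $text = preg_replace('/[\x{2013}\x{2014}]/u', ' ', $text);
    return count(preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY));
}

// "\u{2014}" is an em dash written with no surrounding spaces.
$sample = "An up-to-date list\u{2014}used in class - helps.";

echo count_tokens($sample, '/\s+/'), "\n";   // the lone "-" becomes its own chunk
echo count_tokens($sample, '/\s+\b/'), "\n"; // the lone "-" stays attached to "class"
```

With \s+\b, a split only happens where whitespace is followed by a word character, so stray punctuation surrounded by spaces no longer adds to the count.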

There's probably a better way/logic than just pattern splitting...I'll leave that to the code gurus.

In reply to steve miley

Re: word count mismatch MS word, moodle - online assignments

by Michael Litzkow -

We too have been struggling with the mismatch between the MS Word and Moodle word counts. The discussion and suggested code changes here are very interesting, but we would much prefer that core Moodle code handle this rather than adding another hack to our installation. So, getting back to the original question, I would say yes, absolutely, please do submit this as a bug. Also, I understand and agree about text pasted from MS applications being a mess, but that's the world we live in. I think we need to do the best we can with it, as getting MS to change their minds is unlikely.

-- mike