Character encoding to UTF-8


by John Papaioannou -
Hi all,

Yesterday I had to solve the "Greek characters do not get displayed in user activity graphs" problem within an extremely tight deadline, and it was rather... distressing. But in the process, I found out something which may be of interest for the Moodle core.

After poking around, the solution turned out to be creating a lang/el/fonts/lang_decode.php file with a function like this:

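function lang_decode($s) {
    $len = strlen($s);
    $out = '';
    for ($i = 0; $i < $len; $i++) {
        $ch = ord($s[$i]);
        // Bytes above 128 are ISO-8859-7 characters: emit them as numeric
        // entities shifted by 720; pass ASCII through unchanged.
        $out .= ($ch > 128 && $ch < 256) ? '&#'.(720 + $ch).';' : chr($ch);
    }
    return $out;
}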

This solved the problem, although the things I did to find out that 720 magic number would be amusing, if it were someone else doing them. However, it was later pointed out to me that there is a very nice iconv() function which would simplify matters:

function lang_decode($s) {
    return iconv('ISO-8859-7', 'UTF-8', $s);
}
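
For what it's worth, the 720 seems less magic than it looks: in ISO-8859-7 the Greek letters sit in the same order as the Unicode Greek block, and capital alpha is byte 0xC1 (193) versus U+0391 (913). Here is a minimal sketch to check that, assuming the iconv extension is available; the helper name greek_codepoint is made up for illustration and is not part of Moodle:

```php
<?php
// Hypothetical helper, not part of Moodle: returns the Unicode code point
// that iconv() assigns to a single ISO-8859-7 byte.
function greek_codepoint(int $byte): int {
    $utf32 = iconv('ISO-8859-7', 'UTF-32BE', chr($byte));
    return unpack('N', $utf32)[1];
}

// Capital alpha: byte 0xC1 in ISO-8859-7, U+0391 in Unicode.
$alpha = greek_codepoint(0xC1);
echo $alpha - 0xC1, "\n";   // the difference is the 720 offset

// Not every high byte follows the rule: 0xA3 is the pound sign, U+00A3.
$pound = greek_codepoint(0xA3);
echo $pound - 0xA3, "\n";   // difference is 0, so adding 720 would be wrong
```

So the hand-rolled version looks fine for the letters themselves, but symbols such as the pound sign would come out wrong, which is one more reason to prefer iconv() where it exists.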

So I have two questions:

  1. iconv needs PHP compiled with --with-iconv. Does anyone know if this is standard? Can we count on iconv() to be present in a plain vanilla PHP installation? Personally, I don't remember.
  2. If the answer to the above is yes, Martin, since Moodle knows the source codepage from the language configuration, wouldn't it make sense to drop support for lang_decode.php and move this within the Moodle core?

Cheers,
Jon



In reply to John Papaioannou

Re: Character encoding to UTF-8

by Martín Langhoff -
Or you can drop codepages from the lang files altogether. You can convert them to UTF-8 with the command-line iconv tool.

This would be transparent for new installations. Upgrades would mean converting database content to UTF-8 wholesale. Very risky, but the issues around that conversion highlight the problem of codepages in an application like Moodle.
In reply to Martín Langhoff

Re: Character encoding to UTF-8

by John Papaioannou -
Migrating globally to UTF-8 encoding would certainly be nice, but there are some gotchas:

  1. View source? Anyone? :-)
  2. Translators will write in their native language. So there has to be some way of converting strings to UTF-8 transparently. This probably means that Moodle will need to "cache" UTF-8 encoded versions of all strings dynamically or suffer a (large?) performance hit. We need a mechanism for that.
  3. Since a simple CVS update might add (or even worse, change) language strings without changing anything else, the validity of the aforementioned "cache" will need to be checked constantly. This could cancel out the benefit of the cache, or even make things slower than having no cache at all. Thus the performance hit could be both large and unavoidable.
  4. UTF-8 encoded pages with a high percentage of user-generated content (such as forums) might noticeably increase the size of the HTTP response. Think of a large thread written in Greek, for example. Greek letters need two bytes each in UTF-8 instead of one in ISO-8859-7, so the text itself would roughly double in size.
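
A quick way to put numbers on point 4, as a rough sketch that assumes the iconv extension is present. It counts bytes for one Greek word in the legacy codepage, in UTF-8, and as the numeric entities produced by the 720-offset trick; the sample word is just an illustration:

```php
<?php
// Byte counts for the same Greek word in three representations.
$legacy = "\xEC\xDC\xE8\xE7\xEC\xE1";            // "μάθημα" in ISO-8859-7
$utf8   = iconv('ISO-8859-7', 'UTF-8', $legacy); // two bytes per Greek letter

// Numeric entities, built the same way as the hand-rolled lang_decode().
$entities = '';
foreach (str_split($legacy) as $byte) {
    $entities .= '&#' . (720 + ord($byte)) . ';';
}

echo strlen($legacy), ' ', strlen($utf8), ' ', strlen($entities), "\n";
// prints "6 12 36"
```

Markup and ASCII text are unaffected, so how much a whole page grows depends on how much of it is Greek.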

I don't think it's easy to overcome these problems... anyway, my initial query had to do with converting the strings that go into graphs, not the whole of Moodle.

Jon
In reply to John Papaioannou

Re: Character encoding to UTF-8

by Martin Dougiamas -
No, iconv isn't standard, unfortunately. I looked into this a long time ago which is why that lang_decode function is there, though I never got it functioning for Thai at the time.

Good to hear it works well for Greek! What you need in there is just a check for iconv() and then perhaps your original conversion code can be used if the function doesn't exist.
In reply to Martin Dougiamas

Re: Character encoding to UTF-8

by John Papaioannou -
Well then, would it be acceptable if any relevant Moodle code were changed to fall back on iconv() if there is no lang_decode.php? Assuming that function_exists('iconv') of course. We know the source codepage from the language settings, so that's no problem.
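
Something along these lines, as a rough sketch only. The wrapper name graph_string_decode and the $charset parameter are made up for illustration; this is not actual Moodle code:

```php
<?php
// Hypothetical wrapper: prefer a language pack's own lang_decode(),
// then fall back on iconv(), then return the string unchanged.
function graph_string_decode(string $s, string $charset): string {
    if (function_exists('lang_decode')) {
        // A lang/xx/fonts/lang_decode.php was included and supplies its own conversion.
        return lang_decode($s);
    }
    if (function_exists('iconv')) {
        $converted = iconv($charset, 'UTF-8', $s);
        if ($converted !== false) {
            return $converted;
        }
    }
    return $s; // no conversion available
}

// 0xE1 is lowercase alpha in ISO-8859-7.
echo graph_string_decode("\xE1", 'ISO-8859-7'), "\n";
```

The source codepage would come from the language settings rather than being hard-coded as it is in this example call.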

In reply to John Papaioannou

Re: Character encoding to UTF-8

by Martin Dougiamas -
Good idea - done in CVS. :-)

All we need to do now is add Unicode TTF fonts into lang/xx/fonts/default.ttf. These are too big (and legally questionable) to add to the Moodle distribution, but anyone running a site can do it.

I copied c:\windows\fonts\mshei.ttf from Windows into moodle/lang/zh_cn/fonts/default.ttf and it worked. :-)
Attachment zh_cn.png
In reply to John Papaioannou

Re: Character encoding to UTF-8

by Richard Ramsay -
Jon,
Can you help me? I'm having problems with Greek fonts in test questions. I want to mix Spanish and Greek in the questions. For example, "Select the correct translation for each of the following words," followed by a list of vocabulary in Greek. But no matter what I do to the HTML code, in some of my questions (it seems unpredictable) the whole question is displayed either in Spanish or in Greek fonts. I can't mix them. Even "puntos" appears in Greek fonts. I am using IE 7.0. The questions look fine in Firefox 5.0, but not in Internet Explorer.