Moving Moodle to Unicode

by Martin Dougiamas -
Number of replies: 26
Unicode is obviously the future, providing a single way to represent all characters rather than the mish-mash of incompatible encodings we have at the moment.

We need to start thinking about moving to Unicode, with or without proper PHP support for it. Having a consistent encoding will make it easier to include multiple languages on one page, and simplify a lot of issues when dealing with text.

The problem is that it appears any conversion process is going to be slightly less than 100% exact, and the amount of data being generated every day in Moodles around the world under old encoding rules is a bit staggering (the problem grows over time).

What we need is an implementation plan ... a step by step analysis of how we are going to convert old texts, modify core Moodle code, and ensure that all new text is always in UTF-8. I have a sense of what is required from pages like these:

but does someone want to tackle a proper roadmap for us and give us all a realistic perspective on this job?
In reply to Martin Dougiamas

Re: Moving Moodle to Unicode

by Vu Hung -
I quite agree. I'm Vietnamese, and I ran into real difficulties during the localization process because of UTF-8 problems.
I wonder if I can give you a hand with moving Moodle to Unicode.
In reply to Martin Dougiamas

Re: Moving Moodle to Unicode

by Thomas Robb -
I recently posted a message in the "Moodle for Language Teaching" area concerning problems I was having getting French and Japanese to coexist in the same text.

http://moodle.org/mod/forum/discuss.php?d=18891

I've since discovered that everything works fine *if* I use the plain text editor, but that the WYSIWYG editor mangles text when things are saved.

One priority might be to get it to work correctly.

I would hope that, once everything does go Unicode, there will be an automatic function that goes through the entire database and converts all text fields to Unicode from whatever they currently are. Hmm, I wonder if this is possible since the text fields themselves don't say what the encoding is; that information is in the course preferences, right?
In reply to Thomas Robb

Re: Moving Moodle to Unicode

by Martin Dougiamas -
Exactly right, the texts don't contain their encoding (not explicitly anyway) which is why an automatic conversion is going to have to make a lot of educated guesses.

I'm not sure if that problem you experienced is actually caused by the editor ... the browser doesn't do any encoding itself AFAIK, but this is exactly the kind of problem we need native Unicode support to solve. Just changing the encoding string isn't enough, as you've found.
In reply to Martin Dougiamas

Re: Moving Moodle to Unicode

by koen roggemans -
Maybe something is possible with the restore function.
I guess in a backup the encoding is known for every text?
In reply to Thomas Robb

Re: Moving Moodle to Unicode

by Bill Burgos -
Tom,

I don't know if this will help. You might give it a try. I found it on a homepage quite a bit back. I was using Mambo and PhpWebedit, both were mangling the Japanese.

When I put this in the root folder .htaccess file of the Mambo installation that I wanted to present in Japanese, I was able to input Japanese and the result came out O.K.:

<IfModule mod_php4.c>
php_value default_charset UTF-8
php_value mbstring.language Japanese
php_value mbstring.detect_order auto
php_value mbstring.http_input pass
php_value mbstring.http_output pass
php_value mbstring.internal_encoding UTF-8
php_value mbstring.substitute_character none
php_value mbstring.func_overload "0"
php_flag mbstring.encoding_translation Off
</IfModule>

You might want to give it a try and it might be a starting point.

Bill
In reply to Martin Dougiamas

Re: Moving Moodle to Unicode

by Martín Langhoff -

A few early notes on this, based on my experience unicode-ifying the Midgard CMS/CMF (as part of the NZ's e-govt portal). This sounds like a plan, but it's a very draft thing, and in the end, a suggestion. Chime in, shred it to pieces.

Unicode-clean HTML

Getting Moodle itself to be Unicode-clean in the handling of forms and character entities. This is relatively easy, and involves making sure we have the right encoding in our HTTP headers, and that htmlentities() and friends are always called with charset. At this stage we should also convert all our strings files (iconv is your friend).

With this, we ensure that Moodle+MySQL will store/retrieve the data correctly, even if neither is too aware of the fact that we are using utf-8. Some string operations may not work as expected on non-ASCII texts, but different languages can share the same HTML page. This will sort out new installations.

Working with Te Reo Maori, we found that string ops worked mostly OK on utf-8 text, as you'd expect. Regexes that try to match [:alpha:] and similar stuff will get in trouble, as will strtopos() and length. It isn't that bad though, and I hope PHP5 is smarter about this. Writing replacements for these functions that support unicode in PHP is hard and error-prone, and they will be awfully slow. This is especially true of utf-8. Not long ago I wrote some Perl routines to find invalid utf-8 characters, and it was not fun.
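
To make the length and offset problem concrete, here is a minimal sketch in Python (not PHP, and not anything from Moodle itself). It compares counting in characters with counting in bytes for a string containing one multi-byte character, and adds a small strict validity check for UTF-8. The sample text and the helper name is_valid_utf8 are made up for illustration.

```python
# Byte-oriented vs. character-oriented operations on the same text.
text = "Te Reo M\u0101ori"          # "Te Reo Māori"; U+0101 is a-with-macron
utf8_bytes = text.encode("utf-8")

char_count = len(text)              # counted in characters (code points)
byte_count = len(utf8_bytes)        # counted in bytes; U+0101 takes two bytes

# Offsets diverge after the first multi-byte character.
char_offset = text.rindex("o")          # position of the last "o" in characters
byte_offset = utf8_bytes.rindex(b"o")   # position of the last "o" in bytes


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (strict mode)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A byte-oriented length function sees 13 units here where a reader sees 12 characters, which is the kind of mismatch that truncation and substring code would need to watch for.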

Skip the Database for now …

There is one more important step, which is telling the database we are storing unicode. I think it's really hard, and (most importantly) I am not sure it will have many benefits to Moodle. More on that later. It may make sense to deal with migration of content at this stage, depending on the release cycles, and what we find about the database-level migration.

Convert existing content

So, in this weird plan outline, there are two migrations. The first migration should read all the varchar fields and iconv them (we'll need the iconv PHP extensions installed to upgrade). As discussed elsewhere in this thread, we'll probably need to guess what the encoding is likely to be, from a combination of site lang, user lang, course lang. Perhaps some module-specific data may need further logic to handle it. Still, I think we should keep the original in a backup table (indexed on an md5 hash of "$tablename-$id-$fieldname") should we need to reconvert.
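
The guess-and-keep-a-backup idea could be sketched roughly as follows. This is an illustration in Python rather than the PHP/iconv code an actual upgrade would use, and everything named here (backup_key, to_utf8, the candidate list, the table and field names) is hypothetical. It assumes the candidate encodings have already been derived from site, user and course language, and that a catch-all single-byte encoding such as ISO-8859-1 goes last because it accepts any byte sequence.

```python
import hashlib


def backup_key(table: str, row_id: int, field: str) -> str:
    """md5 of "tablename-id-fieldname", usable as an index into a backup table."""
    return hashlib.md5(f"{table}-{row_id}-{field}".encode("utf-8")).hexdigest()


def to_utf8(raw: bytes, candidates: list) -> tuple:
    """Try each candidate encoding in order; return (utf8_bytes, encoding_used)."""
    for enc in candidates:
        try:
            return raw.decode(enc).encode("utf-8"), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits this value")


# Hypothetical row: a forum post body stored as EUC-JP.
original = "日本語".encode("euc_jp")
backups = {backup_key("forum_posts", 42, "message"): original}  # keep the original
converted, guessed = to_utf8(original, ["utf-8", "euc_jp", "iso-8859-1"])
```

Trying strict UTF-8 first means text that is already converted is left alone, which matters if the migration is ever re-run. A guess that decodes without error can still be the wrong guess, which is why the untouched original is kept.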

Perhaps we'll need a means for users to identify mis-encoded texts, and self-select an alternative encoding that looks right to them. How the UI should work to be scalable and applicable to different scenarios is beyond me :)

Tell the Database … perhaps not?

Both MySQL and Postgres are used to having their VARCHAR and TEXT fields abused with weird stuff, so their handling of them is binary-safe; you can store a GIF there, and it'll work. So storing utf-8 there is safe, even if we don't tell the RDBMS. Naughty, ah!

What's the downside of not telling the database?
* String matching when running searches is limited to exact matching. Moodle is not too badly affected -- we don't use LIKE much. (With the new forum search we may be using LIKE/ILIKE more.)
* String operations will do strange things (UPPERCASE()/LOWERCASE()). Again, we don't use them.
* Sorting from the DB won't be "correct" for non-ASCII scripts. Truth is, it's really hard to get the right sorting for non-ASCII; for starters, we need to get the locale well defined. Any situation with mixed locales won't have a definite sorting.
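
A small Python sketch of what "the database only sees bytes" means for sorting and case conversion. It does not talk to MySQL or Postgres; it only mimics byte-wise behaviour on UTF-8 data, and the word list is invented for the example.

```python
# Sorting by raw UTF-8 bytes, as a charset-unaware store effectively would.
words = ["zebra", "\u00e9clair", "apple"]            # "\u00e9clair" is "éclair"
byte_sorted = sorted(words, key=lambda w: w.encode("utf-8"))
# The accented word lands after "zebra", because its first byte (0xC3)
# is larger than the byte for "z" (0x7A).

# Case conversion on bytes only touches the ASCII letters.
raw = "\u00c9COLE".encode("utf-8")                   # "ÉCOLE"
byte_lowered = raw.lower().decode("utf-8")           # the accented capital is left as-is
char_lowered = "\u00c9COLE".lower()                  # character-aware lowercasing
```

Neither result is corrupt data. The bytes stay valid UTF-8, they are just ordered and case-folded in a way a reader would not expect.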

What's the downside of actually telling the database?
* We'll bump up our requirements (MySQL 4.1 I think, should check postgres) and still, support on MySQL is limited. See in http://dev.mysql.com/doc/mysql/en/charset-unicode.html that it only supports 1-, 2- and 3-byte characters. 4-byte chars are unsupported.
* Good utf-8 support depends on the database driver too. People will need recent PHP versions with current drivers.
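
To see which characters fall outside a 3-byte limit, a check along these lines could be run over text before storing it. This is a Python sketch under the assumption that the limit is exactly "characters whose UTF-8 form needs four bytes", which means code points above U+FFFF. The function name is hypothetical.

```python
def four_byte_chars(text: str) -> list:
    """Return the characters in text whose UTF-8 encoding needs four bytes."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


euro = "\u20ac"            # EURO SIGN, three bytes in UTF-8
clef = "\U0001D11E"        # MUSICAL SYMBOL G CLEF, four bytes in UTF-8

euro_len = len(euro.encode("utf-8"))
clef_len = len(clef.encode("utf-8"))
offenders = four_byte_chars("score: " + clef + " price: " + euro)
```

Most everyday text in living languages sits at or below U+FFFF, so the limit mainly bites for rarer CJK ideographs, historic scripts and symbols.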

Finally, it can be hard to migrate to a utf-8 database encoding. Postgres seems to have better utf-8 support than MySQL -- including charset conversion functions that could save a ton of time -- but you have to completely recreate the database to change the charset. Under MySQL, charset is a table property -- we'll still have to recreate the tables.

On both databases, migration will be slow, probably should not be done via a browser due to timeouts, and will probably mean large temp files, and such. Sounds rather messy :(

Some additional resources

A brief discussion of all the steps required to get end-to-end Unicode going: http://archives.postgresql.org/pgsql-jdbc/2004-11/msg00075.php

Braindump over

Tired! I'm off to prepare 1.4.4. Hope this is of use …

In reply to Martín Langhoff

This forum post has been removed

The content of this forum post has been removed and can no longer be accessed.
In reply to Deleted user

Re: Moving Moodle to Unicode

by Martín Langhoff -
It says "buy now!" ;)

I was thinking of sorting it out with iconv, a bit of shell and perl scripting, some glue, and a lot of string... gnu string.
In reply to Martín Langhoff

This forum post has been removed

The content of this forum post has been removed and can no longer be accessed.
In reply to Deleted user

Re: Moving Moodle to Unicode

by Martin Dougiamas -
It would be a great idea if we were just translating all the texts in the Moodle code, but we also have to think about all the texts in everyone's databases, on all kinds of platforms.   For this a PHP-only solution will be best.
In reply to Martín Langhoff

Re: Moving Moodle to Unicode

by Jamie Pratt -
An alternative to using iconv would be to code our own routines for converting the strings in the db and the files in the lang/ directory to utf-8.

Since this would be a one-time process to get everything into utf-8 we could code this in PHP. It might be slow, but once it's done it's done.

From unicode.org I found a link to IBM's open source project that deals with character set conversions among other things.

http://www-306.ibm.com/software/globalization/icu/chartsdemostools.jsp

They have code conversion charts for hundreds of character sets:

http://www-950.ibm.com/software/globalization/icu/demo/converters?s=ALL

Open source code is available in C, C++ and Java, apparently under a license compatible with the GPL, so we could maybe port bits of it to PHP.

In reply to Jamie Pratt

Re: Moving Moodle to Unicode

by Jamie Pratt -
There is also this class here :

http://phpclasses.spunge.org/browse/package/1974.html

It seems that this class works with files you can download from unicode.org so that it can work out how to correctly encode / decode utf-8

Hope we do move over to utf-8 on Moodle soon as Flash uses unicode natively and so this would help with Moodle Flash integration.
In reply to Jamie Pratt

Re: Moving Moodle to Unicode

by Martín Langhoff -
Moodle requires the XML parser extensions, and those depend on expat which depends on libiconv... PHP installs that have XML support are likely to have iconv support as well. At any rate, the xml parser exposes the iconv functionality, it's trivial to pass it some string in a given charset and ask for it in utf-8.

Iconv, being widely used, is a serious alternative wrt speed and correctness. A well oiled wheel we can use :)


In reply to Martín Langhoff

Re: Moving Moodle to Unicode

by Janne Mikkonen -
If you convert this string þŒßê from iso-8859-1 to utf-8 with iconv, what will happen? In my tests it's not the same string any more after conversion....
In reply to Janne Mikkonen

Re: Moving Moodle to Unicode

by Martín Langhoff -
It shouldn't be the same string after it's converted :)

How did you create the iso-8859-1 file? How did you convert it? How did you evaluate the result? Most editors that promise unicode support don't really support much beyond western languages. Yudit is the only editor that has been useful for me when dealing with really complex scripts. It's a bit weird, but it handles all the range of unicode encodings, charsets, etc. http://www.yudit.org/
In reply to Martín Langhoff

Re: Moving Moodle to Unicode

by Janne Mikkonen -
"It shouldn't be the same string after it's converted" - I beg to differ wink though I understand what you mean by that, but the visual outcome for end user should be the same (on his/hers browser screen) after you convert string to from one encoding to another. And I'm not consern about files but the stuff in the database.

If I convert that string to entities, the outcome is the same for iso-8859-1 and utf-8, but when I do the same thing with iconv the results are not quite what I expected (no matter what internal encoding is used).

But I'm not saying that we shouldn't use iconv or mb_*. We just need to be sure that, whatever script or functions we use to convert strings from one encoding to another, data isn't lost at any point.
In reply to Janne Mikkonen

Re: Moving Moodle to Unicode

by Martín Langhoff -
At the level of the bits, I think we agree that iconv will transform the string. Now, when we talk about the visual outcome as displayed by the browser, you will need to tell the browser to interpret one page as ISO-8859-1 and the other page as UTF-8.
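
Both halves of this exchange can be shown with a short Python sketch: the bytes change, the text does not, provided each byte sequence is read with the encoding it was written in. One hedged side note on the sample string from earlier in the thread: it contains Œ, which is not part of ISO-8859-1 (it exists in Windows-1252 and ISO-8859-15). Whether that explains the test result reported above is only a guess, since the thread does not say how the test data was produced.

```python
sample = "\u00fe\u0152\u00df\u00ea"        # the four characters þ Œ ß ê

cp1252_bytes = sample.encode("cp1252")     # one byte per character
utf8_bytes = sample.encode("utf-8")        # two bytes per character here

# Different bytes, same text once each is decoded with its own encoding.
same_text = cp1252_bytes.decode("cp1252") == utf8_bytes.decode("utf-8")

# Declaring the wrong charset gives mojibake instead of an error.
mojibake = utf8_bytes.decode("cp1252")

# The string cannot be represented in strict ISO-8859-1 at all.
try:
    sample.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
```

So a byte-for-byte comparison before and after conversion is expected to fail, while a comparison of the decoded text is the check that shows whether data was lost.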

I've worked with this quite a bit, and once you place the right encoding headers in the page, the browser starts displaying it correctly (for Netscape 4.7+ and IE4+). This depends a bit on the OS having fonts that cover the characters you want to display, too. Kanji characters only show up if you have a Unicode font that includes the Kanji range.

Windows95/98 are notably slack in the unicode font department. More modern Win32 OSs, and generally any OS more recent than 2000 will have good unicode fonts.
In reply to Martín Langhoff

Re: Moving Moodle to Unicode

by Martin Dougiamas -
Is iconv standard on Windows?
In reply to Martin Dougiamas

Re: Moving Moodle to Unicode

by Tim Allen -
Hi moodlers,

Unicode is eagerly anticipated in Korea too, another multi-byte language.  I have recently published a unicode version of Korean which is on the downloads page and in CVS.  I converted the encoding using the "recode" library.

Just wondering about this custom setting in config.php:

// This setting will put Moodle in Unicode mode.  It's very new and
// most likely doesn't work yet.   THIS IS FOR DEVELOPERS ONLY, IT IS
// NOT RECOMMENDED FOR PRODUCTION SITES
//      $CFG->unicode = true;

I have tested it and not had any problems so far, despite the disclaimer.  Is it a necessary setting to enable when using unicode language packs?

I'll watch this thread with interest.

Tim.

In reply to Tim Allen

Re: Moving Moodle to Unicode

by Timothy Takemoto -

Hi Tim and everyone,

I know that this is off topic, since this is a developers' forum, and the purpose of this is to discuss ways of creating a php unicode converter, but for those that are thinking of converting their database by hand...

Mitsu, the Japanese language translator, recommends downloading the database of Moodle in EUC-JP and then opening it in an editor such as Sakura Editor, then resaving the file in UTF8, and then reimporting it to the (or preferably a new) database.
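
For dumps too large to open comfortably in an editor, the same re-save step could be scripted. The following is a minimal Python sketch, not a tested migration tool: recode_file and the file names are hypothetical, and it assumes the whole dump really is in one source encoding. Decoding is strict, so a byte sequence that is not valid EUC-JP raises an error rather than being silently altered.

```python
import os
import tempfile


def recode_file(src: str, dst: str, from_enc: str, to_enc: str = "utf-8") -> None:
    """Copy src to dst line by line, re-encoding from from_enc to to_enc."""
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding=from_enc, newline="") as fin, \
         open(dst, "w", encoding=to_enc, newline="") as fout:
        for line in fin:
            fout.write(line)


# Tiny demonstration with a made-up one-line dump.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "dump_eucjp.sql")
dst_path = os.path.join(workdir, "dump_utf8.sql")
statement = "INSERT INTO example VALUES ('日本語');\n"

with open(src_path, "wb") as f:
    f.write(statement.encode("euc_jp"))

recode_file(src_path, dst_path, "euc_jp")

with open(dst_path, "rb") as f:
    converted = f.read()
```

As with the editor approach, importing the result into a new database rather than over the old one keeps the original data available if the conversion turns out to be wrong.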

Timothy

In reply to Timothy Takemoto

Re: Moving Moodle to Unicode

by James Phillips -
Thanks Tim. I found that suggestion very useful. Sakura editor also looks like a nice little program. 
In reply to Martin Dougiamas

Re: Moving Moodle to Unicode

by Vu Hung -
If I'm not mistaken, Moodle uses Trebuchet as its default font. I suspect Trebuchet does not correctly support all languages (Unicode), including Vietnamese.
To display Vietnamese correctly during the localization process, I had to change Moodle's default font to Verdana (http://el.edu.net.vn/lms/).
Martin, can you change Moodle's default font if I'm right? I manage the Vietnamese course ( http://moodle.org/course/view.php?id=45 ) and have the same problem with the Trebuchet font there.
I'd like to hear everyone's thoughts.
Thanks in advance!
In reply to Vu Hung

Re: Moving Moodle to Unicode

by Vu Hung -
I forgot one thing. I have the same problem with HTMLArea. HTMLArea also uses Trebuchet as its default font, so we cannot type Vietnamese correctly. I had to change the default font to Verdana in the Admin section of Moodle.
I don't know whether other languages have the same problem as Vietnamese. We'd better choose a standard font that supports all languages.
Is that right? Let's discuss.