A few early notes on this, based on my experience unicode-ifying the Midgard CMS/CMF (as part of NZ's e-govt portal). This sounds like a plan, but it's very much a draft and, in the end, only a suggestion. Chime in, shred it to pieces.
The first step is getting Moodle itself to be Unicode-clean in its handling of forms and character entities. This is relatively easy: it involves making sure we send the right encoding in our HTTP headers, and that htmlentities() and friends are always called with an explicit charset. At this stage we should also convert all our strings files to utf-8 (iconv is your friend).
With this, we ensure that Moodle+MySQL will store/retrieve the data correctly, even if neither is too aware of the fact that we are using utf-8. Some string operations may not work as expected on non-ASCII texts, but different languages can share the same HTML page. This will sort out new installations.
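To make the "iconv is your friend" step concrete, here is a minimal sketch in Python (not Moodle's PHP, purely for illustration) of what the conversion does to the bytes. The function name and the choice of iso-8859-1 as the source encoding are assumptions for the example; a real strings file would be converted from whatever encoding its language pack declares.

```python
def convert_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8.

    This is roughly what `iconv -f <source_encoding> -t utf-8` does.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical sample: "café" as it would be stored in an iso-8859-1 strings file.
legacy_bytes = b"caf\xe9"
utf8_bytes = convert_to_utf8(legacy_bytes, "iso-8859-1")

# The page serving this text should then declare the matching charset.
content_type_header = "Content-Type: text/html; charset=utf-8"
```

The single byte 0xE9 becomes the two-byte sequence 0xC3 0xA9, which is why the header and the stored bytes have to agree.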
Working with Te Reo Maori, we found that string ops mostly worked OK on utf-8 text, as you'd expect. Regexes that try to match [:alpha:] and similar classes will get in trouble, as will strpos() and strlen(), since they count bytes rather than characters. It isn't that bad though, and I hope PHP5 is smarter about this. Writing replacements for these functions that support unicode in PHP is hard and error-prone, and they will be awfully slow. This is especially true of utf-8. Not long ago I wrote some Perl routines to find invalid utf-8 characters, and it was fun.
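A small illustration of the byte-versus-character problem, again as a Python sketch rather than the PHP or Perl in question. Python's bytes type behaves like a byte-oriented string function would, and its str type behaves like a Unicode-aware one. The helper name is made up for the example.

```python
text = "Māori"                      # 5 characters
encoded = text.encode("utf-8")      # 6 bytes: "ā" takes two bytes in utf-8

char_length = len(text)             # what a Unicode-aware length reports
byte_length = len(encoded)          # what a byte-oriented length reports

char_pos = text.find("o")           # position counted in characters
byte_pos = encoded.find(b"o")       # position counted in bytes


def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid utf-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None
```

For example, `first_invalid_utf8_offset(b"caf\xe9")` flags the stray latin-1 byte, while properly encoded text returns `None`.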
Skip the Database for now …
There is one more important step, which is telling the database we are storing unicode. I think it's really hard, and (most importantly) I am not sure it will have many benefits to Moodle. More on that later. It may make sense to deal with migration of content at this stage, depending on the release cycles, and what we find about the database-level migration.
Convert existing content
So, in this weird plan outline, there are two migrations. The first migration should read all the varchar fields and iconv them (we'll need the iconv PHP extension installed to upgrade). As discussed elsewhere in this thread, we'll probably need to guess what the encoding is likely to be, from a combination of site lang, user lang, course lang. Perhaps some module-specific data may need further logic to handle it. Still, I think we should keep the original in a backup table (indexed on an md5 hash of "$tablename-$id-$fieldname") should we need to reconvert.
Perhaps we'll need a means for users to identify mis-encoded texts, and self-select an alternative encoding that looks right to them. How the UI should work to be scalable and applicable to different scenarios is beyond me.
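The content migration could be sketched roughly like this, in Python for illustration only. The function names, the table and field names, and the candidate-encoding order are all assumptions. Trying utf-8 first (so already-converted text passes through untouched) is my own guess at a sensible default, not something settled in this thread.

```python
import hashlib


def backup_key(table: str, row_id: int, field: str) -> str:
    """md5 of "$tablename-$id-$fieldname", used to index the backup table."""
    return hashlib.md5(f"{table}-{row_id}-{field}".encode("utf-8")).hexdigest()


def convert_field(raw: bytes, candidate_encodings):
    """Try each guessed encoding in order; return (utf8_bytes, encoding_used).

    Single-byte encodings such as iso-8859-1 accept any byte sequence,
    so they should come last in the candidate list.
    """
    for encoding in candidate_encodings:
        try:
            return raw.decode(encoding).encode("utf-8"), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the field")


# Hypothetical row: table "forum_posts", id 42, field "message".
key = backup_key("forum_posts", 42, "message")
converted, used = convert_field(b"caf\xe9", ["utf-8", "iso-8859-1"])
```

The original bytes would be stored under `key` before the converted value is written back, so a wrong guess can be redone later with a different encoding.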
Tell the Database … perhaps not?
Both MySQL and Postgres are used to having their VARCHAR and TEXT fields abused with weird stuff, so their handling of them is binary-safe; you can store a GIF there, and it'll work. So storing utf-8 there is safe, even if we don't tell the RDBMS. Naughty, ah!
What's the downside of not telling the database?
* String matching when running searches is limited to exact matching. Moodle is not too badly affected -- we don't use LIKE much. (With the new forum search we may be using LIKE/ILIKE more.)
* String operations will do strange things (UPPER()/LOWER()). Again, we don't use them.
* Sorting from the DB won't be "correct" for non-ASCII scripts. Truth is, it's really hard to get the right sorting for non-ASCII; for starters, we need to get the locale well defined. Any situation with mixed locales won't have a definite sorting.
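To show what the last two points look like in practice, here is a Python sketch that treats utf-8 text as plain bytes, roughly as a database that hasn't been told about the encoding would. The sample words are made up, and real database behaviour depends on the server's configured collation, so take this only as an analogy.

```python
words = ["zebra", "éclair", "apple"]

# Byte-wise sort: every non-ASCII lead byte is >= 0x80, so "é" lands after "z".
byte_sorted = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]

# Byte-wise uppercasing only touches ASCII letters and leaves "é" alone.
byte_upper = "éclair".encode("utf-8").upper().decode("utf-8")

# A Unicode-aware uppercase handles the accented letter too.
text_upper = "éclair".upper()
```

Nothing gets corrupted here. The results are just not what a reader of the language would expect.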
What's the downside of actually telling the database?
* We'll bump up our requirements (MySQL 4.1 I think, should check postgres) and still, support on MySQL is limited. See in http://dev.mysql.com/doc/mysql/en/charset-unicode.html that it only supports 1-, 2- and 3-byte characters. 4-byte chars are unsupported.
* Good utf-8 support depends on the database driver too. People will need recent PHP versions with current drivers.
Finally, it can be hard to migrate to a utf-8 database encoding. Postgres seems to have better utf-8 support than MySQL -- including charset conversion functions that could save a ton of time -- but you have to completely recreate the database to change the charset. Under MySQL, charset is a table property -- we'll still have to recreate the tables.
On both databases, migration will be slow, probably should not be done via a browser due to timeouts, and will probably mean large temp files and such. Sounds rather messy.
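Given the 3-byte limit mentioned above for MySQL, one cheap check before attempting a database-level migration is to scan content for characters that need four bytes in utf-8. This is a Python sketch with a made-up function name. How the text is fetched from the database is left out.

```python
def four_byte_characters(text: str):
    """Return the characters in text that need four bytes in utf-8.

    These are the code points above U+FFFF, i.e. outside the
    Basic Multilingual Plane.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


# U+1D11E (musical G clef) is a 4-byte character; "ā" and "é" are not.
sample = "M\u0101ori caf\u00e9 \U0001D11E"
offenders = four_byte_characters(sample)
```

If the list comes back empty for all existing content, the limit is unlikely to bite during migration, though new content could still introduce such characters later.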
Some additional resources
A brief discussion of all the steps required to get end-to-end Unicode going:
Tired! I'm off to prepare 1.4.4. Hope this is of use …