Switching to Unicode - problems

by Daniel Mikšík -
As more and more of our teachers are asking for the possibility of using two or more languages in a course, I tried switching to Unicode on my Moodle (1.3.2) test site: I recoded the Czech language pack (others will follow), dumped the database (data stored in iso-8859-2 encoding), recoded the dump to utf-8 and loaded it back, and set $CFG->unicode to true in config.php. Displaying and saving information works fine almost everywhere; the three problems I found are listed below.

1. Names of activities are displayed incorrectly on the course main page. If I understand it right, they are taken from the 'modinfo' column in the 'mdl_course' table. As they are saved in urlencoded form in the database, they did not get recoded from iso-8859-2 to utf-8 with the rest of the data; the trouble is that the urlencoded form differs between iso-8859-2 and utf-8. Example:

Plat%ED+z%E1kon+d%BEungle%3F - stored in the db, originally through a form displayed in iso-8859-2;

Plat%C3%AD+z%C3%A1kon+d%C5%BEungle%3F - urlencoded name if a form displayed in utf-8 is used.

Workaround: After I update any single activity in a given course, all activity names in the course get urlencoded properly for utf-8 and are displayed correctly on the course main page.
This method is not very practical when you have a number of courses. My question is whether the rebuild_course_page function in course/lib.php could be used for all courses while ensuring that the data will be urlencoded for utf-8.
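
The difference between the two urlencoded forms, and one possible way to convert a stored value, can be sketched outside Moodle. The snippet below is only an illustration in Python (Moodle itself is PHP, and recode_urlencoded is a hypothetical helper, not a Moodle function). It decodes a urlencoded string as iso-8859-2 and encodes it again as utf-8. It handles a single urlencoded string; the actual layout of the 'modinfo' column is not shown in this thread and is not assumed here.

```python
from urllib.parse import quote_plus, unquote_plus


def recode_urlencoded(value, source="iso-8859-2", target="utf-8"):
    """Re-encode a urlencoded string from one character encoding to another."""
    # Undo the urlencoding, interpreting the escaped bytes in the old encoding.
    text = unquote_plus(value, encoding=source)
    # Urlencode again, this time producing bytes in the new encoding.
    return quote_plus(text, encoding=target)


name = "Platí zákon džungle?"
latin2_form = quote_plus(name, encoding="iso-8859-2")
utf8_form = quote_plus(name, encoding="utf-8")

print(latin2_form)
print(utf8_form)
print(recode_urlencoded(latin2_form) == utf8_form)
```

A script along these lines would still have to locate each urlencoded name inside the stored value, which is why regenerating the data from within Moodle may be the safer route.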


2. In course/mod.php, line 449, the strtolower function does not work with Czech (and other languages?) when the given language file is stored in utf-8 and the update/add page is being displayed in utf-8; with iso-8859-2 everything was working fine.
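
A likely cause, stated here as an assumption rather than something verified against PHP, is that a single-byte case conversion treats every byte of a UTF-8 string as one character of the locale's encoding. The Python sketch below simulates that behaviour (single_byte_lower is a hypothetical stand-in, not PHP's strtolower) and compares it with a Unicode-aware conversion.

```python
def single_byte_lower(utf8_bytes, locale_encoding="iso-8859-2"):
    """Lowercase bytes as if each byte were one character of locale_encoding."""
    # Misread the UTF-8 bytes as single-byte text, which is the assumed bug.
    as_single_byte_text = utf8_bytes.decode(locale_encoding)
    return as_single_byte_text.lower().encode(locale_encoding)


word = "DŽUNGLE"

# Simulated single-byte conversion applied to UTF-8 data.
broken = single_byte_lower(word.encode("utf-8")).decode("utf-8", errors="replace")

# Unicode-aware conversion, comparable in spirit to mb_convert_case.
correct = word.lower()

print(repr(broken))
print(repr(correct))
```

In this simulation the first byte of the two-byte "Ž" is changed on its own, so the result is no longer valid UTF-8.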


3. Dates are displayed incorrectly. My locale is cs_CZ, and there is no AddDefaultCharset in the Apache config.

Thanks in advance for any and all suggestions.
(And, as ever, many thanks for Moodle and its community.)
Dan
In reply to Daniel Mikšík

Re: Switching to Unicode - problems

by Petr Skoda -
Automatic conversion to UTF-8 is not a problem. I have written a simple PHP script to convert all language packs and should be able to release it soon. I used the iconv() function for the conversion; it works fine when converting to UTF-8, but not in the reverse direction (if it encounters an unsupported character it just stops without notice).

UTF-8 problems:
  • case conversion is defined in the locale only for a specific encoding - on Unix cs_CZ.ISO-8859-2, on Windows csy (win-1250); I do not know how to change it on Windows. Case conversion is not country specific; it is defined in the Unicode standard, and the mbstring extension (mb_convert_case) can handle it.
  • date/number formatting - defined in platform locale again, IMHO I18Nv2 could be the solution to this problem because it is platform independent.
  • substring, length of string, individual characters of a string - the only way I know is to use the mbstring extension. The mbstring extension is very complex, it was not originally made to support UTF-8, it is not officially supported and IMO will not be. On the implementation side, we could use function overloading or some helper methods.
  • regex - the ereg functions can be overloaded by mbstring; the PCRE preg_* functions must use the 'u' modifier. Both of them seem to be a bit buggy, though.
  • encoding conversions - iconv works fine for conversion to UTF-8; mb_convert_encoding can even convert from UTF-8 with some substitution characters for missing ones, but does not support all Windows encodings (such as Win-1250)
  • alphabetical sorting - very problematic, because it is country dependent. We could define a callback function for each locale. Another solution would be to use database sorting, but only a small number of locales are supported (Czech, of course, is supported very well).
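
Two of the points above, string length and substrings, and conversion from UTF-8 back to a legacy encoding, can be illustrated with a small sketch. It is written in Python purely for demonstration and does not show how mbstring or iconv behave internally; the sample text is made up.

```python
text = "Žluťoučký kůň €"

# Length and slicing differ between characters and UTF-8 bytes.
utf8 = text.encode("utf-8")
char_count = len(text)
byte_count = len(utf8)
first_char = text[:1]
first_byte = utf8[:1]  # only the first half of the two-byte "Ž"

# Converting from UTF-8 to a legacy encoding can fail:
# ISO-8859-2 has no euro sign, so a strict conversion raises an error.
try:
    text.encode("iso-8859-2")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With substitution, the missing character is replaced by "?".
substituted = text.encode("iso-8859-2", errors="replace").decode("iso-8859-2")

print(char_count, byte_count)
print(first_char, first_byte)
print(strict_ok)
print(substituted)
```

Byte-oriented functions see the larger count and can cut a character in half, which is the reason for wanting character-aware helpers.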

In reply to Petr Skoda

Re: Switching to Unicode - problems

by Petr Skoda -
I have just noticed that case conversion in Unicode is also locale specific (the famous Turkish small dotless 'i' letter). See this interesting article.
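
A minimal sketch of the Turkish case, in Python for illustration only: the default lowercase mapping turns 'I' into 'i', whereas Turkish expects the dotless 'ı'. The lower_turkish helper is hypothetical and covers only these two letters.

```python
DOTLESS_SMALL_I = "\u0131"   # ı
DOTTED_CAPITAL_I = "\u0130"  # İ


def lower_turkish(text):
    """Lowercase text using the Turkish dotted/dotless i rules."""
    # Handle the two locale-specific letters first, then apply the default mapping.
    text = text.replace("I", DOTLESS_SMALL_I).replace(DOTTED_CAPITAL_I, "i")
    return text.lower()


default_result = "ISTANBUL".lower()
turkish_result = lower_turkish("ISTANBUL")

print(default_result)
print(turkish_result)
```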
In reply to Petr Skoda

Re: Switching to Unicode - problems

by Daniel Mikšík -
Thank you, Petr, for all your suggestions.

  • I solved my problem no. 2 by replacing strtolower with mb_convert_case as you suggested - seems to work fine.
  • After playing with mb_convert_encoding for some time I found out that the simplest way to correctly display dates in different languages in Unicode is to set the $string['locale'] in lang/xyz/moodle.php to xyz_XYZ.UTF-8 (e.g. cs_CZ.UTF-8 in lang/cs/moodle.php, fr_FR.UTF-8 in lang/fr/moodle.php, etc.).
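
The effect of the locale name on date formatting can be tried outside Moodle. The following is a rough Python sketch, not Moodle code; it assumes that date strings come from the platform locale and that the named locale may or may not be installed on the machine. The month_name helper is hypothetical.

```python
import datetime
import locale


def month_name(locale_name, month):
    """Return the month name under locale_name, or None if it is not installed."""
    previous = locale.setlocale(locale.LC_TIME)  # query the current setting
    try:
        locale.setlocale(locale.LC_TIME, locale_name)
    except locale.Error:
        return None
    try:
        return datetime.date(2004, month, 1).strftime("%B")
    finally:
        # Restore the original setting so other code is not affected.
        locale.setlocale(locale.LC_TIME, previous)


for name in ("C", "cs_CZ.UTF-8"):
    print(name, repr(month_name(name, 1)))
```

If the UTF-8 variant of a locale is missing on the server, the helper returns None, which would correspond to dates falling back to some other format.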
  • I have no solution to my problem no. 1 yet.
  • I gave iconv a try but soon switched to recode.
  • It would be great to have UTF-8ed language packs in CVS with your script...

Thanks again.

Dan