utf-8 and mysql

utf-8 and mysql

by Robert Collazo -
Number of replies: 12
I've been running Moodle for several months now with no problems.  I need to migrate my Moodle installation to another server.  I'm running Gentoo Linux on my production server (www) and also on the machine I'm migrating to (testbed).  They are both running Apache 2.0.55-r1, MySQL 4.1.14 and PHP 5.1.1.  The only difference between the two machines is that on the testbed machine, I compiled MySQL using the "UTF8" use flag.  However, when I try to view content that is not in English on the testbed machine, it displays like this:

гÑабÑж

rather than this:

грабёж

Any thoughts?  I've tried to view it in both IE and Firefox.  The encoding for the pages is set as Unicode (UTF-8).

The relevant portions of MySQL's my.conf file:

[mysql]
character-sets-dir=utf8
default-character-set=utf8

[mysqladmin]
character-sets-dir=utf8
default-character-set=utf8

[mysqlcheck]
character-sets-dir=utf8
default-character-set=utf8

[mysqldump]
character-sets-dir=utf8
default-character-set=utf8

[mysqlimport]
character-sets-dir=utf8
default-character-set=utf8

[mysqlshow]
character-sets-dir=utf8
default-character-set=utf8

[myisamchk]
character-sets-dir=utf8

[myisampack]
character-sets-dir=utf8

# use [safe_mysqld] with mysql-3
[mysqld_safe]
err-log                                         = /var/log/mysql/mysql.err

# add a section [mysqld-4.1] or [mysqld-5.0] for specific configurations.
[mysqld]
character-set-server            = utf8
default-character-set           = utf8

I also have "AddDefaultCharset off" in my httpd.conf file on both machines.  The only difference between the two machines is that I compile MySQL with the "USE=utf8" flag on the testbed server.

The language file for my site is en_us_utf8 which sets the encoding to utf8.  I'm also attaching a screenshot.  The top portion is the correct way it should be displayed and the bottom portion is the way it appears on the testbed machine.  Any help would be appreciated.
Attachment screenshot.jpg
Average of ratings: -
In reply to Robert Collazo

Re: utf-8 and mysql

by Juan David Martínez Pavony -

Hello Robert.

Did you do a mysqldump to move data from production to testbed? Or are you using the same DB from testbed (connecting testbed to production DB)?

If you choose the first way, there is an option you must use at dump time:


--default-character-set=charset_name

Use charset_name as the default character set. See Section 5.10.1, The Character Set Used for Data and Sorting. If not specified, mysqldump from MySQL 4.1.2 or later uses utf8, and earlier versions use latin1.


Have a look at MySQL WEB site.

Good luck.

In reply to Robert Collazo

Re: utf-8 and mysql

by Don Hinkelman -
Picture of Particularly helpful Moodlers Picture of Plugin developers
Martin announced on another forum that MySQL has some bugs related to unicode.  He recommends upgrade to MySQL version 4.1.16, and said the Moodle version 1.6 will require that level MySQL as a minimum.  I notice you are using 4.1.14, and wonder if that is a source of your problem.
In reply to Don Hinkelman

Re: utf-8 and mysql

by Robert Collazo -
Well, since they were both MySQL 4.1.14, I copied the DB files (/var/lib/mysql/moodle/*.myd,myi,etc.) from www to testbed.

I tried upgrading to 4.1.16 of MySQL and the results were the same when compiling with the USE="utf8" flag..  
In reply to Robert Collazo

Re: utf-8 and mysql

by Eloy Lafuente (stronk7) -
Picture of Core developers Picture of Documentation writers Picture of Moodle HQ Picture of Peer reviewers Picture of Plugin developers Picture of Testers
Hi Robert,

and what was the DB encoding of your original DB? If nothing was specified when you created it (previously to Moodle 1.6), it should be, by default, 'latin1' and PHP DB connection is also set to 'latin1' by default (if you didn't modified php.ini nor my.cnf in your original server).

This means that everything in you old server was sent over one 'latin1' channel (between PHP and MySQL) and stored in 'latin1' tables in MySQL, although it seems that you've used non 'latin1' characters in your old server.

All this implies that the data currently stored in you DB is really wrong (from a purist technical point of view) because it isn't proper 'latin1' data (because you have stored some non-latin1 data) nor proper 'utf8' data (because you have used one 'latin1' communication channel to send the data).

Luckily, this is only from a "purist technical point" of view and MySQL allows to store non-latin1 data in latin1 tables and, when data is retrieved from DB back to PHP the inverse encode is performed and everything seems to work pretty fine.

But this imposes serious limitations. For example, you cannot change the communication channel encoding (as you are doing in you edited my.cnf example above), nor can force any encoding in your web server (because its' possible that you have one mix of different encodings in your DB).

This was one of the main reasons to the new UTF-8.-ized Moodle ASAP, because DB internals were mixing contents in really different encodings and everything was based in the user lang and difficult to handle, more every day, with Moodle trying to communicate with other systems.

Obviously, as that nightmare of different encodings co-exist in a lot of servers, the conversion process isn't as simple as reconvert all the fields from 'latin1' to 'utf8', set the communication channel encoding to 'utf8' and to continue working, because your data isn't proper 'latin1' data as I explained 2 paragraphs above.

This implies that, if your site was being used by users using different encodings, every content (every field, every record!) has to be transformed to 'utf8' from its ORIGINAL encoding (user based) and, well, it cannot be performed by hand. wink

So, who execute/handle this really expensive task? Can you imagine it? Yes, it's Moodle 1.6. Once installed it will detect that your DB isn't running if the proper 'utf8' mode and will offer you the possibility to process all the info in order to convert every content to 'utf8' introspecting in each field, analysing who sent that content to DB and performing the required conversion. cool (great job, Yu!)

At the end of the looong process (depending of its size), once everything has been converted, Moodle itself will put the communication channel to 'utf8' and since that moment, everything (database, communication channel and http) will be running under 'utf8'.

Sounds simple, eh?

So, for sites having 'mixed' contents (contents stored in DB under different encodings) like you, the upgrade path should be something like:
  1. Backup everything before upgrade!
  2. Use MySQL 4.1.12 (and upwards), avoiding to set anything in their configuration files to force any encoding at all (mysql, php, apache). It must work exactly the same than the old DB.
  3. Standard Upgrade Moodle with the newer version.
  4. With this, you'll have one non-utf8 Moodle site running and you should be able to see everything exactly as it was before.
  5. Backup everything again!
  6. Execute the utf8 migration utility. If something stops the migration process it can be safely continued by launching it again (it remembers where it ended).
  7. At the end of the 'utf8' migration, one message will appear telling you what languages you have to install (annotate them).
  8. Login to your new site and install the required languages.
  9. Voilà, everything should be utf8 and be working like a charm! cool


Sites being 100% sure about they are using ONLY ONE encoding can specify it in the process above, and it would save a lot of CPU cycles to the looong migration process. But do it, ONLY if you are sure of the encoding of all your DB contents. Else, data loss could arrive!

And this is all, I hope it had explained a bit more about the utf8 thing, do's and dont's...

Ciao smile
In reply to Don Hinkelman

Re: utf-8 and mysql

by Anthony Borrow -
Picture of Core developers Picture of Plugin developers Picture of Testers
I was not sure whether to post this in install or performance but since my question is really about improving the performance I opted for here. I would like to know if there is a faster way to do the utfdbmigrate.php. In particular I was wondering if this could simply be done through phpmyadmin. I have not looked to see what the script is actually doing but as it chugs through my 300000+ question_states 4 at a time it really seems slow. It almost seems that there is a delay in there. Since the migration involves taking Moodle offline I think it would be good to have something quicker. Any thoughts, suggestions, etc. would be most welcome.
In reply to Anthony Borrow

Re: utf-8 and mysql

by Gavin McCullagh -
On a related note, I started off using moodle v1.5.2 with postgresql and I broke slightly from the norm and created a UTF-8 Language and used UTF-8 in the db.  If I upgrade to v1.6 I'd rather a utfdbmigrate.php script didn't run and munge my data.

Is a check built in that to make sure the data isn't already UTF-8?

Gavin
In reply to Gavin McCullagh

Re: utf-8 and mysql

by Eloy Lafuente (stronk7) -
Picture of Core developers Picture of Documentation writers Picture of Moodle HQ Picture of Peer reviewers Picture of Plugin developers Picture of Testers
Well,

if you DB was set to UTF-8 and all your contents were properly stored as UTF-8 strings, you are really lucky, because, if I'm not wrong, all you have to do is to upgrade to 1.6 (standard upgrade).

Once 1.6 was running it'll autodetect DB encoding and everything will work automatically. You'll detect if Moodle 1.6 has detected your UTF-8 encoding if you don't see the option to perform the UTF-8 DB migration in your admin page.

Personally, I haven't tested this situation, but theory says everything should go ok. Obviously I would recommend you to do the test against one copy of your DB first (or having the mandatory backup previous to upgrade).

Ciao smile
In reply to Anthony Borrow

Re: utf-8 and mysql

by Eloy Lafuente (stronk7) -
Picture of Core developers Picture of Documentation writers Picture of Moodle HQ Picture of Peer reviewers Picture of Plugin developers Picture of Testers
Hi Anthony,

if you are 100% sure that all the contents in your database are using the same encoding (i.e. all the users in your servers were using the same encoding), you have two methods to migrate, gaining time:

1.- Use the standard migration utility inside Moodle, specifying the default encoding. This really helps the migration process to be quickly, reducing execution time to 1/3 of the original one (this is one "brute" approximation). wink This is the official method.

2.- Dump, your entire DB, convert it fro your UNIQUE encoding to UTF-8, create one new DB (with its default encoding to 'utf8') and load your "translated" dump file to the new DB. Everything should be loaded properly, although, perhpas, some indexes adjustments performed by the 1st alternative were lost. If they give you problems and your DB skills are enough you'll be able to rebuild such indexes without problems. This is the non-official method and I only have performed some basic tests for this (mainly because we have focused our efforts in the 1st method).

I really would recommend to use the first one, because it has been tested against a lot of configurations/sites and it always ends with everything ok. Also, I've executed the migration of moodle.org (ant it's a BIG server) in my 3-years old cheap laptop in 36 hours, more or less. If I apply the 1/3 rule above to it, time would be 12 hours, really not too much for my slooow PC laptop running Win32. Also, moodle.org is really one special case because the users table is the biggest in the world (in the world of registered sites wink ) and the migration spends practically the half of the time processing it (because the high number of string-based fields it contains).

Hope this helps, ciao smile
Average of ratings: Useful (1)
In reply to Eloy Lafuente (stronk7)

Re: utf-8 and mysql

by Ahmed Metwally -

Hi guys,

I desperately need help, I upgraded my moodle from 1.6dev to 1.6.1 and database has been migrated. But all my Arabic courses after the Unicode has been applied became rubbish, I can not read any word, it looks like this ÇáÞÇåÑÉ ãÔÑæÚ ÇáÊÚáã . So is there any solution to retrieve these courses or to cancel that upgrade. Pleas save me.

 

Thank u in advance,

In reply to Eloy Lafuente (stronk7)

Re: utf-8 and mysql

by Paolo Oprandi -
I am piloting the migration of my DB to utf8. So far it has taken me 108 hours and still processing. I have a years worth of logs (over 6 million records), which I may have to reduce next time. Nevertheless I am very interested in reducing the processing time by two-thirds as Eloy suggests and was wondering how I could have specified by default encoding.
Many thanks,
Paolo

In reply to Paolo Oprandi

Re: utf-8 and mysql

by Martín Langhoff -
If you know your DB has been used in one language, then you can use the manual strategy that's been described for PostgreSQL.