Database conversion to UTF-8 corrupts swedish content. Hints anyone?

Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Christoffer Aronsson -
Number of replies: 20
We recently decided to upgrade our Moodle installation from version 1.5.2+ (2005060223) to the latest branch 1.6.1+. We've been runing test upgrades on our development server with cloned data from the live machine (mysql-admin dumps) and using the latest Moodle branch each time a new one is released. This has been done over a couple of days now and every time we get the same results. All our swedish content (å,ä,ö) and all "special" characters, such as &, ! etc. gets corrupted after running the utf-8 migration script.

I really can't quite figure out what the problem is.

One thing, however that's rather interesting, is that, before you launch the upgrade script, moodle checks your modules and in our case finds two modules that are not part of the standard moodle-install (book and questionare). It then warns you that "...the content in these modules might become corrupted - proceed with caution." In fact, the ONLY data in our db that doesn't become corrupted is the data located in these modules. Perhaps it's not so strange since the script obviously doesn't convert it.

But the fact that this data actually displays allright, might that not be an indicator that  something else is going wrong here? Is MySQL really spitting out UTF-8? Or is Apache misconfigured? I really need some hints.


Thanks for your time.

Best regards
Christoffer Aronsson
Dragonskolan Upper Secondary School
Umeå Sweden


Our live server:
URL: http://moodle.dragonskolan.se/
OS: Windows Server 2003
Moodle: 1.5.2+ (2005060223)
MySQL: 5.0.18-nt
PHP: 4.4.1
Apache: 2.2.0
Other: Running of an XAMPP 1.5.1-installation.

Our development server:
URL: http://194.153.36.4/
OS: Debian
Moodle: 1.6.1+ (2006050512 at the moment)
MySQL: 5.0.19-standard
PHP: 4.3.10.16
Apache: 2.0.54
Other: -
Average of ratings: -
In reply to Christoffer Aronsson

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Martin Dougiamas -
Picture of Core developers Picture of Documentation writers Picture of Moodle HQ Picture of Particularly helpful Moodlers Picture of Plugin developers Picture of Testers
What language packs do you use? Have you customised any of them? In particular what value for "thischarset" do they contain (in moodle.php)?
In reply to Martin Dougiamas

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Christoffer Aronsson -
Hey M, (quite the honor btw)

We're using sv_utf8, en_utf8, en_us_utf8 and it_utf8 - however, I should have been more detailed in my last post, since as far as I know all these languagepacks work perfectly without any corrupted characters.

The only parts of the site that contain errors are "user data" that has been processed by the migration script.

That should indicate that the utf8 output from the server is working as should - leaving only the conversion script/process as culprint for our corrupted characters?

In reply to Christoffer Aronsson

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by dave streather -
We had this problem and traced it to mysql > 4.1.18 using utf8 as a default.  We had to add the following into the mysqldump command

mysqldump -uxxxxxxx  -pxxxxxxx --default-character-set=latin1 db > output file

it fixed it for us.

cheers
dave
In reply to dave streather

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Paul Nijbakker -

Hi Dave,

Thanks for this tip. It may explain why our upgrade worked fine in the test server (set up with a MySql dump), but not in the production server.

Rgrds,
Paul.

In reply to Christoffer Aronsson

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Yu Zhang -
Hi,

Was iso-8859-1 the only encoding you use on your production site? If so, I would suggest you try to set the glocal lang right before the migration (just select "en" will do). It would also speed up the process.

Please let me know after you try this.

Btw are you saying you are using *_utf8 packs for your 1.5 production site?

Cheers,

Yu

In reply to Yu Zhang

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Christoffer Aronsson -
As far as I know we're using iso-8859-1, the courses in russian might be different, but all swedish content is written in iso-8859-1.

We're using the sv_SE languagepack at our 1.5-site. The *_UTF8 packs are only used on the 1.6 development server.

I've tried to run the upgrade script and specifying the language as sv, but that didn't help.


I've also been looking at the table-structures via mysql-admin and found that indeed all tables are using UTF-8 as the default character set. However, when viewing moodle via a web browser all characters are coded with ISO-8859-1. So (trying to get my head straight)  what I've got here is ISO-8859-1 characters stored in MySQL, that encodes them using UTF-8, right?

Regarding the question whether we're using some custom lang-packs: Yes, we have customized the sv_SE-pack on our 1.5 live-server, but that really isn't a problem, since as soon as we load up our 1.6 development-server (using only standard lang. packs) all languagepack text displays perfectly. And we're not planning to re-customize the sv_SE.UTF-8 lang. pack in the 1.6 distribution so that shouldn't become a future problem neither.


Looking at the facts, I'm guessing that the culprit here is MySQL, who's already using UTF-8 to encode the database - forcing the upgrade script to convert UTF8 to UTF8, resulting in corruption of the data? Or am I loosing my mind here? ;)


PS. Thanks for the book and questionnaire upgrade tips, will do. And to everyone else: Thank you ever so much for sticking with me here, really appreciate it.
In reply to Christoffer Aronsson

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Yu Zhang -
Hi,

Custom langs should not be a problem then. If *all* your contrib was working with no migration at all (which I have no idea why it would for special chars, maybe the isunicodedb is setting the proper settings), I suggest you try directly set the unicodedb flag in the mdl_config table to true. (Just make 1 if it's not there)

Or you can change the tables back to latin, as explained by dave.

Cheers,

Yu
In reply to Yu Zhang

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Christoffer Aronsson -
Hey,

I'll try that Yu and see where it takes me. I was also planning to change the defualt encoding - by hand - for each table in mysql-admin. Perhaps trying to change it from utf-8 to latin1 and THEN running the migration script?


FYI:
I tried to dump my database as dave suggested, converting it to latin1. That was all fine. The only problem was that as soon as I imported it back into MySQL (using linux console: mysql -u root -p <./moodle_dump.sql moodle) the tables were once again using UTF8 as default codepage, acording to Mysql-Admin. So that made little difference I am affraid.

Or perhaps I misunderstood dave and that he was suggesting that I FIRST convert the database using the upgrade script and THEN dump the db to latin1, and import it again?
In reply to Christoffer Aronsson

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by dave streather -
Hi Christoffer

No, we didn't have to run the upgrade script we simply dumped the database with the relevant switch.

Sorry -  After checking  I've just found there were other switches we used, I hope I haven't inconvenienced you - the full command we used was

mysqldump -u idxxxxxx --password=ppppppp --compact -C -Q -e --default-character-set=latin1 moodledata > moodle-database.sql

From there the dump file was just loaded as per normal and all tables were correct.

cheers
dave
In reply to Christoffer Aronsson

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by N Hansen -
I think Christoffer is onto something about the UTF-8 to UTF-8 conversion and MySQL, and I 'm convinced that is why I can't convert my own site properly to UTF-8. I filed bug 6161 several weeks ago about this with a link to a page that suggests that Christoffer's and my theory is correct. Yu, can you please confirm whether this is true or not?
In reply to N Hansen

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Yu Zhang -
Hi Nicole,

I have seen the bug report from you, and this is what we are already doing with the unicode conversion. We first convert to column type to  binary type (longblobs to be exact), then we proceed with the conversion.

The problem Christoffer had is probably because the setup_is_unicodedb() (in setup.php) is returning true as it sees the database encoding already in utf8 (most mysql uses latin1 as the default), and forcing everything to be stored/retrieved in unicode encoding . If this is the case, he could just run the migration with any global unicode language (say, en_utf8 will do, this would also help to convert the langugages in the database records from xyz to xyz_utf8, so he should do this). When we do the conversion, however, we have to get the charset and language associated with records from user, course or site languages stored in moodle database. So users in his site are likely to have se, en, or it as the user language. Migration will attempt to convert the records from iso-8859-1 to utf8, which is not needed and wrong.

I am not sure, if this is the similar situation you are experiencing Nicole, I suspect not because this should only happens with a very new moodle.

Edit: Christoffer, do you actually see the migration link in the admin page?

In reply to Yu Zhang

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Christoffer Aronsson -
Hey Yu,

I've been busy with some other projects the last few days and haven't been able to try your tips. However today I'll make a serious atempt at fixing this problem once and for all.

Some info:
I can see the upgrade database link via the admin page, this is how I execute the upgrade script.
I inspected the the unicodedb column in mdl_config and it is set to 'true'.


FYI I ran a breief test-upgrade yesterday, choosing en_UTF8 when the script asks what language the databse is encoded with. This resulted in a somewhat better result than before, with less corruption of our swedish content (excellent!).

However, I still noticed that some special characters such as ¿ etc. were still corrupted by the upgrade. But, as I said, this was only a quick and dirty test where the upgrade also managed to screw up our questionaire-content and some other content as well, so as a preliminary solution - selecting an *_UTF8 language, when prompted by the migration-script, is on it's way towards solving our problems. As I said, I'll be all over this today and will report any progress to you as we go.

Best regards.
In reply to Christoffer Aronsson

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by N Hansen -
Christoffer-Tell us what happens when you create NEW content in the site after the upgrade. i seem to recall I had problems with this as well.
In reply to N Hansen

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Christoffer Aronsson -
Posting new content works fine.

After testing some more today I realized that choosing en_UTF8 as sitewide language when running the migration script only solved our corruption issued when it came to the lower-case letters (åäö) however all upper-case letters (ÅÄÖ) were corrupted. Ah, makes no sence to me..

And dumping/importing the database as suggested (after the migration) doesn't seem to correct the already corrupted data neither.

Phew.. I also think I'm a bit too tired right now to really wrap my head around this. I'll be back tomorrow though.

Thanks again for sticking with me on this.
In reply to Christoffer Aronsson

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Christoffer Aronsson -
Hey everyone,

I'd just like to give this thread a shameless push to the top and send out another distress call. I seem to have emptied most of my ideas and still gotten nowhere with this issue.

I'll set of for some studies at the university, starting monday next week and the teachers at work here are really keen on using moodle 1.6 on our production server while I'm of, so, any hints on how I should best proceed would be much appreciated. The deadline aproached with terrifying speed ;).

Anyways, I've been thinking of reconfiguring/reinstalling mysql so that it'll use Latin1 as default encoding to circumvent the problem, would that be an idea? Any hints on this?
In reply to Christoffer Aronsson

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by dave streather -
Hi Christoffer

Did you try the full command line I gave in my last posting to you? and what happened?   What we did with our 2Gb database was to run the dump with that mysqldump command, then uploaded that dumpfile, then run the upgrade script.  We have done this several times successfully in our testing.

mysqldump -u idxxxxxx --password=ppppppp --compact -C -Q -e --default-character-set=latin1 moodledata > moodle-database.sql

Sorry I've no more to offer but that's what worked for us.

cheers
dave
In reply to dave streather

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Christoffer Aronsson -
Hey Dave,

Man, you just saved my day smile. I tried using the command line you posted to dump the data and then import the dumpfile via mysqls command line utility a couple of times - with no luck. Everytime I analysed a table after a fresh import, mysql-admin would report utf8 as the default character set. So I decided to analyse the dumpfile to see if I could find any hints, and I found some very interesting results:

(snip)...
ENGINE=MyISAM DEFAULT CHARSET=utf8;
...(snip)

Every create table in the dumpfile used that same argument, specifying utf8 as default charset, despite the fact that I used the --default-character-set=latin1 option in the command line. Very strange. Perhaps there's something wrong with my console settings that's affecting mysql-dump? Anyway, once the culprit was located I pulled some vim magic out of my hat ( :%s/CHARSET=utf8/CHARSET=latin1/g ) and imported the file again.

This time it actually seem to work, at least mysql-admin showed me that it was now using latin-*something* to encode the tables... I ran the migration script (crossing my fingers) and voila! Problem solved! No corruption, just glorious UTF8 all over the place. Well, at least as far as I can see at the moment. I haven't looked through the entire site yet, but this looks very promising.

Thank you ever so much for forcing me into trying that command line solution once again.
In reply to Christoffer Aronsson

Re: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Paul Nijbakker -

Hej Christoffer,

You probably already did this, but I wanted to point out just in case, that you should get the latest versions of the book and questionnaire module (that are ready for moodle 1.6.) before proceeding with the moodle upgrade (and database conversion), so that Moodle will be able to convert the database tables for these modules too.

Rgrds,
Paul.

In reply to Christoffer Aronsson

Svar: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Anders Berggren -
These non-standard modules were not deleted
from the original Swedish language pack. I cleaned
most of the pack before sending it to Koen Roggemans
for in-checking. He also did what he had time for but
some things were not dealt with. My CVS login data
are unfortunately no longer valid. Sorry for inconveniences
due to my 'sloppiness' leende. I have not had but a small portion
of this problem myself. Erase the non-standard files in the
lang pack and see what happens.

The Swedish translator
In reply to Anders Berggren

Re: Svar: Database conversion to UTF-8 corrupts swedish content. Hints anyone?

by Patrik Nilsson -
was it the languishpacks that was faulty?
I had the problem after converting to utf-8 database and cvs update.
But I did a clean cvs to 1.7 in a folder and used the old converted database and now it does store åäö properly. The old text in the database is still korrupt but everything new and changes of the old text is now stored correctly in the database and shows correctly on the page for me if anyone still got the problem this maybe is a hint. But I assume it was the packs that are now fixed?