conversion from utf8 to utf8mb4 - which columns?

conversion from utf8 to utf8mb4 - which columns?

by Josiah Carberry -
Number of replies: 10

I note the message during the upgrade from 3.2.3+ to 3.3 that my db with its utf8 encoding might not be able to accommodate some new characters that require 4 byte encoding, i.e., utf8mb4.

Looking into what is required, I discover the warning that changing from utf8 to utf8mb4 might require adjustment of the size of the column, given that some characters are encoded with up to 4 bytes, as opposed to the 3 byte max. in utf8.

The problem is to identify which columns might be concerned by this issue and what assumptions the Moodle code is making. For example, if a column is defined as varchar(100), it could theoretically hold 33 characters of 3 bytes each, whereas with utf8mb4, the maximum would be 25 characters of 4 bytes each. So if a user enters 33 characters encoded with 4 bytes each, the update will fail, as unlikely as that might be. But does Moodle check this sort of thing? What margin of error is included in the definitions of the columns?

It sure would be nice to get clear guidance about which columns are concerned and whether it is expected that any changes in column size would be needed.

Average of ratings: -
In reply to Josiah Carberry

Re: conversion from utf8 to utf8mb4 - which columns?

by Bret Miller -
Picture of Particularly helpful Moodlers

I haven't yet done this either, but according to the documentation (https://docs.moodle.org/31/en/MySQL_full_unicode_support), 

"Moodle comes with a Command Line Interface (CLI) script for converting to full UTF-8 for MySQL (and MariaDB). Before Moodle versions 3.1.5 and 3.2.2 this conversion tool would only change the Collation to some variant of 'utf8_bin'. 'utf8_unicode_ci' was the recommended Collation. We now recommend using 'utf8mb4_unicode_ci' which supports four byte characters (utf8_unicode_ci only supports three).

This script will attempt to change the database Collation, Character set, default table settings and column definitions."

 sets the database character set to utf8mb4 and the collation, and sets every table and column that has a collation to utf8mb4_unicode_ci which you could do manually using phpMyAdmin. Note that it also requires changes to the MySQL or MariaDB configuration so that is the default collation and utf8mb4 is the default character set going forward.

The SQL statements are:

ALTER DATABASE $CFG->dbname DEFAULT CHARACTER SET $charset DEFAULT COLLATE = $collation

ALTER TABLE $table->name DEFAULT CHARACTER SET $charset DEFAULT COLLATE = $collation

ALTER TABLE $table->name

                        MODIFY COLUMN $column->field $column->type

                        CHARACTER SET $charset

                        COLLATE $collation $notnull $default

Honestly, the cli script would be easiest. You might also be able to google how to do it in all sql.

In reply to Bret Miller

Re: conversion from utf8 to utf8mb4 - which columns?

by Josiah Carberry -

Thanks for this, Bret. But my question isn't about how to change the encoding (and collation). I already have a complete script to do that. And I note that the CLI script has various constraints, such as refusing to run unless the db is configured with Barracuda, but that is neither here nor there.

In reply to Josiah Carberry

Re: conversion from utf8 to utf8mb4 - which columns?

by Bret Miller -
Picture of Particularly helpful Moodlers

Well, according to the doc, it's optional at this point. UTF8 is still acceptable. Judging from my experience with compressed rows, if you don't get everything, you'll run into the issue again at some point so it's probably not a matter of which columns need it, but rather how Moodle checks for the condition. Hopefully, it's done per column to determine whether the 4-byte character set is present to give the users the features it provides. I rather doubt the developers would take the time to list the tables and columns required for each feature for which it's necessary. All or nothing is safer to prevent the case when one more column was required than was documented.

Just my $.02.

The script does take a long time. But the alternative of dumping the database, fixing the SQL and re-importing it also takes a while and that would seem to have a more direct effect on using the site. 

In reply to Bret Miller

Re: conversion from utf8 to utf8mb4 - which columns?

by Rick Jerz -
Picture of Particularly helpful Moodlers Picture of Testers

Bret, I am now experimenting with this conversion process.  Maybe you or others know the answers to these questions.

1) I tried dumping the database and editing the sql file.  I replaced every  CHARSET=utf8 with  CHARSET=utf8mb4, and every COLLATE=utf8_unicode_ci with COLLATE=utf8mb4_unicode_ci.  I changed the collation on my MAMP database to utf8mb4, then reimported the sql file.  Everything was looking fine, but when I finally upgraded to Moodle 3.3, the environment still said that I needed to covert my tables.

What did I miss or do wrong?  Are there more find/replaces needed?

(I am now noticing lines such as:

`name` varchar(28) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

My guess is that all of these "utf8_Unicode_ci" need to be replaced, right?)

2) So now I am running the cli script, and yes, slow but progressing.  Do you know if this script can be run while students are using my production moodle?  Or must I first put Moodle into Maintenance Mode?

Also, I am running this script on the same database that I thought I had converted in my question #1.  The script shows that tables are still being "converted."  So it appears that my technique in Question #1 really didn't have success.


It seems to me that the find/replace can be a lot quicker, but one would really have to put their moodle into maintenance mode for this method.


In reply to Rick Jerz

Re: conversion from utf8 to utf8mb4 - which columns?

by William Stewart -

Rick,

I have run this CLI script several times for learning/practicing purposes. I upgraded my MAMP installation last night to 3.3 and wow was it really slow to use Moodle while the script was running. It takes a long time, particularly when hitting the mdl_logstore table. Previously I just let it run to see the output and never ran into any errors-output was always successful. (And either way, I am on a VPS so I don't need to worry about NOT being able to change DB settings should I need to down the line).

As per Moodle documentation, I executed : php mysql_collation.php --collation=utf8mb4_unicode_ci

And everything was okay without having to adjust any settings manually- that is, the script worked successfully. To verify, I also ran the check_database_schema.php to see any output though this relates more to upgrading plugins and moodle.

My sql dump is around 250mb and it takes around 30 minutes on my Macbook pro. I shudder when I think that our db used to come in at around 1Gb and imagine that it would take potentially 2 hours?!

I asked our host about this some time ago and he recommended simply uploading the converted db onto the server and then simply pointing to that in the config file. Given how long it takes to convert everything, you could run it on the production server when you expect there to be little to no use. Otherwise, you could do the conversion on MAMP, upload to production SQL server, and point to that-but again, this would most likely need to be done at a late hour as well since you don't want to lose potential significant activity. I'm probably going to try that route out since it takes so long prior to upgrading to 3.3 later in June.


Average of ratings: Useful (1)
In reply to William Stewart

Re: conversion from utf8 to utf8mb4 - which columns?

by Rick Jerz -
Picture of Particularly helpful Moodlers Picture of Testers

William, I finished my MAMP conversion.  My database (dump) is around .5GB.  It took the conversion script around 45 minutes to complete.  Yes, mdl_logstore took the longest.

I am gaining confidence to do this on my production Moodle VPS.  I am considering doing it via a CRON job that runs around 2am.  Did you make the recommended changes to my.cnf?  I did on my MAMP install, and everything worked fine.  I am confident that these changes will work well on my VPS.

Do you think that there is any advantage of doing the utf8 t utf8mb4 conversion prior to upgrading from Moodle 3.2 to Moodle 3.3?  Does order make any difference?  Today, in MAMP, I upgraded to 3.3 first, then converted from utf8 to utf8mb4.

Thanks for your thoughts.

In reply to Rick Jerz

Re: conversion from utf8 to utf8mb4 - which columns?

by William Stewart -

Rick,

I didn't make the changes to my.cnf only because I was experimenting first to see what the output of the conversion script would be. Had no errors. Of course, this might be different on the production server, however, if I can't figure out exactly how to make those changes, I'll simply ask our host to do it on my behalf smile.

As for the order, based on what I've read, all new 3.2 installations are set up by default at utf8mb4. This really only effects sites that have been around prior to that. This is true of mine, and presumably your site as well. Thus some in-between utf character encoding set is used that provides most coverage for characters, but not all such as various emojis or Asian language scripts.

As far as I understand, you can do the conversion now and now worry about it later. I did the conversion on a 3.2.2+ build in my MAMP and no issues, sans how long it takes.

In reply to William Stewart

Re: conversion from utf8 to utf8mb4 - which columns?

by Rick Jerz -
Picture of Particularly helpful Moodlers Picture of Testers

William, thanks for your reply.

This morning, I added the few more requirements in my.cnf.  Restarted MySQL.

Thank, I updated from utf8 to utf8mb4.  No problem, but it did take 30 minutes.

Soon I will be upgrading my Moodle to 3.3.

In reply to Josiah Carberry

Re: conversion from utf8 to utf8mb4 - which columns?

by Marina Glancy -
Picture of Core developers Picture of Moodle HQ Picture of Moodle Workplace team Picture of Particularly helpful Moodlers Picture of Peer reviewers Picture of Plugin developers Picture of Testers

The capacity of the columns will not change after changing the collation. 

We are now investigating at Moodle HQ the possibilities to speed up the conversion process. There are some interesting comments on MDL-58729 that suggest that using CONVERT will be faster than using MODIFY COLUMN

In reply to Josiah Carberry

Re: conversion from utf8 to utf8mb4 - which columns?

by Adrian Greeve -
Picture of Core developers Picture of Moodle HQ Picture of Peer reviewers Picture of Plugin developers Picture of Plugins guardians Picture of Testers

There have been numerous Asian characters that are not supported by the three byte MySQL utf-8. This may not impact a lot of sites that are not located in Asia, but there has been an increase in emoji being introduced and these are using 4 bytes as well. MySQL does not handle saving four byte characters in the three byte system well (it sends back an error and nothing is saved).

I'll try to answer the questions asked in this discussion.

Which columns need to be changed?

All the columns that could possibly contain 4 byte characters need to use utf8mb4. Having some columns with a collation of utf8 and others using utf8mb4 is likely to run into problems at some point. The safest solution to make sure there are no exceptions is to convert the whole database.


Why do I need barracuda, compressed row format, and large prefix settings?

We have indexes in Moodle on varchar columns. A lot of these columns are set to a length of 255. Moving over to 4 bytes means that the size of the index needs to be reduced, or we need to use dynamic rows and long prefixes. The first option was discarded, as information in those columns has the possibility of using the full 255 length and that information would have to be concatenated and lost to fit the new length, also the indexes for those varchar columns would have to be redefined with the new size limit; The second option kept all the information in the tables, and the indexes for the varchar columns did not have to be redefined. To use the second option it is required to have a compressed row format and large prefix settings enabled. This means that the row format must use 'Barracuda'.


First option documentation here https://dev.mysql.com/doc/refman/5.6/en/charset-unicode-conversion.html

The second option is located at the same address. Please read the second 'Note' on the page.


Can I run the script while users are accessing my site?

I wouldn't recommend having the script running while the site is still active. I think that there is a high possibility of errors occurring if the database is doing both and the possibility of losing data. To avoid problems it is recommended to put your site into Maintenance mode


Should I upgrade to the next version of Moodle before running the conversion script?

The order in which you run the script and upgrade to the next version of Moodle makes no difference.


The conversion tool is too slow.

Yes, we are looking at ways to speed up the conversion process. Please follow MDL-58729

Average of ratings: Useful (2)