5 Tips for migrating your MySQL server to a Galera Cluster

The momentum behind Galera is strong, as you can see on mailing lists1 or the standing room only lectures at Percona Live UK. If you follow the buzz you can also see that the discussion is increasingly turning from first time evaluations into people actually running Galera in production. This blog post is based on those experiences that people have shared on the Galera mailing list. I hope to provide some tips and best practices that will be helpful to you when migrating your existing MySQL database to a Galera Clustered HA setup.

MyISAM

MySQL got transactional storage engines in 2001, yet surprisingly many people still run with MyISAM tables. Well, it's not so surprising when you realize many popular PHP apps insist on creating MyISAM tables.

The advice here is to migrate to InnoDB already. You should have done it a long time ago anyway. (And since you asked, no, MyISAM isn't faster for read-heavy workloads anymore.)

Galera synchronous replication is really based on the idea of replicating transactions, and the ability to do rollbacks is a key part of what happens under the covers. So even if supporting MyISAM is possible, using non-transactional tables is just fundamentally at odds with the philosophy of Galera.

That said, in recent versions it is possible to replicate MyISAM tables by setting wsrep_replicate_myisam=ON. (This was implemented last Winter by request from Percona, and boy they were right, people really use MyISAM a lot still.) Please be adviced that this is an experimental feature, and as explained above, will never really be able to work as well and perform as well as using InnoDB tables. It's dangerous in many other ways too: for example non-deterministic functions are not protected for in any way (ie it is even more dangerous than SBR in MySQL classic replication). Still, this feature has turned out to be helpful for some users, for example if applications do occasional updates on some of the MyISAM tables in mysql.* database, which of course can't (yet?) be migrated to InnoDB. (Note that SQL statements like CREATE USER are considered DDL and always replicated by Galera. This issue only arises if you update mysql.user table directly with DML statement.)

There are still 2 valid reasons to use MyISAM tables: If you use fulltext indexes or spatial functions. In both of these cases updates to these tables are probably infrequent anyway, as MyISAM based applications tend to be read-mostly. In this case using wsrep_replicate_myisam can be the solution you need. In case of fulltext indexes I would also consider MySQL 5.6 a very good option: grab the Release Candidate that was just released, migrate your fulltext tables to InnoDB and wait a few months until we get Galera support for MySQL 5.6 ready.

Primary keys

Make sure all your tables have an explicit primary key defined. This is good design just in general, but lack of primary keys can have negative impact on your Galera synchronous replication experience. If you can't think of any other meaningful primary key, then simply add an auto_increment column. Even if your application won't use it, MySQL and Galera will work better with it.

I won't go into details (because I don't care to research them) but the issue is that some Row-Based-Replication events cause full table scans on slaves when the table doesn't have a primary key. Remember that Galera is a tightly-coupled synchronous replication cluster, and you'll realize that these full table scans on the slave will in fact quickly block all nodes in the whole cluster. (To compare, with MySQL master-slave replication you are probably experiencing this same issue, but in that case you just get some minutes of slave lag, while the master looks to be fully operational as far as you're concerned. In Galera design we are against slave lag, so the blocking behavior is by design. Add a primary key and you have fixed the root cause.)

A few people had already migrated to a Galera Cluster and experienced the issues due to tables lacking a primary key. This creates a related problem: If you try to fix the issue and do ALTER TABLE ... ADD PRIMARY KEY ...auto_increment on a Galera Cluster with default settings, you are in for a surprise. The existing rows in that table will get different primary key values on each node in the cluster. If you think that's probably bad... well, you're right!

Quite soon after this, Galera replication will fail due to inconsistent data on the nodes (ie trying to replicate an update a row that doesn't exist on any of the slaves). But all is not lost: Galera Cluster is designed to survive rough treatment: the slaves will immediately shut down as a measure to protect your data integrity, leaving you with only one node running (the one that happened to be the master for that transaction). If you then restart the slaves, they will do a full SST, and your data integrity is restored and everything is ok again.

The culprit to all this lies in how Galera automatically handles auto_increment_increment and auto_increment_offset. If you ever need to add an auto_increment primary key on a running Galera cluster, you need to turn off this feature by setting wsrep_auto_increment_control=OFF. While in that state, make sure you direct all writes to a single master. When done with the alter table, you probably want to turn it back on again. Note that this is only an issue when adding a primary key to non-empty tables. All other ALTER TABLE operations are fine.

We will of course implement a fix to protect from this in future Galera versions. But until then, and even then, just make sure all your tables have a primary key from day one.

Events and triggers

Two features that are supported perfectly well by Galera are Events and Triggers. You can use them and they will do something. But do you actually know what they will do when you run them in a multi-master cluster? For that matter, do you know what they do on a traditional MySQL replication master-slave cluster? I don't. So I usually advice people not to use them. You should not use features that you don't understand.

Use cron for events and try to design your application to work without triggers.

Jay Janssen disagreed with me with regards to triggers. They should work sanely with Row-Based-Replication and thus with Galera too. "Sanely" in this context means that they execute on the master and not on the slave. Still, if you choose to believe Jay here, my advice remains: Don't do something you don't understand. Make sure you test it thorougly so you know what your triggers are really doing. (Jay agreed with me on not using events.)

How to do online migration

Mixing Galera synchronous replication with MySQL asynchronous replication is possible and officially supported. Just make sure log-slave-updates is enabled on all servers. This comes in handy if you have an existing MySQL database (or a master-slave pair) and you want to migrate to a brand new Galera cluster without downtime.

Just setup a new 3 node Galera cluster first. Then take a backup of the existing MySQL server, and provision the Galera cluster with that backup. Now setup MySQL replication from the current MySQL server to one of the nodes in the Galera cluster. There's really nothing special to it, this is the same configuration you would always do when setting up a MySQL slave. You can forget the 2 other Galera nodes are even there.

You can also setup replication in the other direction: Galera node as the master and MySQL node as the slave. This provides for a nice rollback strategy in case something unexpected happens with the migration: Just fall back to using the original MySQL server while you figure out what the problem was.

How to do online migration has been excellently covered - pictures and all - in Jay Janssen's webinar: Migrating to XtraDB Cluster.

Take baby steps

If you now go through your application and realize you'll have to take actions on several points I'd listed above, let me offer one final point of advice: Break your migration into pieces, don't do all in one day. For example, migrate to InnoDB tables first, run in production a few weeks, and then start your Galera migration.

Why? Because migrations of any kind may result in some surprising behavior, bugs, even servers blocking or crashing. If you do all changes at once, and then face some problems, troubleshooting will be harder because it is not easy to know whether you should blame InnoDB, Galera or something else completely. If you do only one major change at once, it is easy to suspect the latest change if something goes wrong.


1) I recently generated some statistics on the mailing lists I follow, which indicate that in recent months the Codership Galera mailing list has been neck-and-neck with mysql-discuss mailing list for the position of most active mailing list in the MySQL ecosystem. We knew it was getting busy, but hadn't quite realized the significance of it!