What we'll be talking about here is a generic pluggable replication architecture of the form:
      clients
     |  |  |
     V  V  V
,----------------.
|                |
|  application   |  <-- e.g. MySQL server
|                |
==================  <-- wsrep API
| wsrep provider |  <-- e.g. Galera
`----------------'
        |
        V
 replication to other nodes
So we start off with the wsrep API.
wsrep API is a project to develop a generic replication plugin interface for databases and similar applications. wsrep stands for WriteSet Replication. It defines a set of application callbacks and replication library calls. The library that provides the wsrep API calls is called a wsrep provider and can be dynamically loaded by a wsrep-enabled application.
Global Transaction ID (GTID)
wsrep API describes the following replication model. An application (e.g. a database server) has a state (e.g. contents of the database) which is being modified by clients. This changing of the state is represented as a series of atomic changes (transactions). In a cluster all nodes have the same state which they synchronize by replicating and applying state changes in the same serial order.1)
To this end wsrep API introduces a global transaction ID (GTID) which is used to both
- identify the state change
- identify the state itself by the ID of the last state change.2)
GTID consists of
- a state UUID which uniquely identifies the state and the sequence of changes it undergoes (thus sometimes it may be referred to as sequence or history UUID) and
- an ordinal sequence number (seqno, 64-bit signed integer) to denote the position of the change in the sequence.
Thus a GTID makes it possible to compare application states, establish the order of state changes, and determine whether a change has been applied and whether it is applicable at all to a given state (in short, it is all-powerful). In human-readable form a GTID might look like:
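As an illustration of the UUID:seqno format described above (the UUID and seqno below are made up, not taken from a real cluster):

```
45eec521-2f34-11e0-0800-2a36050b826b:94530586304
```

Here everything before the colon is the state (history) UUID and the number after it is the seqno of the last applied state change.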
Galera wsrep provider
Galera is a wsrep provider that implements true multi-master virtually synchronous replication. Its main features include:
- Virtually Synchronous: a transaction committed on one node is guaranteed to be committed on all other nodes of the cluster. As a result no committed transaction is lost in the event of a node failure.
- True Multi-Master: the same table can be simultaneously modified on all nodes of the cluster. As a result no master failover is required.
- True Parallel Applying: it works even on a single table (beginning with 0.8).
- No Single Point of Failure
Galera cluster is built on top of a proprietary group communication system layer which implements virtual synchrony QoS. Virtual synchrony unifies data delivery and cluster membership service which gives clear formalism regarding message delivery semantics. It also provides total ordering of messages from multiple sources which is very handy in building global transaction IDs in multi-master cluster.
At the transport level Galera cluster is a complete symmetric undirected graph, that is each node is connected to each other node of the cluster by a single TCP connection. By default TCP is used for both message replication and cluster membership service, but beginning with 0.8.0 UDP multicast can be used for replication in LAN.
Node liveness monitoring
Each node in Galera cluster monitors the liveness of all other nodes in the cluster through keepalive messages. The frequency of keepalives and miss sensitivity can be adjusted to tolerate less reliable networks.
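As a sketch of such tuning, the keepalive timings are controlled by evs.* provider options passed through wsrep_provider_options; the values below are illustrative, not recommendations:

```ini
# my.cnf -- illustrative values only; durations are ISO 8601 (PT5S = 5 seconds)
wsrep_provider_options="evs.keepalive_period=PT1S;evs.suspect_timeout=PT5S;evs.inactive_timeout=PT15S"
```

Longer timeouts tolerate flakier networks at the cost of slower failure detection.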
Primary Component (PC)
In addition to single node failures, the cluster may be split into several components due to network failure. A component is a set of nodes which are connected to each other and not to nodes from other components3). In such a situation only one of the components can continue to modify the state, to avoid history divergence. This component is called a primary component (PC), and in normal operation the Galera cluster is one whole PC. When cluster partitioning happens, Galera invokes a special quorum algorithm to select a PC; the algorithm guarantees that there is never more than one primary component in the cluster.
Like any quorum-based system, Galera cluster is subject to a split-brain condition when the quorum algorithm fails to select a primary component. This can happen, for example, in a cluster without a backup switch if the main switch fails. But the most likely split-brain situation is when a single node fails in a two-node cluster. Thus it is strongly advised that the minimum Galera cluster configuration be 3 nodes. In the state transfer section below we will consider one more reason why 3 is the minimum recommended number of nodes.
Virtual Synchrony guarantees consistency but not temporal synchrony, which is generally desired and required for smooth multi-master operation. For that purpose Galera implements its own temporal flow control, which keeps nodes synchronized to within a fraction of a second. It is runtime configurable and can be relaxed for master-slave setups.
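To give an idea of what "runtime configurable" means here, flow control is governed by gcs.* provider options (the values below are illustrative): gcs.fc_limit sets how far a node's receive queue may grow before it throttles the rest of the cluster, and gcs.fc_master_slave relaxes flow control for master-slave setups:

```ini
# my.cnf -- illustrative values only
wsrep_provider_options="gcs.fc_limit=256;gcs.fc_factor=0.9;gcs.fc_master_slave=NO"
```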
Galera supports full multi-master operation, in the sense that all nodes can concurrently modify the same table and still stay consistent. Galera achieves that by detecting conflicts using a certification algorithm and rolling back victim transactions 4). The efficiency of multi-master replication depends on the conflict rate, which in turn depends on the application load profile, load balancing strategies and the number of nodes involved. In some cases reducing the number of nodes which can simultaneously accept writes can improve performance.
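When a transaction is chosen as a conflict victim, the client sees a deadlock error and should simply retry the transaction. For autocommit statements the server can retry on the client's behalf via the wsrep_retry_autocommit variable (the value below is the typical default, shown for illustration):

```ini
# my.cnf -- number of times an autocommit statement is retried after a certification conflict
wsrep_retry_autocommit=1
```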
Since Galera can detect conflicts between writesets, it can also determine which writesets do not conflict and therefore can be applied concurrently. This is the basis of Galera's parallel applying.
Configuration and Status Variables
Series 0.8.x and later
MySQL-wsrep is a patch for the MySQL RDBMS that enables it to use wsrep providers. It can be paired with Galera to create a MySQL/Galera cluster which offers
- High-availability out of the box
- No Single Point Of Failure
- Unsurpassed performance
- Application compatibility
- Ease of use
Currently MySQL/Galera cluster supports only the InnoDB storage engine.
Swap Size Requirements
During normal operation a MySQL/Galera node does not consume much more memory than a regular MySQL server. Additional memory is consumed for the certification index and uncommitted writesets, but normally this should not be noticeable in a typical application. There is one exception though:
- Writeset caching during state transfer. When a node is receiving a state transfer it cannot process and apply incoming writesets because it has no state to apply them to yet. Depending on the state transfer mechanism (e.g. mysqldump) the node that sends the state transfer may not be able to apply writesets either. Thus both nodes need to cache those writesets for a catch-up phase. Currently the writesets are cached in memory and, if the system runs out of memory, either the state transfer will fail or the cluster will block waiting for the state transfer to end.
To control memory usage for writeset caching, check Galera parameters:
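The relevant knobs are the gcs receive queue limits, again set through wsrep_provider_options (a hedged sketch; the values below are illustrative and should be tuned to the memory available on the node):

```ini
# my.cnf -- illustrative values only
# recv_q_soft_limit: fraction of max queue size at which replication rate starts being throttled
# recv_q_hard_limit: maximum allowed receive queue size in bytes
# max_throttle: how much the replication rate may be throttled before giving up
wsrep_provider_options="gcs.recv_q_soft_limit=0.25;gcs.recv_q_hard_limit=2147483647;gcs.max_throttle=0.25"
```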
Bootstrapping a new cluster
To bootstrap a new cluster you need to start the first mysqld server with an empty cluster address URL, specified either in my.cnf or on the command line:
$ mysqld --wsrep_cluster_address=gcomm://
This tells the server that there is no existing cluster to connect to, and it will create a new history UUID.
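The same setting can equally go in my.cnf (a sketch; the provider path is an assumed typical location and varies by distribution):

```ini
[mysqld]
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://
```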
Restarting the server with the same setting will cause it to create a new history UUID again; it won't reconnect to the old cluster. See the next section for how to reconnect to an existing cluster.
Only use an empty gcomm:// address when you want to create a NEW cluster. Never use it when you want to reconnect to an existing one.
Adding another node to a cluster
Once you have a cluster running and you want to add/reconnect another node to it, you must supply the address of one of the cluster members in the cluster address URL. E.g. if the first node of the cluster has the address 192.168.0.1, then adding a second node would look like
$ mysqld --wsrep_cluster_address=gcomm://192.168.0.1 # DNS names work as well
The new node only needs to connect to one of the existing members. It will automatically retrieve the cluster map and reconnect to the rest of the nodes.
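For robustness, newer Galera versions also accept a comma-separated list of members in the cluster address, so the joiner can try them in turn if the first one is down (the addresses below are made up for illustration):

```ini
wsrep_cluster_address=gcomm://192.168.0.1,192.168.0.2,192.168.0.3
```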
Once all members agree on the membership, a state exchange will be initiated, during which the new node will be informed of the cluster state. If its state differs from that of the cluster (which is normally the case) it will request a snapshot of the state from the cluster 5) and install it before becoming ready for use.
State Snapshot Transfer
There are two conceptually different ways to transfer a state from one MySQL server to another:
- Using mysqldump. This requires the receiving server to be fully initialized and ready to accept connections before the transfer. This method is by definition blocking, in that it blocks the donor server from modifying its own state for the duration of the transfer. It is also the slowest of all, which might be an issue in a loaded cluster. This method has been supported in MySQL/Galera since the 0.7 release.
- Copying data files directly. This requires that the receiving server is initialized after the transfer. rsync, xtrabackup and other methods fall into this category. These methods are much faster than mysqldump, but they have certain limitations: they can be used only on server startup, and the receiving server must be configured very similarly to the donor (e.g. innodb_file_per_table should be the same, and so on). Some of these methods, e.g. xtrabackup, can potentially be made non-blocking on the donor. Such methods are supported starting with the MySQL/Galera 0.8 series via a scriptable interface.
MySQL/Galera currently comes with the following SST scripts:

mysqldump

This is the default method. The script runs only on the sending side and pipes mysqldump output to a mysql client connected to the receiving server. Nothing special there.
rsync/rsync_wan (starting with 0.8)
This is probably the fastest way to transfer a server state snapshot from one server to another. The script runs on both the sending and the receiving side. On the receiving side it starts rsync in server mode and waits for a connection from the sender. On the sender side it starts rsync in client mode and sends the contents of the MySQL data directory to the joining node. This method is also blocking, but is much faster than mysqldump. The rsync_wan method is optimized for transfers over WAN by minimizing the amount of data transferred.
To use a particular state transfer method, set (on the joining node) the wsrep_sst_method variable to the name of the method, like:
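For example, to select the rsync method in my.cnf:

```ini
wsrep_sst_method=rsync
```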
State transfer failure
It is easy to see that a failure in state transfer generally renders the receiving node unusable. Therefore, should a failure be detected, the node will abort. Restarting the node after a mysqldump failure may require manually restoring the administrative tables. The rsync method does not have this issue, since it does not need the server to be in an operational state.
Minimal cluster size
It has already been stated above that, in order to avoid the split-brain condition, the minimum recommended number of nodes in a cluster is 3. Blocking state transfer is yet another reason to require a minimum of 3 nodes in order to enjoy service availability in case one of the members fails and needs to be restarted: while two of the members are engaged in state transfer, the remaining member(s) will be able to keep on serving client requests.
Configuration and Monitoring
While every effort is made to get a MySQL/Galera node to work out of the box, a few parameters still need to be set to get it working:
- wsrep_provider — a path to the Galera library.
- wsrep_cluster_address — cluster connection URL.
When using a Galera provider of version >= 2.0, innodb_doublewrite should always be set to 1 (the default).
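Putting the mandatory settings together, a minimal my.cnf fragment might look like this (the provider path and the address are illustrative placeholders, not recommendations):

```ini
[mysqld]
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.0.1
innodb_doublewrite=1
```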
Optional settings (these are just optimizations made relatively safe by synchronous replication — you always recover from another node):
- Wsrep-related status variables in MySQL/Galera server can be queried with the standard
mysql> SHOW STATUS LIKE 'wsrep_%';
query. See this page for detailed variable description.
- A notification command can be defined to be invoked when cluster membership or node status changes. It can communicate the event to some monitoring agent.
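The command is set with the wsrep_notify_cmd variable; the script path below is a hypothetical example, and the script itself must be provided by the administrator:

```ini
# my.cnf -- hypothetical script path; the script receives the event details as arguments
wsrep_notify_cmd=/usr/local/bin/wsrep_notify.sh
```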