What we'll be talking about here is a generic pluggable replication architecture of the form:
        clients
       |   |   |
       V   V   V
  ,----------------.
  |                |
  |  application   |  <-- e.g. MySQL server
  |                |
  ==================  <-- wsrep API
  | wsrep provider |  <-- e.g. Galera
  `----------------'
          |
          V
  replication to other nodes
So we start off with the wsrep API.
wsrep API is a project to develop a generic replication plugin interface for databases and similar applications. wsrep stands for WriteSet Replication. It defines a set of application callbacks and replication library calls. A library that provides the wsrep API calls is called a wsrep provider and can be dynamically loaded by a wsrep-enabled application.
wsrep API describes the following replication model. An application (e.g. a database server) has a state (e.g. the contents of the database) which is being modified by clients. This changing of the state is represented as a series of atomic changes (transactions). In a cluster all nodes have the same state, which they synchronize by replicating and applying state changes in the same serial order. 1)
To this end wsrep API introduces a global transaction ID (GTID), which is used both to identify a state change and to identify the state itself.

GTID consists of a state UUID, which uniquely identifies the state and the sequence of changes it undergoes, and an ordinal sequence number (seqno) denoting the position of the change in that sequence.

Thus GTID makes it possible to compare application states, establish the order of state changes, and determine whether a change was applied and whether it is applicable at all to a given state (in short, it is all-powerful). In human-readable form a GTID might look like 45eec521-2f34-11e0-0800-2a36050b826b:94530586304 (state UUID, then seqno, separated by a colon).
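On a running MySQL/Galera node the two GTID components can be inspected through wsrep status variables; a sketch of the queries, assuming the standard variable names:

```
mysql> SHOW STATUS LIKE 'wsrep_local_state_uuid';  -- state UUID of this node's history
mysql> SHOW STATUS LIKE 'wsrep_last_committed';    -- seqno of the last committed writeset
```

Together the two values form this node's current GTID.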
Galera is a wsrep provider that implements true multi-master, virtually synchronous replication. Its main features, covered below, include virtually synchronous replication built on group communication, a quorum algorithm to handle cluster partitioning, temporal flow control, and multi-master conflict detection by certification with optional parallel applying.
Galera cluster is built on top of a proprietary group communication system layer which implements the virtual synchrony QoS. Virtual synchrony unifies data delivery and the cluster membership service, which gives a clear formalism for message delivery semantics. It also provides total ordering of messages from multiple sources, which is very handy for building global transaction IDs in a multi-master cluster.
At the transport level a Galera cluster is a complete undirected graph: each node is connected to every other node of the cluster by a single TCP connection. By default TCP is used for both message replication and the cluster membership service, but beginning with 0.8.0, UDP multicast can be used for replication on a LAN.
Each node in a Galera cluster monitors the liveness of all other nodes in the cluster through keepalive messages. The frequency of keepalives and the miss sensitivity can be adjusted to tolerate less reliable networks.
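The keepalive timing is controlled through provider options; a sketch assuming the evs.* parameter names used by Galera (all values here are illustrative, not the defaults):

```ini
# my.cnf fragment; values illustrative, durations in ISO 8601 format
wsrep_provider_options="evs.keepalive_period=PT1S; evs.suspect_timeout=PT5S; evs.inactive_timeout=PT15S"
```

Raising the timeouts makes the cluster more tolerant of flaky networks at the cost of slower failure detection.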
In addition to single node failures, the cluster may be split into several components due to a network failure. A component is a set of nodes which are connected to each other but not to nodes from other components 3). In such a situation only one of the components may continue to modify the state, to avoid history divergence. This component is called the primary component (PC), and in normal operation a Galera cluster is one single PC. When cluster partitioning happens, Galera invokes a special quorum algorithm to select the PC, which guarantees that there is never more than one primary component in the cluster.
Like any quorum-based system, a Galera cluster is subject to a split-brain condition when the quorum algorithm fails to select a primary component. This can happen, for example, in a cluster without a backup switch if the main switch fails. But the most likely split-brain situation is a single node failing in a two-node cluster: the surviving node cannot distinguish a peer crash from a network partition, so it cannot claim quorum. Thus it is strongly advised that the minimum Galera cluster configuration is 3 nodes. In the state transfer section below we will consider one more reason why 3 is the minimum recommended number of nodes.
Virtual synchrony guarantees consistency but not temporal synchrony, which is generally desired and required for smooth multi-master operation. For that purpose Galera implements its own temporal flow control, which keeps nodes synchronized to within a fraction of a second. It is runtime-configurable and can be relaxed for master-slave setups.
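Flow control is likewise tuned through provider options; a sketch assuming the gcs.fc_* parameter names (values illustrative):

```ini
# my.cnf fragment; values illustrative
# fc_limit: how many writesets a node may queue up before it asks the cluster to pause
# fc_factor: fraction of fc_limit the queue must drain to before replication resumes
wsrep_provider_options="gcs.fc_limit=32; gcs.fc_factor=0.5"
```

A larger fc_limit relaxes synchrony (useful for master-slave setups, as noted above) at the cost of nodes lagging further behind.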
Galera supports full multi-master operation, in the sense that all nodes can concurrently modify the same table and still stay consistent. Galera achieves this by detecting conflicts with a certification algorithm and rolling back victim transactions 4). The efficiency of multi-master replication depends on the conflict rate, which in turn depends on the application load profile, load balancing strategies and the number of nodes involved. In some cases reducing the number of nodes which can simultaneously accept writes can improve performance.
Since Galera can detect conflicts between writesets, it can also detect which writesets do not conflict and can therefore be applied in parallel. This is the basis of the optional parallel applying feature.
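In MySQL-wsrep, parallel applying surfaces as the number of slave applier threads; a sketch assuming the wsrep_slave_threads variable (the thread count is illustrative):

```ini
# my.cnf fragment; thread count illustrative
# more applier threads let non-conflicting writesets be applied concurrently
wsrep_slave_threads=4
```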
MySQL-wsrep is a patch for the MySQL RDBMS that enables it to use wsrep providers. It can be paired with Galera to create a MySQL/Galera cluster, which offers true multi-master, virtually synchronous replication of MySQL databases.

Currently MySQL/Galera cluster supports only the InnoDB storage engine.
During normal operation a MySQL/Galera node does not consume much more memory than a regular MySQL server. Additional memory is consumed for the certification index and uncommitted writesets, but normally this should not be noticeable in a typical application. There is one exception though: writeset caching during state transfer. A node that cannot apply writesets while a state transfer is in progress (the joining node, and the donor with a blocking transfer method) has to cache them for later, which can cause temporary memory growth.
To control memory usage for writeset caching, check the Galera parameters gcs.recv_q_hard_limit, gcs.recv_q_soft_limit and gcs.max_throttle.
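A sketch of setting these receive-queue limits through provider options (the sizes and ratios are illustrative, not recommendations):

```ini
# my.cnf fragment; values illustrative
# recv_q_soft_limit: fraction of the hard limit at which replication is throttled
# recv_q_hard_limit: maximum receive queue size before writesets are rejected
# max_throttle: how far the replication rate may be throttled down
wsrep_provider_options="gcs.recv_q_soft_limit=0.25; gcs.recv_q_hard_limit=1G; gcs.max_throttle=0.25"
```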
To bootstrap a new cluster you need to start the first mysqld server with the wsrep_new_cluster option, specified either in my.cnf or on the command line:
$ mysqld --wsrep_new_cluster
This tells the server that there is no existing cluster to connect to, and it will create a new history UUID.
Restarting the server with the same setting will cause it to create a new history UUID again; it won't reconnect to the old cluster. See the next section for how to reconnect to an existing cluster.
Older versions used an empty gcomm:// address to create a new cluster. This behaviour is deprecated. Never put an empty gcomm:// address in my.cnf! Use the --wsrep_new_cluster option on the command line instead.
Once you have a cluster running and you want to add or reconnect another node, you must supply the address of one of the cluster members in the cluster address URL. E.g. if the first node of the cluster has address 192.168.0.1, then adding a second node would look like:
$ mysqld --wsrep_cluster_address=gcomm://192.168.0.1 # DNS names work as well
The new node only needs to connect to one of the existing members. It will automatically retrieve the cluster map and connect to the rest of the nodes.
Once all members agree on the membership, a state exchange is initiated, during which the new node is informed of the cluster state. If its state differs from that of the cluster (which is normally the case), the new node requests a state snapshot from the cluster 5) and installs it before becoming ready for use.
There are two conceptually different ways to transfer a state from one MySQL server to another: a logical transfer, which extracts the state in a portable, server-independent form and replays it on the receiving server (e.g. mysqldump), and a physical transfer, which copies the data files verbatim (e.g. rsync).
MySQL/Galera currently comes with the following SST scripts:
mysqldump: This is the default method. The script runs only on the sending side and pipes mysqldump output to a mysql client connected to the receiving server. This method is blocking. Nothing special there.
rsync / rsync_wan: This is probably the fastest way to transfer a server state snapshot from one server to another. The script runs on both the sending and receiving sides. On the receiving side it starts rsync in server mode and waits for a connection from the sender. On the sender side it starts rsync in client mode and sends the contents of the MySQL data directory to the joining node. This method is also blocking, but is much faster than mysqldump. The rsync_wan variant is optimized for transfers over WAN by minimizing the amount of data transferred.
To use a particular state transfer method, set (on the joining node) the wsrep_sst_method variable to the name of the method, like:

$ mysqld --wsrep_sst_method=rsync
It is easy to see that a failure during state transfer generally renders the receiving node unusable. Therefore, should a failure be detected, the node will abort. Restarting the node after a mysqldump failure may require manually restoring the administrative tables. The rsync method does not have this issue, since it does not require the server to be in an operational state.
It's already been stated above that, in order to avoid a split-brain condition, the minimum recommended number of nodes in the cluster is 3. Blocking state transfer is yet another reason to require a minimum of 3 nodes in order to enjoy service availability in case one of the members fails and needs to be restarted: while two of the members are engaged in state transfer, the remaining member(s) will be able to keep on serving client requests.
While every effort is made to get a MySQL/Galera node to work out of the box, a few parameters still need to be set to get it working.
wsrep_provider — a path to the Galera library.

wsrep_cluster_address — the cluster connection URL.
When using a Galera provider of version >= 2.0, innodb_doublewrite should always be set to 1 (the default).
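Put together, a minimal configuration sketch based on the mandatory settings above (the provider path and cluster address are illustrative and depend on your installation):

```ini
# my.cnf fragment; path and address illustrative
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.0.1
innodb_doublewrite=1
```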
Optional settings are optimizations that synchronous replication makes relatively safe, since a failed node can always recover its state from another node.
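One example of such an optimization (an assumption on my part, not mandated above; whether it fits depends on your durability requirements) is relaxing InnoDB's per-commit log flushing, since a crashed node can rebuild its state from the rest of the cluster:

```ini
# my.cnf fragment; relaxes local crash durability,
# relying on recovery from other cluster nodes instead
innodb_flush_log_at_trx_commit=0
```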
The replication status of a node can be checked with the

mysql> SHOW STATUS LIKE 'wsrep_%';

query. See this page for detailed variable descriptions.