Frequently Asked Questions
Q: What is Galera?
A: Galera is a wsrep service provider in the form of a dlopen-able library. It provides synchronous replication guarantees and supports multi-master replication. Among other things, Galera is capable of unconstrained parallel applying (AKA “parallel replication”), multicast replication and automatic node provisioning. Galera's primary focus is data consistency: a transaction is either applied on every node or not at all. See the CAP theorem for the consequences.
Galera is not
- cluster manager
- load balancer
- cluster monitor
What it does is keep databases synchronized, provided that they were properly configured and synchronized in the beginning.
Q: No, what is "galera"?
A: “Galera” is another (more original) name for a “galley”. Galleys were propelled by oars and “scaled” up to hundreds of rowers, who had to act synchronously lest the oars intertwine and get blocked.
Q: How do I do a failover?
A: Galera is a true multi-master synchronous replication system which allows any or all of the nodes to be used as a master at any time without any extra provisions, so there is no “failover” in the “traditional” MySQL master-slave sense. Galera's foremost goal is data consistency across the nodes, and it simply won't allow any modifications to the database if they might compromise consistency. E.g. it will block or reject write requests until a joining node is synced with the cluster and ready to process local requests.
As a result you can safely use your favorite approach to distribute/migrate connections between the nodes without the risk of causing inconsistency. Some discussion on connection distribution can be found here.
Q: How do I upgrade the cluster?
A: A cluster upgrade can be done by applying the following steps to each cluster node in turn:
- Shutdown the node
- Upgrade software
- Join the node to the cluster
Of course, you should take care of transferring client connections from the node being upgraded to the other nodes for the duration of the upgrade.
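As a sketch, assuming a systemd-managed service named mysql and Debian-style packaging (the service and package names below are examples; adjust them for your distribution), the per-node steps might look like:

```shell
# Rolling-upgrade sketch for ONE node; repeat per node, one at a time.
# Drain client connections away from this node before starting.

# 1. Shut down the node gracefully
sudo systemctl stop mysql

# 2. Upgrade the software (package names are examples)
sudo apt-get update && sudo apt-get install --only-upgrade galera-4 mysql-wsrep-server

# 3. Rejoin the cluster; the node will sync via IST/SST as needed
sudo systemctl start mysql

# 4. Wait until the node reports Synced before upgrading the next one
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
```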
Q: What InnoDB Isolation Levels does Galera support?
A: All levels can be used, because locally, within one node, transaction isolation works as with native InnoDB.
But cluster-wide, between transactions processed on separate nodes, Galera implements a transaction isolation level called “SNAPSHOT ISOLATION”. The SNAPSHOT ISOLATION level is between the REPEATABLE READ and SERIALIZABLE levels.
The SERIALIZABLE level cannot be guaranteed in the multi-master use case, because Galera replication does not carry the transaction read set. Also, a SERIALIZABLE transaction is vulnerable to multi-master conflicts: it holds read locks, and any replicated write to a read-locked row will cause the transaction to abort. Hence it is recommended not to use it in Galera Cluster.
Q: SELECT ... FROM ... returns "Unknown command"?
A: This happens when you explicitly specify the wsrep_provider variable, i.e. demand the use of wsrep provider services, but the provider for some reason rejects the request. With Galera this may be because the node is not connected to the cluster's primary component (wsrep_cluster_address unset, networking issues, etc.). In such a case the node is considered unsynced with the global state and unable to serve SQL requests except SHOW. If you wish to bypass this check, switch the wsrep service off with:
mysql> SET wsrep_on=0;
This will make mysqld ignore the wsrep_provider setting and behave as a standalone MySQL server. Note that this may lead to data inconsistency with the rest of the cluster (which may be a desirable effect, e.g. when modifying “local” tables).
Q: How are DDLs handled by Galera?
A: For DDLs and similar queries Galera has a special mode of execution called Total Order Isolation (TOI), where the query is replicated in statement form before being executed on the master, waits for all preceding transactions to commit, and then gets executed in isolation on all nodes simultaneously. The isolation can be of one of the following levels:
- Server-level, where no other transactions can be applied concurrently (for CREATE SCHEMA, GRANT and similar queries).
- Schema-level, where no transaction accessing the schema can be applied concurrently (for CREATE TABLE and similar queries).
- Table-level, where no transaction accessing the table can be applied concurrently (for ALTER TABLE and similar queries).
There are several particularities about TOI queries:
- they never conflict with preceding transactions since they are executed only after all preceding transactions were committed, hence their certification interval is of 0 length. That means that they will never fail certification and are guaranteed to be executed.
- certification happens on a resource level, which means that e.g. for server-level isolation any transaction that has TOI query in its certification interval will fail certification.
- Error handling: since a TOI query is replicated before execution, there is no way to know whether it will succeed or fail. So error checking on TOI queries is essentially turned off.
Q: In a 2-node cluster, one node crashed/disconnected and then the remaining node stopped working. What gives?
A: In a two-node cluster, when one node ungracefully leaves the cluster, the remaining node is unable to tell whether the disappeared node has actually crashed or has just lost connectivity, without additional information from a third party (like a human operator or another Galera node). Such a situation is called split-brain, because if, in the case of connection loss, the nodes continue to operate without being able to synchronize, their databases will become inconsistent. Thus, when a node loses connection to its only peer, it stops accepting queries to avoid database inconsistency. To bring it back to operational mode it must be instructed that it is the only node in the cluster by setting the wsrep_cluster_address variable accordingly:
mysql> SET GLOBAL wsrep_cluster_address='gcomm://';
If it is desirable to allow the node to operate even in a split-brain situation (for example when the cluster is used in master-slave mode, so it is guaranteed that only one node accepts writes), this can be enabled by setting the pc.ignore_sb Galera option, like:
wsrep_provider_options="pc.ignore_sb = yes"
in my.cnf file.
Another option is to use Galera Arbitrator as a 3rd “node” to avoid split-brain situation altogether.
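For reference, a minimal garbd invocation might look like the following sketch (hostnames and the cluster name are examples; the group name must match the wsrep_cluster_name of the data nodes):

```shell
# Run Galera Arbitrator on a third host so a 2-node cluster keeps
# quorum if one data node fails. garbd joins group communication
# and votes in quorum calculations, but stores no data.
garbd --address "gcomm://node1:4567,node2:4567" \
      --group my_wsrep_cluster \
      --daemon
```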
Q: Nothing works, damnit!
A: Real life experience shows that usually it is due to one or more of the following 3 reasons:
- SELinux settings (mostly RHEL/CentOS)
- AppArmor settings (Ubuntu)
- Iptables settings (any distro)
While it is beyond this document to detail how to properly configure any of the above, here are some quick tips to get you going:
To disable SELinux temporarily, use the following command (as root):
# setenforce 0
To disable it permanently, set SELINUX=disabled in /etc/selinux/config and reboot.
This seems to be an Ubuntu-only thing. Disable AppArmor for mysqld:
$ cd /etc/apparmor.d/disable/
$ sudo ln -s /etc/apparmor.d/usr.sbin.mysqld
$ sudo service apparmor restart
By default a MySQL/Galera node needs ports 3306 and 4567 to be open for connections from the other nodes. This can be done with something like the following (allows connections from the LAN):
# iptables -A INPUT -i eth0 -p tcp -m tcp --source 192.168.0.1/24 --dport 3306 -j ACCEPT
# iptables -A INPUT -i eth0 -p tcp -m tcp --source 192.168.0.1/24 --dport 4567 -j ACCEPT
Substitute the real IP address and netmask of your node's network.
Q: What does 'commit failed for reason: 3' mean?
A: It means that from time to time a slave thread tries to apply a replicated writeset and finds a lock conflict with a local transaction (which may already be in the commit phase). The local transaction is aborted so that the slave thread can proceed. This is a consequence of optimistic (expecting no row conflicts) transaction execution and is expected in a multi-master configuration, so you might see any of the following messages:
110906 17:45:01 [Note] WSREP: BF kill (1, seqno: 16962377), victim: (140588996478720 4) trx: 35525064
110906 17:45:01 [Note] WSREP: Aborting query: commit
110906 17:45:01 [Note] WSREP: kill trx QUERY_COMMITTING for 35525064
110906 17:45:01 [Note] WSREP: commit failed for reason: 3, seqno: -1
To avoid such conflicts there are two options:
- use the cluster in a master-slave fashion (direct all writes to a single node)
- intelligently split the write load between master nodes so that potentially conflicting transactions (e.g. those updating the same tables) go to the same node. (BTW, that may also improve buffer pool utilization)
Q: Is GCache some sort of "binlog"?
A: “Sort of” - maybe, but not really. GCache is a memory allocator for writesets, and its primary purpose is to minimize the writeset footprint in RAM. It is not a “log” of events, it is a cache. What is the difference?
- GCache is not persistent.
- Not every entry in GCache is a writeset.
- Not every writeset in GCache will be committed.
- Writesets in GCache are not allocated in commit order.
- A writeset is not an optimal entry for a binlog, since it contains extra information.
However it is possible to construct a binlog out of GCache contents.
Q: I noticed the following message in logs: "Node 0 (XXX) requested state transfer from '*any*'. Selected 1 (XXX) as donor."
A: If the state transfer donor is not set explicitly in the wsrep_sst_donor variable, the group communication module will select the donor based on its information about node states.
The group communication module monitors node states for the purposes of flow control, state transfer and quorum calculations. E.g. a JOINING node does not count towards flow control and quorum.
A node is considered fit to be a donor if it is in the SYNCED state. Theoretically a JOINED node could do too, but it first has to process its slave queue, after which it becomes SYNCED anyway. So the donor is chosen among the SYNCED nodes (the first one in the index) and it immediately becomes DONOR, i.e. not available for subsequent requests. The full node state machine can be found here. There is no guarantee about the order of the nodes in the index.
If there are no free SYNCED nodes at the moment, the joining node will report
Requesting state transfer failed: -11(Resource temporarily unavailable). Will keep retrying every 1 second(s)
and will keep on retrying the state transfer request.
Q: My nodes return "Unknown command" to every query. What has happened and how do I fix it?
A: Whatever happened, your node(s) no longer believe that they are part of the primary cluster component (i.e. they suspect that there may be another primary cluster component which they have no connection to). This could be the result of a network failure, a crash of half or more of the nodes, split-brain, etc.
If you are certain that no other nodes of your cluster play the primary component role, you can re-bootstrap the primary component as follows:
- Choose the most up-to-date node. This can be done by checking the output of SHOW STATUS LIKE 'wsrep_last_committed' on each node; choose the node with the highest value.
- Run SET GLOBAL wsrep_provider_options='pc.bootstrap=yes' on it.
- The component this node is part of will become the primary component; all nodes in it will synchronize to the most up-to-date one and start accepting SQL requests once again.
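The procedure above boils down to two statements; as a sketch:

```sql
-- Run on EACH node and compare the values; the highest wins.
SHOW STATUS LIKE 'wsrep_last_committed';

-- Run ONLY on the most up-to-date node to re-bootstrap
-- the primary component.
SET GLOBAL wsrep_provider_options='pc.bootstrap=yes';
```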
Q: I run OLTP sysbench benchmark against Galera cluster and get "Err1062 Duplicate entry 'XXXXXX' for key 'PRIMARY'". How to solve it?
A: Use a recent version of sysbench (>= 0.5) and apply the --oltp-auto-inc=off option when running your sysbench benchmarks.
You can find explanation and more information in the related discussion thread.
Q: User details (name, host, password) changes are not replicated to the cluster.
A: It seems that you have tried to change user details by updating the mysql.user table directly. If you use GRANT instead, it will be replicated correctly.
Currently replication works only with the InnoDB storage engine. Writes to tables of other types, including system (mysql.*) tables, are not replicated. However, DDL statements are replicated at the statement level, and changes to mysql.* tables get replicated that way. So you can safely issue CREATE USER ... or GRANT ..., but issuing INSERT INTO mysql.user ... will not be replicated. In general, non-transactional engines cannot be supported in multi-master replication.
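To illustrate the distinction (user, host and schema names below are made-up examples): statements that go through statement-level DDL replication propagate to the whole cluster, while direct writes to mysql.* do not:

```sql
-- Replicated to the whole cluster (statement-level DDL replication):
CREATE USER 'app'@'10.0.0.%' IDENTIFIED BY 'secret';
GRANT SELECT, INSERT ON mydb.* TO 'app'@'10.0.0.%';

-- NOT replicated: a direct write to the non-transactional
-- mysql.user table takes effect only on the local node.
INSERT INTO mysql.user (User, Host) VALUES ('app2', '%');
```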
Q: Cluster stalls when running ALTER on a table which is not used.
A: Unfortunately this is a side effect of being multi-master and having several appliers: in order to deterministically detect conflicts and schedule parallel appliers, we need to control when a DDL ends relative to other transactions, so effectively it needs to be executed in isolation. Galera has a 65K-writeset window within which transactions can be applied in parallel, but if the ALTER takes too long, the cluster has to wait.
In general there is not much that can be done at the moment. But if you can guarantee that no other session will try to modify the table AND that no other DDLs are running, you can set wsrep_OSU_method=RSU, run the required ALTER command, and reset wsrep_OSU_method=TOI, doing this on each node in turn.
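As a sketch, the per-node RSU sequence might look like this (the table and column names are examples):

```sql
-- Rolling Schema Upgrade: the DDL is executed locally only and is
-- NOT replicated, so this sequence must be repeated on every node.
SET SESSION wsrep_OSU_method='RSU';

ALTER TABLE mydb.big_table ADD COLUMN note VARCHAR(64);

-- Restore the default Total Order Isolation mode.
SET SESSION wsrep_OSU_method='TOI';
```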