Important Monitoring Parameters
All wsrep-related (in this case, Galera) status variables are prefixed with 'wsrep_', so it is easy to check them all at once with
mysql> SHOW STATUS LIKE 'wsrep_%';
Status variables described in sections 3 and 4 below are differential and are reset on every
SHOW STATUS command. It is therefore recommended to execute two
SHOW STATUS commands on the node about a minute apart; the output of the second invocation then corresponds to the current moment.
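For example, to get a reading that covers roughly the last minute of operation (wsrep_flow_control_paused, described in section 3, is used here purely as an illustration):
mysql> SHOW STATUS LIKE 'wsrep_flow_control_paused';
mysql> DO SLEEP(60);
mysql> SHOW STATUS LIKE 'wsrep_flow_control_paused';
The first invocation resets the counter (per the reset behaviour described above), DO SLEEP(60) simply waits a minute in the same session, and the value returned by the second invocation then reflects only that interval.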
1. Checking cluster integrity.
The most important check is whether the node belongs to the right cluster. This is shown by the
wsrep_cluster_state_uuid status variable. It should be the same on all nodes of the cluster; nodes with different
wsrep_cluster_state_uuid values are not connected to each other.
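A quick way to compare, run on every node (the value should be identical everywhere):
mysql> SHOW STATUS LIKE 'wsrep_cluster_state_uuid';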
The next thing to check is whether the node belongs to the same cluster component. This is indicated by the
wsrep_cluster_conf_id status variable, which should also be the same on all nodes. If the nodes have different
wsrep_cluster_conf_id values, the cluster is partitioned. This is a temporary condition and should resolve itself once network connectivity between the nodes is restored.
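Similarly, on every node:
mysql> SHOW STATUS LIKE 'wsrep_cluster_conf_id';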
However, in most cases a much quicker check is the wsrep_cluster_size status variable.
If it is equal to the expected number of nodes, then all cluster nodes are connected. It is sufficient to check this variable on one node only.
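On any single node:
mysql> SHOW STATUS LIKE 'wsrep_cluster_size';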
Finally, it is important to check the primary status of the cluster component that the node is connected to, shown by the wsrep_cluster_status variable.
If it is not
Primary, there has been a partition, and perhaps a split-brain condition, and this component is currently non-operational (due to multiple membership changes and loss of quorum).
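This variable, along with the other cluster integrity variables above, can also be fetched in one call, since they all share the wsrep_cluster prefix:
mysql> SHOW STATUS LIKE 'wsrep_cluster%';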
If no other node in the cluster is connected to a primary component (i.e. all nodes belong to the same component and it is non-primary), the cluster needs to be manually re-bootstrapped: all nodes should be shut down and then restarted, starting with the most advanced one (check the
wsrep_last_committed status variable). Such a situation is very unlikely.
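If such a re-bootstrap is ever needed, the most advanced node can be identified by comparing this value across all nodes before shutting them down:
mysql> SHOW STATUS LIKE 'wsrep_last_committed';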
If, however, another cluster component exists and it is primary, this means a loss of connectivity between the nodes, which must be investigated and repaired. Once connectivity is restored, the nodes from the non-primary component will automatically reconnect and resynchronize with the primary component.
2. Checking node status.
The main status variable that reflects node health is wsrep_ready.
If it is ON, the node can accept SQL load. If not, the wsrep_connected status
variable should be checked. If it is OFF, the node has not yet connected to the cluster (to any of its components). This may be caused by misconfiguration (invalid
wsrep_cluster_name). Check the error log for proper diagnostics.
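For example, to check both flags and the configured cluster name on a suspect node:
mysql> SHOW STATUS LIKE 'wsrep_ready';
mysql> SHOW STATUS LIKE 'wsrep_connected';
mysql> SHOW VARIABLES LIKE 'wsrep_cluster_name';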
If the node is connected but
wsrep_ready == OFF, the cause can be seen from the wsrep_local_state_comment status variable.
In a primary component it is normally one of:
Joining, Waiting for SST, Joined, Synced or Donor. If
wsrep_ready == OFF and the state comment is any of the first three, it means that the node is still in the process of syncing with the cluster.
In a non-primary component the node state comment should be
Initialized. Any other states are transient and momentary.
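To see the current state comment:
mysql> SHOW STATUS LIKE 'wsrep_local_state_comment';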
3. Checking replication health.
The main indicator of replication health is the wsrep_flow_control_paused
status variable. Its range is from 0.0 to 1.0 and it indicates the fraction of time replication was paused since the last
SHOW STATUS command (thus, a value of 1.0 means a complete stop). In other words, it shows how much the cluster is slowed down due to slave lag. This value should be as close to 0.0 as possible. The main ways to improve it are increasing the
wsrep_slave_threads value and dropping slow nodes out of the cluster.
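For example, to sample the indicator and, if needed, raise the applier thread count (4 is purely an illustrative value; on builds where wsrep_slave_threads is not dynamic it has to be set in the configuration file and the node restarted):
mysql> SHOW STATUS LIKE 'wsrep_flow_control_paused';
mysql> SET GLOBAL wsrep_slave_threads = 4;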
The optimal value of
wsrep_slave_threads is suggested by the wsrep_cert_deps_distance status variable:
it shows how many transactions may be applied in parallel on average. There is little sense in making
wsrep_slave_threads much higher than this. The value of wsrep_cert_deps_distance can also be quite high, in the hundreds, so common sense and discretion must be exercised when choosing the value of wsrep_slave_threads.
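To see the suggested parallelism:
mysql> SHOW STATUS LIKE 'wsrep_cert_deps_distance';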
Determining the slowest cluster node
The slowest cluster node will have the highest values of the wsrep_flow_control_sent and wsrep_local_recv_queue_avg status variables.
The lower both of these values are, the better.
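To compare nodes, run on each of them:
mysql> SHOW STATUS LIKE 'wsrep_flow_control_sent';
mysql> SHOW STATUS LIKE 'wsrep_local_recv_queue_avg';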
4. Detecting slow network issues.
High values (on the order of the number of client connections) of the wsrep_local_send_queue_avg
status variable may indicate a bottleneck in the network link. If this is the case, the cause can be at any layer, from the physical layer to the OS configuration, and must be investigated.
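To check it:
mysql> SHOW STATUS LIKE 'wsrep_local_send_queue_avg';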