Important Monitoring Parameters

All wsrep-related (in this case, Galera) status variables are prefixed with 'wsrep_', so they can all be listed with

mysql> SHOW STATUS LIKE 'wsrep_%';

The status variables described in section 3 and below are differential: they are reset by every SHOW STATUS command. It is therefore recommended to execute SHOW STATUS twice on the node, about 1 minute apart; the output of the second invocation then describes the interval that has just elapsed.
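The two-sample procedure could be scripted roughly as follows. This is only a sketch: fetch_status is a hypothetical callable (not part of any MySQL or Galera API) that is assumed to run SHOW STATUS LIKE 'wsrep_%' on the node and return the rows as a dict. The 60-second default mirrors the interval suggested above.

```python
import time


def sample_status(fetch_status, interval=60.0, sleep=time.sleep):
    """Return wsrep status values covering roughly the last `interval` seconds.

    fetch_status is a hypothetical callable that runs
    SHOW STATUS LIKE 'wsrep_%' and returns a {name: value} dict.
    """
    fetch_status()          # first call resets the differential counters
    sleep(interval)         # let the counters accumulate over the interval
    return fetch_status()   # second call describes the interval just elapsed
```

The sleep parameter is injectable only so that the function can be exercised without actually waiting.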

1. Checking cluster integrity.

The most important check is whether the node belongs to the right cluster. This is shown by the

wsrep_cluster_state_uuid

variable. It should be the same on all nodes of the cluster. Nodes with different wsrep_cluster_state_uuid values are not connected to the same cluster.

The next thing to check is whether the node belongs to the same component. This is indicated by the

wsrep_cluster_conf_id

variable. It should also be the same on all nodes. If nodes report different wsrep_cluster_conf_id values, they are partitioned. This is a temporary condition and should resolve itself when network connectivity between the nodes is restored.

However, in most cases a much quicker check would be

wsrep_cluster_size

If it is equal to the expected number of nodes, then all cluster nodes are connected. It is sufficient to check this variable only on one node.

Finally it is important to check the primary status of the cluster component that the node is connected to:

wsrep_cluster_status

If it is not Primary, there was a partition, and perhaps a split-brain condition, and this component is currently inoperable (due to multiple membership changes and loss of quorum).

If no other node in the cluster is connected to a primary component (i.e. all nodes belong to the same component and it is non-primary), the cluster needs to be manually rebootstrapped: all nodes should be shut down and then restarted, starting with the most advanced one (check the wsrep_last_committed status variable). Such a situation is very unlikely.
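Picking the node to restart first can be sketched as below. The function name and the input dict are illustrative assumptions: last_committed is taken to map each node name to the wsrep_last_committed value recorded on that node before shutdown.

```python
def most_advanced_node(last_committed):
    """Return the name of the node with the highest wsrep_last_committed.

    last_committed maps node name -> wsrep_last_committed (string or int).
    Values are converted to int because SHOW STATUS returns strings, and
    comparing strings would rank "999" above "1042".
    """
    return max(last_committed, key=lambda name: int(last_committed[name]))
```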

If, however, another cluster component exists and it is primary, connectivity between the nodes has been lost; the cause must be investigated and connectivity restored. After restoration, the nodes from the non-primary component will automatically reconnect and resynchronize with the primary component.
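The checks in this section could be combined into one routine along these lines. It is a sketch under assumptions: the function name and the nodes structure are made up for illustration, and how the per-node status dicts are collected is left out.

```python
def check_cluster_integrity(nodes, expected_size):
    """Return a list of problems found; an empty list means all checks passed.

    nodes maps node name -> {status variable: value}, with values as the
    strings returned by SHOW STATUS (collection is outside this sketch).
    """
    problems = []
    # Both identifiers must be identical on every node.
    for var in ("wsrep_cluster_state_uuid", "wsrep_cluster_conf_id"):
        if len({status[var] for status in nodes.values()}) > 1:
            problems.append(f"{var} differs between nodes")
    for name, status in sorted(nodes.items()):
        size = int(status["wsrep_cluster_size"])
        if size != expected_size:
            problems.append(
                f"{name}: wsrep_cluster_size is {size}, expected {expected_size}"
            )
        if status["wsrep_cluster_status"] != "Primary":
            problems.append(
                f"{name}: component is {status['wsrep_cluster_status']}"
            )
    return problems
```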

2. Checking node status.

The main status variable that reflects node health is

wsrep_ready

If it is ON, the node can accept SQL load. If not, the

wsrep_connected

variable should be checked. If it is OFF, the node has not yet connected to the cluster (to any of its components). This may be caused by misconfiguration (an invalid wsrep_cluster_address and/or wsrep_cluster_name). Check the error log for proper diagnostics.

If the node is connected but wsrep_ready == OFF, the cause can be seen from

wsrep_local_state_comment

In a primary component it is normally one of: Joining, Waiting for SST, Joined, Synced and Donor. If wsrep_ready == OFF and the state comment is any of the first three, the node is still in the process of syncing with the cluster.

In a non-primary component the node state comment should be Initialized. Any other states are transient and momentary.
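The decision sequence for a single node might be sketched as follows. The function name and return strings are illustrative, and the state names are taken verbatim from the text above; the exact wsrep_local_state_comment strings may differ between Galera versions, so treat the set below as an assumption.

```python
# State comments that indicate the node is still syncing (taken from the
# text above; exact strings may vary between Galera versions).
SYNCING_STATES = {"Joining", "Waiting for SST", "Joined"}


def diagnose_node(status):
    """Classify a node from its wsrep status dict ({name: value} strings)."""
    if status.get("wsrep_ready") == "ON":
        return "ready"
    if status.get("wsrep_connected") != "ON":
        return ("not connected: check wsrep_cluster_address, "
                "wsrep_cluster_name and the error log")
    comment = status.get("wsrep_local_state_comment", "")
    if comment in SYNCING_STATES:
        return f"syncing with the cluster ({comment})"
    if comment == "Initialized":
        return "connected to a non-primary component"
    return f"not ready, state: {comment}"
```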

3. Checking replication health.

The main indicator of replication health is

wsrep_flow_control_paused

status variable. Its range is from 0.0 to 1.0, and it indicates the fraction of time replication was paused since the last SHOW STATUS command (a value of 1.0 means a complete stop). In other words, it shows how much the cluster is slowed down by slave lag. This value should be as close to 0.0 as possible. The main ways to improve it are increasing the wsrep_slave_threads value and dropping slow nodes from the cluster.

The optimal value of wsrep_slave_threads is suggested by

wsrep_cert_deps_distance

This is how many transactions may be applied in parallel on average. There is little sense in setting wsrep_slave_threads much higher than this. The value can also be quite high (in the hundreds), so common sense and discretion must be exercised when choosing the value of wsrep_slave_threads.
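One way to turn this advice into a number is sketched below. The function and its cap parameter are assumptions made for illustration: cap stands for whatever upper bound the operator considers sensible for the hardware and is not a Galera setting.

```python
def suggest_slave_threads(cert_deps_distance, cap):
    """Suggest a wsrep_slave_threads value between 1 and `cap`.

    cert_deps_distance is the observed wsrep_cert_deps_distance.
    cap is an operator-chosen upper bound (an assumption of this sketch).
    """
    # Round down: threads beyond the average parallelism bring little benefit.
    return max(1, min(int(cert_deps_distance), cap))
```

For example, with an observed distance of 23.8 and a cap of 16 the suggestion is 16, while a distance of 3.2 yields 3.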

Determining the slowest cluster node

The slowest cluster node will have the highest values in the following variables:

wsrep_flow_control_sent

and

wsrep_local_recv_queue_avg

The lower both of these values are, the better.
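Ranking nodes by these two variables could look like the sketch below. The function name and input structure are hypothetical, and using wsrep_flow_control_sent as the primary sort key with wsrep_local_recv_queue_avg as the tie-breaker is a choice of this example, not something the text prescribes.

```python
def rank_slowest(nodes):
    """Return node names ordered from slowest to fastest.

    nodes maps node name -> {status variable: value} (strings as returned
    by SHOW STATUS). Sort key: wsrep_flow_control_sent, then
    wsrep_local_recv_queue_avg, both descending.
    """
    def key(name):
        status = nodes[name]
        return (int(status["wsrep_flow_control_sent"]),
                float(status["wsrep_local_recv_queue_avg"]))

    return sorted(nodes, key=key, reverse=True)
```

Both values are differential, so they should be sampled over the same interval on every node before being compared.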

4. Detecting Slow Network Issues.

High (on the order of the number of client connections) values in

wsrep_local_send_queue_avg

status variable may indicate a bottleneck in the network link. If this is the case, the cause can be at any layer, from the physical layer to OS configuration, and must be investigated.
