Galera Flow Control

To ensure temporal synchrony and consistency (as opposed to logical which is provided by virtual synchrony), Galera implements several forms of low control depending on the node state:

(For the purpose of this article terms writeset cache and slave queue are interchangeable.)

OPEN & PRIMARY

The node is not considered to be a part of the cluster, it is not allowed to replicate, apply or cache any writesets. No flow control.

JOINER & DONOR

In general the node can't apply any writesets and needs to cache them. There is no meaningful way to keep it in sync with the cluster (except for stopping all replication). However it is possible to limit replication rate to make sure that writeset cache does not exceed the configured size. It is controlled by the following variables:

  1. gcs.recv_q_hard_limit sets the maximum desired size (in bytes) of the writeset cache. The value for it depends on the amount of RAM, swap size and performance considerations. Default is SSIZE_MAX – 2Gb on 32-bit systems and practically unlimited on 64-bit. If this limit is exceeded and gcs.max_throttle is not 0.0, the node will abort with out-of-memory error. If gcs.max_throttle is 0.0, replication in cluster will be stopped.
  2. gcs.max_throttle is the minimal fraction of the normal replication rate that we can tolerate in the cluster. 1.0 means no replication rate throttling is allowed. 0.0 means complete replication stop is possible. Default is 0.25.
  3. gcs.recv_q_soft_limit is a fraction of gcs.recv_q_hard_limit and serves to estimate the average replication rate. When it is exceeded, average replication rate (in bytes) during this period is calculated. After that replication rate is decreased linearly with the cache size so that at gcs.recv_q_hard_limit it reaches gcs.max_throttle * (average replication rate). Default value is 0.25. NOTE: the “average replication rate” estimated here may be way off from the sustained one. YMMV.

Essentially this makes writeset cache to grow semi-logarithmically with time after gcs.recv_q_soft_limit and time is what is needed for state transfer to complete.

JOINED

JOINED node can apply writesets, so for this node flow control makes sure that the node can eventually catch-up with the cluster, specifically that its writeset cache never grows. Thus the cluster-wide replication rate is limited by the rate at which the node can apply the writesets. Since applying a writeset is normally several times faster than processing a transaction, it normally does not affect the performance of the cluster except in the very beginning when the buffer pool on the node is empty. Using parallel applying can speed it up significantly.

SYNCED

In SYNCED node flow control tries to keep slave queue to a minimum. This is controlled by following configuration variables:

  1. gcs.fc_limit: when slave queue exceeds this limit replication is paused. It is essential that for multi-master configurations this limit is low as certification conflict rate is proportional to the slave queue length. In master-slave setups it can be considerably higher to reduce flow control intervention. Default: 16.
  2. gcs.fc_factor: when slave queue goes below gcs.fc_limit * gcs.fc_factor replication is resumed. Default: 0.5.
  3. gcs.fc_master_slave specifies whether there are more than 1 source of replication. When it is set to NO (multi-master) gcs.fc_limit is modified by the square root of the cluster size (that is in 4-node cluster it is two times higher than the base value). This is done to compensate for the increasing replication rate noise. Default: NO.

While it is critical for multi-master operation to have as small slave queue as possible, slave queue length is not so critical for master-slave setups, since depending on the application and hardware even 1K of writesets may be applied in a fraction of a second. Slave queue length has no effect on master-slave failover.

Note that since Galera nodes process transactions asynchronously with regards to each other, amount of replication data cannot be anticipated in any way. Hence Galera flow control is reactive, i.e. it kicks in only after certain limits are exceeded. It can't prevent exceeding these limits or make any guarantees about by how much these limits will be exceeded. For example, if gcs.recv_q_hard_limit is set to 100Mb, it can still be exceeded by a 1Gb writeset.

External Resources

Login