Galera Flow Control
To ensure temporal synchrony and consistency (as opposed to logical which is provided by virtual synchrony), Galera implements several forms of low control depending on the node state:
(For the purpose of this article terms writeset cache and slave queue are interchangeable.)
OPEN & PRIMARY
The node is not considered to be a part of the cluster, it is not allowed to replicate, apply or cache any writesets. No flow control.
JOINER & DONOR
In general the node can't apply any writesets and needs to cache them. There is no meaningful way to keep it in sync with the cluster (except for stopping all replication). However it is possible to limit replication rate to make sure that writeset cache does not exceed the configured size. It is controlled by the following variables:
gcs.recv_q_hard_limitsets the maximum desired size (in bytes) of the writeset cache. The value for it depends on the amount of RAM, swap size and performance considerations. Default is
SSIZE_MAX– 2Gb on 32-bit systems and practically unlimited on 64-bit. If this limit is exceeded and
gcs.max_throttleis not 0.0, the node will abort with out-of-memory error. If
gcs.max_throttleis 0.0, replication in cluster will be stopped.
gcs.max_throttleis the minimal fraction of the normal replication rate that we can tolerate in the cluster. 1.0 means no replication rate throttling is allowed. 0.0 means complete replication stop is possible. Default is 0.25.
gcs.recv_q_soft_limitis a fraction of
gcs.recv_q_hard_limitand serves to estimate the average replication rate. When it is exceeded, average replication rate (in bytes) during this period is calculated. After that replication rate is decreased linearly with the cache size so that at
gcs.max_throttle * (average replication rate). Default value is 0.25. NOTE: the “average replication rate” estimated here may be way off from the sustained one. YMMV.
Essentially this makes writeset cache to grow semi-logarithmically with time after
gcs.recv_q_soft_limit and time is what is needed for state transfer to complete.
JOINED node can apply writesets, so for this node flow control makes sure that the node can eventually catch-up with the cluster, specifically that its writeset cache never grows. Thus the cluster-wide replication rate is limited by the rate at which the node can apply the writesets. Since applying a writeset is normally several times faster than processing a transaction, it normally does not affect the performance of the cluster except in the very beginning when the buffer pool on the node is empty. Using parallel applying can speed it up significantly.
SYNCED node flow control tries to keep slave queue to a minimum. This is controlled by following configuration variables:
gcs.fc_limit: when slave queue exceeds this limit replication is paused. It is essential that for multi-master configurations this limit is low as certification conflict rate is proportional to the slave queue length. In master-slave setups it can be considerably higher to reduce flow control intervention. Default: 16.
gcs.fc_factor: when slave queue goes below
gcs.fc_limit * gcs.fc_factorreplication is resumed. Default: 0.5.
gcs.fc_master_slavespecifies whether there are more than 1 source of replication. When it is set to NO (multi-master)
gcs.fc_limitis modified by the square root of the cluster size (that is in 4-node cluster it is two times higher than the base value). This is done to compensate for the increasing replication rate noise. Default: NO.
While it is critical for multi-master operation to have as small slave queue as possible, slave queue length is not so critical for master-slave setups, since depending on the application and hardware even 1K of writesets may be applied in a fraction of a second. Slave queue length has no effect on master-slave failover.
Note that since Galera nodes process transactions asynchronously with regards to each other, amount of replication data cannot be anticipated in any way. Hence Galera flow control is reactive, i.e. it kicks in only after certain limits are exceeded. It can't prevent exceeding these limits or make any guarantees about by how much these limits will be exceeded. For example, if
gcs.recv_q_hard_limit is set to 100Mb, it can still be exceeded by a 1Gb writeset.