Yes! It works on Amazon EC2(TM): The Return of DBT2
As you might already know, DBT2 is an OLTP benchmark similar (but not equivalent) to the venerable TPC-C. A couple of months ago I tried it with Galera on Amazon EC2 to see how Galera performed in a non-LAN environment. It did quite well, especially with regard to high availability. Since then Galera has seen a whole lot of bugfixes and improvements, so the time has come to try it once again. This time it will be 'large' AWS instances: almost 8 gigs of RAM and two CPU cores, you know, Big Iron or something.
How would Galera perform in this setup? On the one hand, SMP puts it at a relative disadvantage compared to plain MySQL due to additional serialization during replication. On the other hand, more RAM allows us to drastically increase the warehouse count, reducing the number of certification deadlocks. And according to the AWS website, large instances should have better network performance.
Software Configuration
The log-bin option was turned off. Enabling it roughly halves the score, and it only makes sense with master-slave replication. Since we position a Galera cluster as a transparent stand-alone server replacement, there is no point in turning it on, be it a single server or a Galera cluster. (Between us, our MySQL patch does use its binlog functionality to create writesets. Consider it part of the Galera overhead. Can't help it.)
Besides that, the parameters most critical to the benchmark score are the number of warehouses, the number of client connections and the following my.cnf options:
innodb_buffer_pool_size
innodb_log_file_size
innodb_flush_log_at_trx_commit
This time the amount of RAM permitted a 60-warehouse database. The best NOTPM score of 7220 on plain MySQL 5.1.30 was achieved with 20 connections and
innodb_buffer_pool_size=6G
innodb_log_file_size=1G
innodb_flush_log_at_trx_commit=1
On the other hand, Galera requires some memory for itself, so the parameters for the Galera cluster benchmark were
innodb_buffer_pool_size=5G
innodb_log_file_size=1G
innodb_flush_log_at_trx_commit=0
While innodb_flush_log_at_trx_commit=1 is mandatory for a stand-alone server, in a Galera cluster we can turn it off. Consider it a natural Galera advantage.
The number of connections was treated as a tuning parameter; after all, each run uses a different number of nodes. So I ran the benchmark with different numbers of connections and chose the best score. Surprisingly, this time the best value followed the simple formula of 12*number_of_nodes.
Results
               Conns   NOTPM    Rollbacks(%)   TRX duration(sec)   Dump load(min)
----------------------------------------------------------------------------------
Plain 5.1.30:  20      ~7220    1              2.27                26
1 node      :  12      ~7420    1              2.17                30
2 nodes     :  24      ~9630    3              1.63                36
3 nodes     :  36      ~10555   4              1.41                38
4 nodes     :  48      ~10753   5              1.32                38
The following figure shows results in units of plain 5.1.30 performance (except for rollbacks percentage):
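The same normalization can be reproduced directly from the table. This is a minimal Python sketch, not the script that produced the figure: the dictionary keys and the function name relative_to_baseline are made up for illustration, and only the numbers come from the table (the NOTPM values are approximate there, too).

```python
# Values copied from the results table; NOTPM scores are approximate ("~").
RESULTS = {
    "Plain 5.1.30": {"notpm": 7220, "trx_sec": 2.27, "load_min": 26},
    "1 node": {"notpm": 7420, "trx_sec": 2.17, "load_min": 30},
    "2 nodes": {"notpm": 9630, "trx_sec": 1.63, "load_min": 36},
    "3 nodes": {"notpm": 10555, "trx_sec": 1.41, "load_min": 38},
    "4 nodes": {"notpm": 10753, "trx_sec": 1.32, "load_min": 38},
}


def relative_to_baseline(results, baseline="Plain 5.1.30"):
    """Divide every metric by the same metric of the baseline row."""
    base = results[baseline]
    return {
        name: {metric: row[metric] / base[metric] for metric in row}
        for name, row in results.items()
    }


if __name__ == "__main__":
    for name, row in relative_to_baseline(RESULTS).items():
        print(f"{name:13s} NOTPM x{row['notpm']:.2f}  "
              f"duration x{row['trx_sec']:.2f}  load x{row['load_min']:.2f}")
```

Rollback percentages are left out on purpose, since the figure does not normalize them either.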

Discussion
First off, an explanation for the single node scoring higher than plain MySQL: it's innodb_flush_log_at_trx_commit. Plain MySQL with innodb_flush_log_at_trx_commit turned off scores 7610 NOTPM (and at 12 connections at that).
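To put numbers on that, here is a small Python sketch using only the three scores quoted in this post. The helper pct_change is made up for illustration, and treating the whole remaining gap as Galera overhead is an assumption: the Galera run also had a smaller buffer pool (5G instead of 6G).

```python
# NOTPM scores quoted in the post.
PLAIN_FLUSH_ON = 7220    # plain MySQL, innodb_flush_log_at_trx_commit=1
PLAIN_FLUSH_OFF = 7610   # plain MySQL, flush at commit turned off
GALERA_ONE_NODE = 7420   # single Galera node, innodb_flush_log_at_trx_commit=0


def pct_change(new, old):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Gain from not flushing the log at every commit on plain MySQL.
flush_gain = pct_change(PLAIN_FLUSH_OFF, PLAIN_FLUSH_ON)

# Gap between a single Galera node and plain MySQL with the same flush setting.
galera_gap = -pct_change(GALERA_ONE_NODE, PLAIN_FLUSH_OFF)

if __name__ == "__main__":
    print(f"flush off gain:    {flush_gain:.1f}%")
    print(f"single node gap:   {galera_gap:.1f}%")
```

With these inputs the flush setting is worth about 5% and the single node trails the comparable plain MySQL score by roughly 2.5%.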
Here we can draw several meaningful conclusions in comparison with the previous benchmark:
- With bigger datasets Galera overhead apparently becomes negligible. A single node scores almost identically to plain MySQL. This can be explained by SELECTs becoming much heavier. Take a look at the average transaction duration: it is almost twice as long despite having 2 CPU cores. (Twice as many cores, 4 times bigger tables = twice as long transactions.) At the same time the transaction changeset remained the same.
- Increasing the dataset and the number of warehouses considerably improved the situation with rollbacks. Now it is quite tolerable, I think.
- Despite a much lower certification conflict rate, 1 to 4 node scalability is much worse. The cause is complex:
  - One thing is the much lower relative Galera overhead, which
    - makes single node performance so good;
    - makes replication delays more prominent.
  - Galera still makes bad use of SMP. Slave writesets are applied in a single thread (alas, parallel applying turned out to be buggy), and in a 4-node cluster the ratio of slave to local transactions is, as you might guess, 3 to 1. Even though applying a slave writeset is much faster than executing the local transaction, at a 3 to 1 ratio it might become a significant fraction of the work, done in a single thread with a bunch of synchronization points. We must work harder here, goddamnit.
  - We might be reaching the limit of our group communication throughput on the Amazon network. Since we have to use TCP, it does not scale, and 10000 NOTPM means roughly 700 transactions per second (New Order transactions constitute about 45% of the total load). Multiply that by a round-trip time of 0.3 ms and by 4 nodes and you get 0.84 s out of every second: you can't go much faster than this.
  Those conclusions (poor SMP usage, network bottleneck) are backed up by the observation that in a 4-node cluster the CPU load as shown by top was roughly as follows: CPU0 30%, CPU1 10%, i.e. the CPUs were heavily underloaded and the load was distributed very unevenly.
- Transaction duration calculation in DBT2 is crap. It is always inversely proportional to the NOTPM score regardless of the number of concurrent connections. Oh, well, Alex, go and fix it. And don't come back until it's done.
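The group communication estimate from the list above can be written out explicitly. This is a back-of-the-envelope Python sketch under one assumption that is my reading of the arithmetic, not a measured property of Galera: every transaction costs one serialized round trip per node. The function names are made up; the inputs (700 transactions per second, 0.3 ms round trip, 4 nodes) are the figures quoted in the post, taken as given.

```python
def comm_seconds_per_second(tps, round_trip_sec, nodes):
    """Serialized round-trip time consumed per wall-clock second."""
    return tps * round_trip_sec * nodes


def tps_ceiling(round_trip_sec, nodes):
    """Transaction rate at which round trips alone fill the whole second."""
    return 1.0 / (round_trip_sec * nodes)


if __name__ == "__main__":
    busy = comm_seconds_per_second(tps=700, round_trip_sec=0.0003, nodes=4)
    print(f"communication time per second: {busy:.2f} s")
    print(f"ceiling: {tps_ceiling(0.0003, 4):.0f} transactions per second")
```

Under this simple model the 4-node cluster spends about 0.84 s of every second on round trips, and the ceiling sits a little above 800 transactions per second, which is why there is so little headroom left.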
Conclusions
On a LAN and/or with a bigger (300 warehouses anyone?) dataset Galera would really shine. "Designed for enterprise-scale applications", he-he.
