What's The Difference, Kenneth?
Submitted by alex on Tue, 03/13/2012 - 04:23
Uh, it's that time of the year again. You know, the MySQL User Conference and stuff. And again we need to bring some news about Galera.
What's new about Galera? Everybody knows that it kicks ass, so that's nothing new. So I've been thinking...
There have been quite a few questions about how Galera is different from NDB (MySQL Cluster). Well, how is it not different?
To begin with, NDB has 3 (three) types of nodes and insists on data partitioning, whereas the Galera philosophy is conceptually the opposite: all Galera nodes are full replicas and fully interchangeable. This makes the cluster architectures hard to compare directly.
Secondly, NDB is a concrete database engine with its own specifics, while Galera is an abstract replication library that can be used to replicate any application. In this case we have patched MySQL/InnoDB to make a multi-master Galera cluster, so it is still MySQL/InnoDB with practically unchanged semantics. That makes for another difference: InnoDB vs. NDB.
Well, people mostly care about quantitative differences, and there have been a number of requests for comparative benchmarks. Ok, we'd love to see those too, however:
- it would be quite hard to make an apples-to-apples comparison, given the difference in cluster structure;
- we'd rather have some third party make such benchmarks, since we're not impartial.
Yet it is better to have something to show to potential clients than nothing. Besides, there have been rumours that Galera outperforms NDB, and even that NDB does not perform that well at all. But those were rumours with little to no evidence. Even more conspicuously, I failed to find any substantial NDB benchmark results on the web. Like with Sysbench. There are a couple of posts by Mikael Ronstrom about how to speed up Sysbench on NDB, but no actual benchmarks. Go figure: NDB is supposed to be quite a popular, cool product, and one would imagine there is no lack of quantitative information on its performance. This was quite a surprise for me, and paired with the aforementioned rumours it smelled of conspiracy.
So, since no one else had dared to lead us out of this ignorance, I finally decided to do my own NDB vs. Galera benchmark.
To be clear, this is going to be a layman's benchmark. Which I think is quite reasonable, given that it is not intended for NDB gurus (those know everything already). If you know how to do it better - you're welcome to do it and share your results.
First I'll have to figure out what exactly I shall compare. NDB has three node types: management, SQL and data, while Galera has just one: a node. What is the NDB equivalent of a 4-node Galera cluster then?
The closest thing I could come up with is as follows: a host running both an SQL node (mysqld) and a data node (ndbmtd) will constitute a single "NDB node". There will be a single management node, running elsewhere (on the client machine). This is more like apples to oranges, but at least they are both fruit and round. A crash of one "NDB node" is pretty much equivalent to a crash of a Galera node. Moreover, 4 NDB data nodes mean two partitions of 2 nodes each. That means that we can kill 2 out of 4 "NDB nodes" (one from each partition) and the cluster will still remain functional. So in the end it will allow us to compare 4-, 3- and 2-node NDB and Galera clusters.
The hardware will be EC2 m1.large instances (7.5Gb RAM, 2 cores). Why not? After all, the cloud is mainstream now, and more often than not you'll be considering running your databases in a virtual environment. (I know what you think: they recommend using Dolphin interconnect with NDB, no less. Well, I guess you know what I think.)
The same instances were used for both NDB and Galera clusters.
I used the ClusterConfigurator by Severalnines to set up the NDB 7.2.4 cluster. The main parameters were:
- Cluster Usage: 1 (low write/high read)
- Number of tables: 4096
One thing to note here is that we have 2 partitions of 4100Mb each, which gives us roughly 8Gb capacity for Sysbench tables.
The Galera (InnoDB) node configuration is so simple that it is worth quoting the whole my.cnf here:
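A minimal sketch of such a configuration - all addresses and paths are placeholders, and apart from the 6Gb InnoDB buffer pool every value here is an assumption typical of a Galera node, not the actual file:

```ini
# Illustrative Galera node my.cnf; addresses and paths are placeholders.
[mysqld]
datadir=/var/lib/mysql
binlog_format=ROW                    # required by Galera replication
innodb_buffer_pool_size=6G           # so ~75% of the data set fits in memory
innodb_autoinc_lock_mode=2           # required for multi-master operation
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://10.0.0.1,10.0.0.2,10.0.0.3,10.0.0.4
wsrep_slave_threads=4
```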
It turns out that in InnoDB a 2 million row Sysbench table takes almost exactly 500Mb. So I decided to try whether 16 2M-row tables would fit into the NDB data space. It turned out that they just fit: according to ClusterControl they occupied 93% of the memory space. To be on the safe side I decided to drop one table to bring data memory utilisation down to 88%. So all benchmarks were performed on 15 2M-row tables.
For load balancing I used glbd running on the client machine.
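For reference, a sketch of how glbd might have been launched on the client machine. The node addresses and thread count are made up; the 3307 listen port matches the --mysql-port in the Sysbench command line. The sketch only echoes the command - drop the echo to actually run it.

```shell
# Hypothetical glbd invocation; node IPs are placeholders.
# glbd listens on 127.0.0.1:3307 and balances connections across the nodes.
NODES="10.0.0.1:3306 10.0.0.2:3306 10.0.0.3:3306 10.0.0.4:3306"
echo glbd --threads 4 127.0.0.1:3307 $NODES
```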
Here's the Sysbench command line used for benchmarking the NDB cluster (for Galera, substitute "innodb" for "ndbcluster"):
./sysbench --report-interval=60 --test=tests/db/oltp.lua \
    --oltp-tables-count=15 --oltp-table-size=2000000 \
    --rand-init=on --oltp-read-only=off --rand-type=uniform \
    --max-requests=0 --mysql-port=3307 --mysql-host=127.0.0.1 \
    --mysql-user=test --mysql-password=testpass \
    --mysql-table-engine=ndbcluster --max-time=600 \
    --num-threads=$users run
Numbers of users tested: 4, 8, 16, 32, 64 and 128.
It is worth noting that in this rather modest setup NDB's data partitioning abilities do not really help that much: in a Galera node I can allocate up to 6Gb for the InnoDB buffer pool, so that ~75% of the data fits in memory. With more partitions and bigger data things may be different. Or maybe not - we don't know how much overhead additional partitions may incur.
The nice thing about NDB is that you don't need to warm it up. It performs at top speed right away. And it is also very consistent. You have very little variation in performance over time.
The only issue during benchmarking surfaced at 64 and 128 concurrent users, where Sysbench was unable to establish all requested connections, apparently due to some race condition. I had to edit sysbench.c to add usleep(100000); in the client thread launching loop. That cured the problem for me. (In hindsight, this issue could have been caused by the load balancer. It slipped my mind to check it with direct connections.)
It is interesting that, almost regardless of the cluster size:
- it gets saturated at around 32 concurrent connections
- only after that does it start to scale
The latter is understandable, as only at the saturation point can we see the benefit of added processing units. Scaling is actually not that bad: a 50% throughput increase from 2 to 4 nodes at 128 users.
The latency chart suggests that 2-node cluster saturation started at 16 connections, although it was only mildly pronounced. The latencies themselves, however, are outrageous: they start at 220ms - the latency I've got for transatlantic WAN replication - and go all the way into seconds... While up to the saturation point those latencies must be dominated by network latencies, the apparent scale-out after 32 users shows that the load becomes increasingly CPU-bound. With 128 users CPU utilisation was reaching 50-60% per host (if you can trust top in EC2), which is quite high, but not really high enough to explain linear latency growth. This makes me suspect that the NDB code must be doing some synchronous IO inside critical section(s). Of course, visual inspection of the code is the preferred way to draw such conclusions, but of course I have my own code to attend to.
After the Sysbench patch, the Galera benchmark was rather uneventful. Except that:
- the m1.large instance that I used as a load generator became a bottleneck when benchmarking the 4-node cluster with 32 connections and above, so I had to change it to an m1.xlarge midway;
- the network "weather" in the eu-west-b2 availability zone started to visibly deteriorate by the time I reached the 3-node benchmarking. 4-connection transaction latencies started to grow in big (and almost equal) steps, from ~40ms to ~68ms to ~100ms and finally to ~135ms (not connected to the number of nodes). At some point one node even partitioned off, but then quickly rejoined thanks to incremental state transfer. So when you see that single-node latency is 4 times that of a 4-node cluster, it should be taken with a grain of salt. Obviously (and previous benchmarks support this), an unsaturated single node has a smaller latency than a 4-node cluster.
Yet, this reflects the cloud realities, and should not invalidate the benchmark, so:
Not much to comment on here, except to remind you about the network instability in EC2. However, starting at 32 users, the overall picture should not be affected by it.
Notice how at 128 users (heavy saturation) transaction latency is reduced almost proportionally by adding more nodes to the cluster. The same is true for the NDB cluster, and it confirms contention for CPU or some other local resource.
Well, Galera outperforms NDB by a factor of 4 in throughput and by a factor of 2 to 4 in latency. Now, I warned you that it was a layman's benchmark with layman's results. But it just shows what lay people are in for with this software.
Summarising the results obtained here and findings from Google, here's the last and probably the most important difference between NDB and Galera:
- Unlike Galera, NDB performs not so well on commodity hardware, quite poorly in virtual environments (whatever the hardware), and must be practically unusable over a WAN. This must be due to the very complex communication between the nodes and the resulting sensitivity to network latencies (and virtualization seems to add at least 1ms to the RTT even on high-end hardware). On the contrary, replication network performance is almost never a limiting factor in Galera, due to its very simple communication needs.
- While with Sysbench OLTP workload Galera scale-out is limited to 10 nodes and a factor of 5 tops, NDB (given the right hardware) might have a higher scale-out potential due to data partitioning. But that is just a hypothesis which we don't have resources to test.
UPDATE: Henrik Ingo has justly noted that the Sysbench SQL load is a rather bad match for the default NDB partitioning algorithm, which is by KEY (and which may work very nicely with some other load profile). Partitioning by RANGE might have produced much better results. However, Sysbench as it is does not seem to be a good use case here at all, as it is just a bunch of unrelated tables that either partition poorly or perfectly. Perhaps a TPC-C-like load would be a more realistic benchmark here.
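To make Henrik's point concrete, here is what such a repartitioning of one Sysbench table might look like in generic MySQL syntax. The table name follows the Sysbench convention, the range boundaries are assumptions for a 2M-row table, and NDB's support for user-defined partitioning types is version-dependent, so treat this strictly as illustration:

```sql
-- NDB's default is effectively PARTITION BY KEY (id), which hashes rows
-- across data nodes; an explicit RANGE scheme would instead keep adjacent
-- ids together.
ALTER TABLE sbtest1
    PARTITION BY RANGE (id) (
        PARTITION p0 VALUES LESS THAN (500000),
        PARTITION p1 VALUES LESS THAN (1000000),
        PARTITION p2 VALUES LESS THAN (1500000),
        PARTITION p3 VALUES LESS THAN (MAXVALUE)
    );
```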