We’ve come a long way, baby.
Full disclosure. I have worked for Nutanix in the performance engineering group since 2013. My opinions are likely biased, but that also gives me a decent amount of context when it comes to the performance of Nutanix storage over time. We already have a lot of customers running database workloads on Nutanix. But what about those high-performance databases still running on traditional storage?
I dug out a chart that I presented at .Next in 2017 and added to it the performance of a modern platform (AOS 6.0 on an NVMe+SSD platform). For this random read microbenchmark, performance has more than doubled. If you took a look at an HCI system even a few years back and decided that performance wasn’t where you needed it, there’s a good chance that the HW+SW systems shipping today could meet your needs.
Much more detail below.
Let’s take a look at another cluster in my lab, which runs a pre-production build of AOS. It also has a couple of Optane drives and an RDMA-capable NIC. What are the microbenchmark results for that 4-node system?
| Test | Result | Notes |
|---|---|---|
| 8KB Random Read IOPS | 1,160,551 IOPS | Partially cached. |
| 8KB Random Write IOPS | 618,775 IOPS | Fully replicated to flash media (RF2). |
| 1MB Sequential Write Throughput | 10.64 GByte/s | Capped by wire speed. Fully replicated to flash media (RF2). |
| 1MB Sequential Read Throughput | 26.35 GByte/s | 6.5 GByte/s per host. Data locality FTW! |
| 8KB Write Single IO latency | 0.2 ms | 200 microseconds, fully replicated to flash media (RF2). |
| 8KB Read Single IO latency | 0.16 ms | 160 microseconds, from persistent media (not cached). |
The things which stand out for me here are not just the giant IOPS and throughput numbers, but also the tiny latency (response time) for a single IO, especially the write latency of 200 microseconds, persisted to two pieces of flash on two different nodes in the cluster. The read latency of 160 microseconds is from the persistent Optane drive, not the CVM cache.
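To put the IOPS and throughput figures on a common scale, here is a small arithmetic sketch. It assumes "8KB" means 8 KiB (8,192 bytes) and reports decimal gigabytes per second; the function names are only illustrative. The per-host read figure works out slightly higher than the rounded 6.5 GByte/s quoted in the table.

```python
# Rough arithmetic on the microbenchmark table.
# Assumption: "8KB" is 8 KiB (8192 bytes); GByte/s is decimal (1e9 bytes).

def iops_to_gbytes_per_s(iops, io_size_bytes):
    """Convert an IOPS figure at a fixed IO size into GByte/s."""
    return iops * io_size_bytes / 1e9


def per_host(total, hosts):
    """Split a cluster-wide figure evenly across hosts."""
    return total / hosts


# 8KB random read: 1,160,551 IOPS across the 4-node cluster.
random_read_gbs = iops_to_gbytes_per_s(1_160_551, 8 * 1024)

# 1MB sequential read: 26.35 GByte/s across 4 hosts.
seq_read_per_host = per_host(26.35, 4)

if __name__ == "__main__":
    print(f"8KB random read  : {random_read_gbs:.2f} GByte/s cluster-wide")
    print(f"1MB seq read/host: {seq_read_per_host:.2f} GByte/s")
```

In other words, under these assumptions the small-block random read result alone moves several gigabytes per second, even before looking at the large-block sequential numbers.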
The IO numbers above are taken from fio which we deploy via our X-Ray tool. But we all know that microbenchmark numbers are just “hero” numbers under ideal conditions with a single IO size and pattern for each measurement.
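For readers who want to try something similar, here is a minimal sketch that renders a fio job file with one job per measurement in the table. Only the block sizes and access patterns come from the results above. The queue depths, runtime, IO engine, target device, and the choice of random access for the single-IO latency jobs are assumptions for illustration, not the settings that X-Ray uses.

```python
# Sketch: build a fio job file for six "hero number" style measurements.
# Block size and pattern follow the table; everything else is an assumption.
MEASUREMENTS = [
    # (job name, fio rw mode, block size, assumed iodepth)
    ("randread-8k", "randread", "8k", 32),
    ("randwrite-8k", "randwrite", "8k", 32),
    ("seqwrite-1m", "write", "1m", 8),
    ("seqread-1m", "read", "1m", 8),
    ("latency-write-8k", "randwrite", "8k", 1),  # single outstanding IO
    ("latency-read-8k", "randread", "8k", 1),    # single outstanding IO
]


def fio_job_file(target="/dev/sdX", runtime_s=60):
    """Render a fio job file; `target` is a placeholder device path."""
    lines = [
        "[global]",
        "ioengine=libaio",
        "direct=1",            # bypass the guest page cache
        f"filename={target}",
        "time_based=1",
        f"runtime={runtime_s}",
        "",
    ]
    for name, rw, bs, depth in MEASUREMENTS:
        # stonewall makes each job wait for the previous one to finish,
        # so every measurement runs with a single IO size and pattern.
        lines += [f"[{name}]", "stonewall", f"rw={rw}", f"bs={bs}",
                  f"iodepth={depth}", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(fio_job_file())
```

The write jobs overwrite whatever is on the target, so this should only ever be pointed at a scratch disk.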
Customers generally care a lot more about the IO performance as seen by their databases. The database workloads are generally a mix of IO sizes, IO depth, reads+writes and a larger working-set size than the microbenchmark tests. By the way, this is true of all storage vendor IO benchmarks.
Let’s take a look at two database benchmark results with different characteristics:
- A single Microsoft SQL Server running the HammerDB TPC-C-like workload on a single node.
- A set of Oracle databases running SLOB across a 4-node Nutanix cluster.
| Workload | Result | Notes |
|---|---|---|
| Single Microsoft SQL Server, single node. HammerDB TPC-C-like workload | ~100,000 IOPS at < 1 ms response time | Limited by 16 vCPU on SQL Server VM |
| 8 Oracle Databases running SLOB across 4 nodes | ~125,000 IOPS per node at < 1 ms response time | 500,000 IOPS on a 4-node cluster. |
Let’s take a look at how we got here (lots of cumulative performance improvements) and take a deeper look at the database results.
Part One: Selected optimizations
Step 1. Improve performance for large working-set sizes
We started by moving a significant amount of metadata activity out of our Cassandra layer and into a RocksDB-based metadata store. This greatly reduced the metadata lookup times when using large working sets, which are common with production databases. With this single change we doubled performance for two DB-centric use cases:
- Large working-set random read
- High-throughput DB loads and restores from backup
Here is the raw engineering data from experiments at the time. We call the feature Autonomous Extent Store (AES).
Step 2. Improve database log-write latency
Many databases are single-threaded at the transaction-log level. With a single-threaded workload, the only way to drive improvements is to reduce latency. A scale-out commodity hardware solution like Nutanix cannot use point-to-point NVRAM connections the way an HA-pair storage array can. The lack of a low-latency interconnect had limited our ability to support databases with low-latency write requirements.
We implemented RDMA passed through to the storage controller, along with NVMe drives, to dramatically reduce write latency. In modern configurations a single write can be persisted to NAND flash across two nodes in the cluster in around 200 µs. We dramatically improved our performance for low-latency workloads and paved the way for in-memory database certification.
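Why latency is the only lever for a single-threaded log writer follows from Little's Law: throughput equals the number of outstanding IOs divided by the time each one takes. The sketch below applies that to the roughly 200 microsecond write latency quoted above. The 1,000 and 500 microsecond rows are hypothetical comparison points, not measurements.

```python
def max_iops(latency_us, outstanding_ios=1):
    """Little's Law: throughput = concurrency / latency.

    With one outstanding IO (a single-threaded log writer), the IO rate
    is simply the reciprocal of the per-IO latency.
    """
    if latency_us <= 0:
        raise ValueError("latency must be positive")
    return outstanding_ios * 1_000_000 / latency_us


if __name__ == "__main__":
    # 200 us is the figure from the article; the others are hypothetical.
    for latency_us in (1000, 500, 200):
        print(f"{latency_us:>5} us per write -> "
              f"{max_iops(latency_us):,.0f} serialized log writes/s")
```

At a queue depth of one, halving the latency doubles the ceiling on log writes per second, and no amount of extra parallel capacity in the storage layer changes that ceiling.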
Step 3. Improve efficiency, reduce CPU pathlength and overheads
Now that we had removed a lot of the Nutanix-specific bottlenecks, we focused on some low-level optimizations at the OS/kernel level.
In 2020 we introduced Blockstore and SPDK, meaning that we now had a userland path all the way to NVMe devices without needing system calls to enter the kernel. Not only does this improve performance, it does so by increasing efficiency.
Our lab tests from that time show an improvement of 10-20% as measured from the database itself. Because these gains come from efficiency improvements, there is effectively no trade-off to achieve the performance boost. Only engineering effort from the core datapath team at Nutanix!
Step 4. Improve the lift & shift experience
Our most recent innovations for databases have been in the realm of improving performance for lift & shift DB workloads. Many customers have thousands of databases which they need to move from traditional infrastructure to a more modern foundation, whether that is public cloud or Nutanix. Until now it has been impossible to maximize performance without some refactoring, typically increasing parallelism by using multiple virtual disks.
With Nutanix disk-sharding we are able to drive parallelism on Nutanix storage without the admin having to refactor anything. All the work is done by Nutanix AOS inside our storage stack.
We improved performance for single-disk databases by more than 2x, bringing performance in line with a database that had been hand-optimized to make use of multiple virtual disks.
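As a purely conceptual illustration of how sharding a single virtual disk can expose parallelism, the sketch below maps byte offsets to shards by striping. This is not how AOS implements disk-sharding, which the article does not describe. The function name, the shard count, and the 1 MiB stripe size are all hypothetical.

```python
from collections import Counter


def shard_for_offset(offset_bytes, num_shards=4, stripe_bytes=1024 * 1024):
    """Map an offset on one virtual disk to a shard, round-robin by stripe.

    Hypothetical scheme for illustration only: each 1 MiB stripe of the
    disk belongs to one shard, and stripes rotate across the shards.
    """
    return (offset_bytes // stripe_bytes) % num_shards


# 8 KiB IOs covering the first 8 MiB of a single virtual disk.
offsets = range(0, 8 * 1024 * 1024, 8 * 1024)
load = Counter(shard_for_offset(o) for o in offsets)

if __name__ == "__main__":
    for shard, ios in sorted(load.items()):
        print(f"shard {shard}: {ios} IOs")
```

In this toy model, IO aimed at one virtual disk ends up spread evenly across four independent queues, which is the same effect an admin would otherwise get by manually splitting the database across four virtual disks.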
Part Two: The results
With these performance optimizations in place – have we reached the inflection point where most database workloads can run on Nutanix HCI?
Microsoft SQL Server
First, the most popular database on Nutanix: MS SQL Server running HammerDB (TPC-C-like workload). We see ~100,000 IOPS at less than 1 ms response time at the VM guest. The storage keeps all 16 cores of the SQL Server VM busy with database work.
Performance is consistent over many hours of running the benchmark.
Next, we explored a database consolidation scenario using Postgres DB. On a 4-node Nutanix cluster we scaled to 32 Postgres VMs driving in excess of 500,000 IOPS at less than 1 ms response time.
Finally, we ran the SLOB benchmark on Oracle DB. The Nutanix storage delivered more than 500,000 IOPS at less than 1 ms response time.
In conclusion – we have come a long way in the last few years. Since around 2018 we have been on a mission to deliver database performance on commodity HCI. We are still on that journey, but we are now able to successfully run workloads that would have been impossible 2-3 years ago. Are we (Nutanix) at the inflection point? I think we probably are. Time will tell.