Nutanix Performance for Database Workloads

We’ve come a long way, baby.

Full disclosure. I have worked for Nutanix in the performance engineering group since 2013. My opinions are likely biased, but that also gives me a decent amount of context when it comes to the performance of Nutanix storage over time. We already have a lot of customers running database workloads on Nutanix. But what about those high-performance databases still running on traditional storage?

I dug out a chart that I presented at .Next in 2017 and added to it the performance of a modern platform (AOS 6.0 on an NVMe+SSD platform). For this random read microbenchmark, performance has more than doubled. If you took a look at an HCI system even a few years back and decided that performance wasn’t where you needed it, there’s a good chance that the HW+SW systems shipping today could meet your needs.

Much more detail below.

Let’s take a look at another cluster I have in the lab, which is running a pre-production build of AOS. It also has a couple of Optane drives and an RDMA-capable NIC. What are the microbenchmark results for that 4-node system?

Workload | Result | Note
8KB Random Read IOPS | 1,160,551 IOPS | Partially cached
8KB Random Write IOPS | 618,775 IOPS | Fully replicated to flash media (RF2)
1MB Sequential Write Throughput | 10.64 GByte/s | Capped by wire speed; fully replicated to flash media (RF2)
1MB Sequential Read Throughput | 26.35 GByte/s | 6.5 GByte/s per host; data locality FTW!
8KB Write Single IO Latency | 0.2 ms | 200 microseconds; fully replicated to flash media (RF2)
8KB Read Single IO Latency | 0.16 ms | 160 microseconds; from persistent media (not cached)

Microbenchmark hero numbers

The things which stand out for me here are not just the giant IOPS and throughput numbers, but also the tiny latency (response time) for a single IO. Especially the write latency of 200 microseconds, persisted to two pieces of flash on two different nodes in the cluster. The read latency of 160 microseconds is from the persistent Optane drive, not the CVM cache.

Database Benchmarks

The IO numbers above are taken from fio, which we deploy via our X-Ray tool. But we all know that microbenchmark numbers are just “hero” numbers, generated under ideal conditions with a single IO size and pattern for each measurement.

Customers generally care a lot more about the IO performance seen by their databases. Database workloads are generally a mix of IO sizes, IO depths, and reads+writes, with a larger working-set size than the microbenchmark tests. By the way, this is true of all storage vendors’ IO benchmarks.
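
To make the microbenchmark setup a little more concrete, here is a rough sketch of the kind of fio job that produces an 8KB random read number like the one in the table above. The device path, queue depth, job count, and runtime are illustrative assumptions I’ve chosen, not the exact X-Ray job definition behind the published results.

# Sketch of an 8KB random-read microbenchmark (parameters are assumptions,
# not the exact X-Ray/fio job behind the published numbers).
fio --name=randread-8k \
    --filename=/dev/sdb \
    --ioengine=libaio --direct=1 \
    --rw=randread --bs=8k \
    --iodepth=32 --numjobs=8 \
    --time_based --runtime=300 \
    --group_reporting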

Let’s take a look at two database benchmark results with different characteristics:

  1. A single Microsoft SQL Server running the HammerDB TPC-C-like workload on a single node.
  2. A set of Oracle databases running SLOB across a 4-node Nutanix cluster.
Workload | Result | Note
Single Microsoft SQL Server, single node, HammerDB TPC-C-like workload | ~100,000 IOPS at <1ms response time | Limited by 16 vCPUs on the SQL Server VM
8 Oracle databases running SLOB across 4 nodes | ~125,000 IOPS per node at <1ms response time | 500,000 IOPS on a 4-node cluster

IOPS driven by database query engines (SQL Server, Oracle)

Let’s take a look at how we got here (lots of cumulative performance improvements) and take a deeper look at the database results.

Part One: Selected optimizations

Step 1. Improve performance for large working-set sizes

We started by moving a significant amount of metadata activity out of our Cassandra layer and into a RocksDB-based metadata store. This greatly reduced metadata lookup times when using large working sets, which are common with production databases. With this single change we doubled performance for two DB-centric use cases.

  • Large working-set random read
  • High-throughput DB loads and restores from backup

Here is the raw engineering data from experiments at the time.  We call the feature Autonomous Extent Store (AES).

AES Improvement for data load
AES Improvement for large working-set random read

Step 2. Improve database log-write latency

Many databases are single-threaded at the transaction-log level. With a single-threaded workload, the only way to drive improvements is by reducing latency. A scale-out, commodity-hardware solution like Nutanix cannot use point-to-point NVRAM connections the way an HA-pair storage array can. The lack of a low-latency interconnect had limited our ability to support databases with low-latency write requirements.

We implemented RDMA, passed through to the storage controller, along with NVMe drives to dramatically reduce write latency. In modern configurations a single write can be persisted to NAND flash across two nodes in the cluster in around 200 microseconds. This dramatically improved our performance for low-latency workloads and paved the way for in-memory database certification.

RDMA Improvement for low latency write

Step 3. Improve efficiency, reduce CPU path length and overheads

Now that we had removed a lot of the Nutanix-specific bottlenecks, we focused on some low-level optimizations at the OS/kernel level.

In 2020 we introduced Blockstore and SPDK, meaning that we now have a userland path all the way to NVMe devices without needing system calls to enter the kernel. Not only does this improve performance, it does so by increasing efficiency.

Our lab tests from that time show an improvement of 10-20% as measured from the database itself. Because these gains come purely from efficiency improvements, there is effectively no trade-off to achieve the performance boost, only engineering effort from the core datapath team at Nutanix!

Improvement in IOP performance
Improvement in latency

Step 4. Improve the lift & shift experience

Our most recent innovations for databases have been in the realm of improving performance for lift & shift DB workloads. Many customers have thousands of databases which they need to move from traditional infrastructure to a more modern foundation, whether that is public cloud or Nutanix. Until now it has been impossible to maximize performance without some re-factoring, typically increasing parallelism by using multiple virtual disks.

With Nutanix disk-sharding we are able to drive parallelism on Nutanix storage without the admin having to re-factor anything.  All the work is done by Nutanix AOS inside our storage stack.

We improved performance for single-disk databases by more than 2x, bringing performance in line with a database that had been hand-optimized to make use of multiple virtual disks.

IO Improvement for Single Disk Database VMs

Part Two: The results

With these performance optimizations in place, have we reached the inflection point where most database workloads can run on Nutanix HCI?

Microsoft SQL Server

First, the most popular database on Nutanix: Microsoft SQL Server running HammerDB (a TPC-C-like workload). We see ~100,000 IOPS at less than 1ms response time, measured from inside the VM guest. The storage keeps all 16 cores of the SQL Server VM busy doing database work.

Windows Resmon output running HammerDB

Performance is consistent over many hours of running the benchmark.

Output of Prism 2 hours into a 4 hour benchmark run.

PostgreSQL

Next, we explored a database consolidation scenario using PostgreSQL. On a 4-node Nutanix cluster we scaled to 32 Postgres VMs driving in excess of 500,000 IOPS at less than 1ms response time.

Multiple Postgres databases running pgbench on a 4-node cluster.

Oracle

Finally, we ran the SLOB benchmark on Oracle DB. The Nutanix storage delivered more than 500,000 IOPS at less than 1ms.

Conclusion

In conclusion, we have come a long way in the last few years. Since around 2018 we have been on a mission to deliver database performance on commodity HCI. We are still on that journey, but we can now successfully run workloads that would have been impossible 2-3 years ago. Are we (Nutanix) at the inflection point? I think we probably are. Time will tell.

Using rwmixread and rate_iops in fio

Creating a mixed read/write workload with fio can be a bit confusing. Assume we want to create a fixed-rate workload of 100 IOPS, split 70:30 between reads and writes.

Don’t mix rwmixread and rate_iops
TL;DR

Specify the rate directly with rate_iops=<read-rate>,<write-rate>; do not try to combine rwmixread with rate_iops. For the example above, use:

rate_iops=70,30 
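
As a fuller (hypothetical) example, the complete fio command line might look like the sketch below; the target device, block size, queue depth, and runtime are placeholders I’ve chosen for illustration.

# Fixed-rate 70:30 read/write workload, 100 IOPS total.
# Note there is no rwmixread option; rate_iops alone sets the mix.
# Device path, block size, queue depth and runtime are placeholder assumptions.
fio --name=fixed-rate-mixed \
    --filename=/dev/sdb \
    --ioengine=libaio --direct=1 \
    --rw=randrw --bs=8k --iodepth=4 \
    --rate_iops=70,30 \
    --time_based --runtime=120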

Additionally, older versions of fio exhibit problems when using rate_poisson with rate_iops. fio version 3.7, which I was using, did not exhibit the problem.


Cross rack network latency in AWS

I have VMs running on bare-metal instances. Each bare-metal instance is in a separate rack by design (for fault tolerance). The bandwidth is 25GbE; however, the response time between the hosts is high enough that I need multiple streams to consume that bandwidth.

Compared to my local on-prem lab, I need many more streams to get the observed throughput close to the theoretical bandwidth of 25GbE.

# iperf Streams | AWS Throughput | On-Prem Throughput
1 | 4.8 Gbit | 21.4 Gbit
2 | 9 Gbit | 22 Gbit
4 | 18 Gbit | 22.5 Gbit
8 | 23 Gbit | 23 Gbit

Difference in throughput for a 25GbE network on-premises vs AWS cloud (inter-rack)
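
For reference, results like those in the table can be reproduced with iperf’s parallel-stream option; the commands below are a sketch using a placeholder server address rather than the exact invocation behind the table.

# On the receiving host (placeholder address 10.0.0.2):
iperf -s

# On the sending host, vary the number of parallel streams (-P)
# to match the rows in the table:
iperf -c 10.0.0.2 -P 1 -t 30
iperf -c 10.0.0.2 -P 8 -t 30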

How to measure database scaling & density on the Nutanix HCI platform

How can database density be measured?

  • How does database performance behave as more DBs are consolidated?
  • What impact does running the CVM have on available host resources?

tl;dr

  • The cluster was able to achieve ~90% of the theoretical maximum.
  • CVM overhead was 5% for this workload.

Experiment setup

The goal was to establish how database performance is affected as additional database workloads are added to the cluster. As a secondary metric, we measure the overhead of running the virtual storage controller (CVM) on the same hosts as the database servers themselves. We use the Postgres database with the pgbench workload and measure the total transactions per second.

Cluster configuration

  • 4-node Nutanix cluster, with 2x Xeon CPUs per host and 20 cores per socket.

Database configuration

Each database is identically configured with:

  • Postgres 9.3
  • Ubuntu Linux
  • 4 vCPU
  • 8GB of memory
  • pgbench benchmark, running the “simple” query set.

The database is sized so that it fits entirely in memory. This is a test of CPU/Memory not IO.
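
As a rough sketch, the per-VM pgbench run could look like the commands below. The scale factor, run time, and my reading of the “simple” query set as pgbench’s built-in simple-update script (-N) are assumptions, not the exact parameters used in the experiment.

# Hypothetical per-VM pgbench run (parameters are assumptions).
# Initialize a dataset small enough to fit in the VM's 8GB of memory
# (scale factor 100 is roughly 1.5GB):
pgbench -i -s 100 pgbench_db

# Drive a CPU-bound load with one client per vCPU for 10 minutes,
# using the built-in simple-update script (-N):
pgbench -c 4 -j 4 -T 600 -N pgbench_db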

Experiment steps.

The experiment starts with a single database on a single host. We add more databases into the cluster until we reach 40 databases in total. At 40 databases, with 4 vCPUs each and a CPU-bound workload, we use all 160 CPU cores in the cluster.

The database is configured to fit into the host DRAM memory, and the benchmark runs as fast as it can – the benchmark is CPU bound.

Results

Below are the measured results from running 1-40 databases on the 4 node cluster.

Performance scales almost linearly from 4 to 160 vCPUs, with no obvious bottlenecks before all of the CPU cores in the cluster are saturated at 40 databases.

Scaling from 1 database to 40 on a 4-node cluster.

How to run vdbench benchmark on any HCI with X-Ray

Many storage performance testers are familiar with vdbench and wish to use it to test hyper-converged infrastructure (HCI) performance. To accurately test HCI performance you need to deploy workloads on all HCI nodes. However, deploying multiple VMs and coordinating vdbench can be tricky, so with X-Ray we provide an easy way to run vdbench at scale. Here’s how to do it.
