Comparing RDS and Nutanix Cluster performance with HammerDB

tl;dr

In a recent experiment using Amazon RDS instance and a VM running in an on-prem Nutanix cluster, both using Skylake class processors with similar clock speeds and vCPU count. The SQLServer database on Nutanix delivered almost 2X the transaction rate as the same workload running on Amazon RDS.

It turns out that migrating an existing SQLServer VM to RDS using the same vCPU count as on-prem may yield only half the expected performance for CPU heavy database workloads. The root cause is how Amazon thinks about vCPU compared to on-prem.

Benchmark Results

Platform	CPU Type	Storage Type	vCPU	Memory	TPM
NX-8035-G6	Skylake (HT Enabled)	Nutanix AOS NVME+SSD	16	128	~900,000
RDS R5.4xlarge	Skylake (HT Enabled)	EBS io1 50K IOPS Provisioned	16	128	~500,000
RDS R5.8xlarge	Skylake (HT Enabled)	EBS io1 50K IOPS Provisioned	32	256	~800,000

Skylake Platform Comparative performance

Above are the results from running HammerDB against SQLserver 2019 on Windows 2016 on RDS and a Nutanix cluster running AHV and AOS 6.1. In both cases the HammerDB and SQLserver configuration are identical. The number of warehouses is deliberately small (1000 warehouses) which ensures that the majority of the table rows are cached in memory. Nutanix VM is twice as fast as the similarly configured RDS instance. RDS delivers ~500,000 TPM while Nutanix delivers ~900,000 TPM.

Benchmark Analysis

The RDS R5.4xlarge database instance and the SQLserver VM on the Nutanix node were both configured with 16 vCPU and 128GB of memory. Despite that, the Nutanix VM was almost twice as fast yielding ~900,000 TPM Vs ~500,000 TPM for the RDS instance. When I observed the RDS instance the bottleneck was clearly lack of CPU resources. In the below screenshot the SQL Server performance Dashboard shows 100% CPU utilization for the duration of the workload on RDS R5.4xlarge.

The same metric from the SQL VM on Nutanix shows 80-90% utilization while delivering twice as many DB transactions despite being the same CPU family (more or less same clock speed) and the same number of vCPU given to the VM.

Question #1 – Why the disparity?

The r5.4xlarge RDS instance has 16vCPU

Digging in a little further – it turns out that these are configured as 8 cores with 2 threads per core

The Nutanix VM is also configured to use 16 vCPU and the underlying physical CPU package has hyperthreading enabled. However, when there are available CPU resources on the host, the VM will have access to 16 physical cores. When there is a lot of CPU contention the VM might have to get by with 16 HT threads on 8 physical cores.

In the RDS case – the VM only ever gets access to 8 physical cores (16 vCPU/2 hyper-threads per core).

Question #2 Which is better?

You can make the argument that it is better to have consistently slow performance because you are guaranteed to always go at the same rate. But 20 years of running x86 virtualization in the datacenter has taught us that in general applications tend not to ask for resources at exactly the same time. As long as there is enough capacity in the cluster, we can let the cluser-wide scheduler figure out where our CPU heavy VMs should run such that they have the best chance of utilizing the full 16 cores.

In short on a Nutanix cluster – if there are available CPU resources the database will run as fast as the real cores can manage. In RDS the database will never go faster than whatever 8 cores can deliver.

What can I do to have RDS match the Nutanix performance?

Tell RDS to give me 1 CPU thread per vCPU?

I looked for an option for 1 thread per vCPU on RDS but was not able to find a way to do it with SQLserver. It is possible with Oracle DB on RDS but the same option seemed not to be available for SQLserver.

Use a bigger RDS Instance 32 vCPU!

The only way to give SQLserver access to more real cores is to give it more vCPU by using a larger instance – unfortunately this also increases the SQLServer licensing cost by around 2X. I changed the instance from r5.4xlarge to r5.8xlarge – so the vCPU count went from 16 to 32. I also got 2x the memory, which I didn’t need but – I did get more cores which I did need. Of course the hardware and SQL licensing costs increased dramatically.

Benchmark Result: RDS Performance with 32 vCPU

When we use a larger instance size the CPU is no longer pegged at 100% – it sits around 75%-80% busy as seen by SQLserver activity monitor. Despite doubling the vCPU the database performance still does not match the performance of the on-prem SQL VM running on Nutanix. My 16 vCPU on Nutanix produced ~900,000 TPM and my 32 vCPU RDS gave me ~800,000 TPM.

Since the CPU is no longer maxed-out, the bottleneck must have moved elsewhere

What is the new bottleneck?

We see that now the bottleneck is the time to write to the transaction log. Despite choosing an io2 volume with 50,000 IOPS limit – the RDS DB is only able to push 10K IOPS across both the TX log and the DB data files which reside on the same volume (not sure if this is tunable in RDS or not)

The total IOPS seen by RDS is around 10,000 (not 50,000) so why am I limited by the TX log if I am nowhere near the storage IOP limit? As we expect the TX log is not limited by total IOP capacity but by TX log latency (the extreme version is seen in single-threaded workloads)

The latency seen by the RDS SQL VM on EBS storage is about 2ms whereas my Nutanix transaction log is response time is around 1ms.

On my Nutanix VM the database uses a separate transaction log, database files and tempdb disks because that is the default best-practice on Nutanix. I do not believe separating transaction log, database files and tempdb files across disk is possible on RDS. Of course you could choose to use a raw EC2 instance and provision your own EBS volumes – but then you would need to pay for 3x separate RDS volumes, and the since with EBS provisioned IOPS scale with size – all of the EBS volumes would need to be large, even though transaction log sizes are often small.

n0derunner