Comparing RDS and Nutanix Cluster performance with HammerDB

tl;dr

In a recent experiment I compared an Amazon RDS instance with a VM running in an on-prem Nutanix cluster, both using Skylake-class processors with similar clock speeds and the same vCPU count. The SQL Server database on Nutanix delivered almost 2X the transaction rate of the same workload running on Amazon RDS.

It turns out that migrating an existing SQL Server VM to RDS using the same vCPU count as on-prem may yield only half the expected performance for CPU-heavy database workloads. The root cause is how Amazon counts vCPUs (on EC2/RDS a vCPU is a hyperthread, not a full core) compared to what is typically assumed on-prem.
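A quick way to see the difference (a sketch, assuming a Linux guest on an equivalent EC2 instance type, since RDS itself offers no shell access) is to inspect the CPU topology the instance exposes:

# Compare the topology of the on-prem VM with an equivalent EC2 instance.
# "Thread(s) per core: 2" means each vCPU is a hyperthread, not a full core.
lscpu | grep -E 'Thread|Core|Socket'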

Benchmark Results

HammerDB results from RDS and Nutanix

Single threaded DB performance on Nutanix HCI

tl;dr

A Nutanix cluster can persist a replicated write across two nodes in around 250 microseconds, which is critical for single-threaded, commit-heavy DB write workloads. That performance compares very well with hosted cloud database instances using the same class of processor (db.r5.4xlarge in the figure below). The metrics below are for SQL insert transactions, not the underlying IO.

Single threaded commit heavy insert rates. Latency as seen from SQL insert statement.

Cross rack network latency in AWS

In AWS I have VMs running on bare-metal instances. Each bare-metal instance is in a separate rack by design (for fault tolerance). The network bandwidth is 25 GbE; however, the response time between the hosts is high enough that I need multiple parallel streams to consume that bandwidth.

Compared to my local on-prem lab, I need many more streams to get the observed throughput close to the theoretical bandwidth of 25 GbE.

# iperf Streams    AWS Throughput    On-Prem Throughput
1                  4.8 Gbit          21.4 Gbit
2                  9 Gbit            22 Gbit
4                  18 Gbit           22.5 Gbit
8                  23 Gbit           23 Gbit

Difference in throughput for a 25GbE network, on-premises vs AWS cloud (inter-rack)
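For reference, a test of this kind can be sketched with iperf3 (the address below is a placeholder, and the stream counts mirror the table above):

# on the receiving host
iperf3 -s

# on the sending host, vary the number of parallel streams with -P
# (10.0.0.10 is a placeholder for the remote host)
iperf3 -c 10.0.0.10 -P 1 -t 30
iperf3 -c 10.0.0.10 -P 8 -t 30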

Duplicate IP issues with Linux and virtual machine cloning.

TL;DR – Some modern Linux distributions use a newer method of machine identification (the systemd machine-id) which, when combined with DHCP, can result in duplicate IP addresses when cloning VMs, even when the VMs have unique MAC addresses.

To resolve, do the following (remove the file, run the systemd-machine-id-setup command, then reboot):

# rm /etc/machine-id
# systemd-machine-id-setup
# reboot

When hypervisor management tools make clones of virtual machines, the tools usually make sure to create a unique MAC address for every clone. Combined with DHCP, this is normally enough for the clones to boot and receive unique IPs. Recently, when I cloned several Bitnami guest VMs, which are based on Debian, I started to get duplicate IP addresses on the clones. The issue can be resolved manually by following the procedure above.

To create a VM template to clone from which will generate a new machine-id for every clone, simply create an empty /etc/machine-id file (do not rm the file, otherwise the machine-id will not be generated):

# echo "" | tee /etc/machine-id
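After cloning and booting, a quick sanity check (my own sketch; the interface name eth0 is an assumption and may differ) is to confirm that each clone now reports a unique machine-id and holds its own DHCP lease:

# should print a different value on each clone
cat /etc/machine-id

# confirm each clone received its own address (replace eth0 with your interface)
ip -4 addr show dev eth0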

The machine-id man page is a well-written explanation of the implementation and motivation.

Paper: A Nine year study of filesystem and storage benchmarking

A 2007 paper that still has a lot to say on the subject of benchmarking storage and filesystems. It is primarily aimed at researchers and developers, but it is relevant to anyone about to embark on a benchmarking effort.

  • Use a mix of macro and micro benchmarks
  • Understand what you are testing; cached results are fine – as long as that is what you had intended.

The authors are clear on why benchmarks remain important:

“Ideally, users could test performance in their own settings using real workloads. This transfers the responsibility of benchmarking from author to user. However, this is usually impractical because testing multiple systems is time consuming, especially in that exposing the system to real workloads implies learning how to configure the system properly, possibly migrating data and other settings to the new systems, as well as dealing with their respective bugs.”

We cannot expect end-users to be experts in benchmarking. It is our duty as experts to provide the tools (benchmarks) that enable users to make purchasing decisions without requiring years of benchmarking expertise.

Workaround for bios.hddOrder when creating an OVF/OVA template.

When changing SCSI devices in an ESX-based VM, it's easy to screw up the ability to boot. The simple fix is to add

bios.hddOrder = "scsi0:0"

to the end of the .vmx file. This has always worked for me. The problem with this solution is that any OVF/OVA created from the VM will not include the .vmx hack, and of course VMs created from the template will not boot until their .vmx file is hand-edited.

The solution that worked for me was to simply make the "boot drive" the first .vmdk file listed in the .vmx file. In my case, the Linux OS is stored on the VMDK named "disk.vmdk".

In the "Before" listing below, this disk appears last (even though it has SCSI ID 0:0:0) and the VM does not boot.

I simply change the filename of the first disk entry from disk_6.vmdk to disk.vmdk (and change the last entry from disk.vmdk to disk_6.vmdk), as shown in the "After" listing.

The beauty of this method is that the ordering is maintained when creating an OVF/OVA.

When the VM boots, the /dev/sd* device names may change since the vmdk's are now attached to different SCSI devices, so mounting by UUID in Linux helps keep things sane.
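For example (a sketch; the UUID and mount point shown are made up for illustration), list the filesystem UUIDs with blkid and reference them in /etc/fstab instead of /dev/sd* names:

# list filesystem UUIDs for the attached disks
blkid

# then mount by UUID in /etc/fstab, e.g. (hypothetical UUID and mount point):
# UUID=2f6b1c4e-0d1a-4c3e-9a7b-3c5d7e9f1a2b   /data   ext4   defaults   0 2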

Before

floppy0.fileName = "Floppy 0"
ide1:0.startConnected = "FALSE"
ide1:0.deviceType = "atapi-cdrom"
ide1:0.clientDevice = "TRUE"
ide1:0.fileName = "CD/DVD drive 0"
ide1:0.present = "TRUE"
scsi3:0.deviceType = "scsi-hardDisk"
scsi3:0.fileName = "disk_6.vmdk"
scsi3:0.present = "TRUE"
scsi3:1.deviceType = "scsi-hardDisk"
scsi3:1.fileName = "disk_1.vmdk"
scsi3:1.present = "TRUE"
scsi2:0.deviceType = "scsi-hardDisk"
scsi2:0.fileName = "disk_2.vmdk"
scsi2:0.present = "TRUE"
scsi2:1.deviceType = "scsi-hardDisk"
scsi2:1.fileName = "disk_3.vmdk"
scsi2:1.present = "TRUE"
scsi1:0.deviceType = "scsi-hardDisk"
scsi1:0.fileName = "disk_4.vmdk"
scsi1:0.present = "TRUE"
scsi1:1.deviceType = "scsi-hardDisk"
scsi1:1.fileName = "disk_5.vmdk"
scsi1:1.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "disk.vmdk"
scsi0:0.present = "TRUE"
vmci0.pciSlotNumber = "32"
ethernet0.virtualDev = "vmxnet3"

After

floppy0.fileName = "Floppy 0"
ide1:0.startConnected = "FALSE"
ide1:0.deviceType = "atapi-cdrom"
ide1:0.clientDevice = "TRUE"
ide1:0.fileName = "CD/DVD drive 0"
ide1:0.present = "TRUE"
scsi3:0.deviceType = "scsi-hardDisk"
scsi3:0.fileName = "disk.vmdk"
scsi3:0.present = "TRUE"
scsi3:1.deviceType = "scsi-hardDisk"
scsi3:1.fileName = "disk_1.vmdk"
scsi3:1.present = "TRUE"
scsi2:0.deviceType = "scsi-hardDisk"
scsi2:0.fileName = "disk_2.vmdk"
scsi2:0.present = "TRUE"
scsi2:1.deviceType = "scsi-hardDisk"
scsi2:1.fileName = "disk_3.vmdk"
scsi2:1.present = "TRUE"
scsi1:0.deviceType = "scsi-hardDisk"
scsi1:0.fileName = "disk_4.vmdk"
scsi1:0.present = "TRUE"
scsi1:1.deviceType = "scsi-hardDisk"
scsi1:1.fileName = "disk_5.vmdk"
scsi1:1.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "disk_6.vmdk"
scsi0:0.present = "TRUE"
vmci0.pciSlotNumber = "32"
ethernet0.virtualDev = "vmxnet3"

Note: I tried editing the .ovf file, adding a key:value pair, regenerating the SHA1, and stashing the new SHA1 in the .mf file. The process worked, but the VM still did not boot, and the bios.hddOrder parameter was not present in the .vmx file of the VM created from the template.

SATA on Nutanix. Some experimental data.

The question of why Nutanix uses SATA drives comes up sometimes, especially from customers who have experienced very poor performance using SATA on traditional arrays.

I can understand this anxiety. In my time at NetApp we exclusively used SAS or FC-AL drives in performance test work. At the time there was a huge difference in performance between SCSI and SATA. Even a few short years ago, FC drives typically spun at 15K RPM whereas SATA was stuck at about 5K RPM, so SATA experienced roughly 3X the rotational delay.

These days SAS and SATA are both available in 7200 RPM configurations, and these are the type we use in standard Nutanix nodes. In fact, the SATA drives that we use are marketed by Seagate as "Nearline SAS" or NL-SAS, mainly to differentiate them from the consumer-grade SATA drives found in cheap laptops. There are hundreds of SAS vs SATA articles on the web, so I won't go over the theoretical/historical arguments.

SATA in Hybrid/Tiered Storage

In a Nutanix cluster the "heavy lifting" of IO is mainly done by the SSDs, leaving the SATA drives to service the few remaining IOs that miss the SSD tier. Under moderate load the SATA spindles do pretty well, and since SATA $/GB is only 60% of SAS, SATA seems like a good choice for mostly-cold data.

Let’s Experiment.

From a performance perspective, I decided to run a few experiments to see just how well SATA performs. In the test, the SATA drives are Nutanix standard drives "ST91000640NS" (Seagate, priced around $150 USD). The comparable SAS drives are the same form factor (2.5 inch), "AL13SEB900" (Toshiba, priced at about $250 USD), and spin at 10K RPM. Both drives hold around 1TB.

There are three experiments per drive type, designed to reveal the impact of seek times. This is achieved using the "filesize" parameter of fio, which determines the LBA range to read. One thing to note is that I use a queue depth of one; therefore IOPS can be calculated as simply 1/response-time (converted to seconds). For example, an average response time of 8 ms corresponds to 1/0.008 = 125 IOPS per spindle.

[global]
# 8KB random reads, one IO outstanding (queue depth 1)
bs=8k
rw=randread
iodepth=1
ioengine=libaio
time_based
runtime=10
direct=1
# filesize bounds the LBA range read, i.e. the working set;
# it was set to 1g, 100g and 1000g for the three experiments
filesize=1g

[randread]
# raw device under test
filename=/dev/sdf1
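To reproduce a run, save the job file above (randread.fio is just my name for it) and point fio at it, editing filesize between runs to move through the three working-set sizes:

# run the job file; change filesize in [global] to 1g, 100g or 1000g between runs
fio randread.fio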

Random Distribution. SATA vs SAS

Working Set Size    7.2K RPM SATA Response Time (ms)    10K RPM SAS Response Time (ms)
1 GB                5.5                                 4
100 GB              7.5                                 4.5
1000 GB             12.5                                7

Zipf Distribution. SATA Only.

Working Set Size    Response Time (ms)
1000 GB             8.5

Somewhat intuitively, as the working set (and therefore the seek distance) gets larger, the gap between "real SAS" and NL-SAS/SATA gets wider. This is intuitive because with a 1 GB working set the seek time is close to zero, so only the rotational delay (a function of RPM) is a factor. In fact, the difference in response time is about the same as the difference in rotational speed (roughly 1:1.4).

Also (just for fun) I used the "random_distribution=zipf" option in fio to test the response time when reading across the entire range of the disk, but with a "hotspot" (zipf) access pattern rather than a uniform random read, since fully uniform random reads across a whole disk are pretty unrealistic.
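As a sketch, an equivalent zipf run can be expressed entirely on the fio command line using the same job parameters (the theta value of 1.2 is my own illustrative choice, not necessarily the one used for the result below):

# same 8k random-read, queue-depth 1 test, but with a zipf "hotspot" pattern
# across the full 1000g range (zipf theta of 1.2 is an assumed value)
fio --name=zipfread --filename=/dev/sdf1 --bs=8k --rw=randread \
    --iodepth=1 --ioengine=libaio --direct=1 --time_based --runtime=10 \
    --filesize=1000g --random_distribution=zipf:1.2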

In this more realistic case, reading across the entire disk, the SATA drives shipped with Nutanix nodes deliver an 8.5 ms response time, at around 125 IOPS per spindle.

Conclusion

The performance difference between SAS and SATA is often overstated. At moderate loads SATA performs well enough for most use cases. Even when delivering fully random 8K reads over the entire disk, SATA responds in less than 15 ms. Using a more realistic (not 100% uniform random) access pattern, the response time is under 10 ms.

For a properly sized Nutanix implementation, the intent is to service most IO from flash; it's OK to push some work to HDD from time to time, even on SATA.