HCI Performance testing made easy (Part 2)

Screenshot of results in X-Ray

Today we will use the simplest workload that X-Ray provides, the “Four Corners” benchmark.  This is the classic storage benchmark of Random Read/Write and Sequential Read/Write.  Most people understand that this workload tells us very little about how the storage will behave under real workloads, but most people also want to know how fast the storage will go.

Here’s a video of the same process:

First, select the “Four Corners Microbenchmark” from the test list.  The “Four Corners” test is supplied with X-Ray.  Of course you can edit the parameters if you wish.

Then select the target cluster to run the test on, and add it to the test queue for execution.

The results update in real time.  X-Ray first creates the test VMs and powers them on…

If I want to compare different runs, X-Ray has the “Analyze” button.  In my case I am using an engineering build of the product and comparing the same platform with different tuning.  The compare/analyze feature can be useful for comparing different platforms, hypervisors, or HCI vendors, since X-Ray can run on pretty much anything that presents a datastore to vCenter, as well as on Nutanix AOS/Prism.

This result would seem to show that the tuning performed in experiment #2 gave a large improvement in Random Write IOPS and did not negatively affect the other results (Random Read, Sequential Read, and Sequential Write).

I can also look at the particular parameters of this test by selecting Actions->Test Logs.

For instance, I can look at the Random Read parameters (these are standard fio configuration files):
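
To make that concrete, here is a minimal sketch of what a random-read job in standard fio format looks like.  The block size, queue depth, runtime, and device path below are illustrative placeholders, not necessarily the values the supplied test uses.

    ; Sketch of a random-read fio job (values are illustrative placeholders)
    [global]
    ; asynchronous IO, bypassing the page cache so the storage itself is measured
    ioengine=libaio
    direct=1
    time_based
    runtime=120

    [random-read]
    rw=randread
    bs=8k
    iodepth=64
    ; assumption: /dev/sdb is the data disk attached to the test VM
    filename=/dev/sdb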

I can also look at the overall “Four Corners” test configuration, which is specified as YAML:
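
To give a feel for the shape of that file, here is a rough, hypothetical outline of a four-corners style scenario: VM sizing plus an ordered list of fio workloads.  The field names are my own illustration and not X-Ray’s exact schema; the real file can be inspected via Actions->Test Logs as described above.

    # Hypothetical outline only; field names are illustrative, not X-Ray's actual schema
    name: four_corners_example
    display_name: "Four Corners Microbenchmark (example)"
    vms:
      vcpus: 4
      ram_mb: 4096
      data_disks: { count: 6, size_gb: 32 }
    workloads:
      - { name: random_read,      fio_config: random_read.fio }
      - { name: random_write,     fio_config: random_write.fio }
      - { name: sequential_read,  fio_config: sequential_read.fio }
      - { name: sequential_write, fio_config: sequential_write.fio }
    run:
      - run_workload: random_read
      - run_workload: random_write
      - run_workload: sequential_read
      - run_workload: sequential_write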

 

HCI Performance testing made easy (Part 1)

In this short series I will describe how to perform performance and resiliency tests on an HCI cluster using X-Ray.

X-Ray can do the following for the performance tester:

  • Model IO workloads using standard fio format
  • Create VMs based on user-specified criteria (CPUs, Memory, Number & Size of disks)
  • Provision the VMs to an HCI cluster (Nutanix AHV, ESXi, Hyper-V)
  • Execute the workloads
  • Display and store the results

In particular, X-Ray gives additional benefits that most workload generators do not:

  • Specify and deploy workloads with different IO patterns and characteristics
    • Most workload generators create a uniform workload on all workers
  • Execute and terminate sub-workloads on a user-specified timeline
    • e.g. Begin workload 1, then introduce workload 2 and measure the interference (see the sketch after this list)
  • Introduce failure scenarios and measure the impact to performance
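
As a rough illustration of the timeline idea: X-Ray drives this from its own test definition, but plain fio can approximate it by staggering job start times with startdelay.  The values and device path below are placeholders.

    ; Sketch only: stagger a second workload to observe its interference with the first.
    ; X-Ray orchestrates this from its test definition; this is just the plain-fio analogue.
    [global]
    ioengine=libaio
    direct=1
    time_based
    runtime=180
    ; assumption: /dev/sdb is a dedicated test device
    filename=/dev/sdb

    [workload1-random-read]
    rw=randread
    bs=8k
    iodepth=32

    [workload2-sequential-write]
    rw=write
    bs=1m
    iodepth=8
    ; begin 60 seconds into workload1 and watch the effect on its IOPS and latency
    startdelay=60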

Here’s a video of X-Ray in action: I export an existing X-Ray test, edit it to create a new test, then upload and execute the new test.

 

The files are in my X-Ray GitHub repository.

Storage Bus Speeds 2018

Storage bus speeds with example storage endpoints.

| Bus    | Lanes | End-Point                     | Theoretical Bandwidth (MB/s) | Note                                 |
|--------|-------|-------------------------------|------------------------------|--------------------------------------|
| SAS-3  | 1     | HBA <-> Single SATA Drive     | 600                          | SAS3 <-> SATA 6Gbit                  |
| SAS-3  | 1     | HBA <-> Single SAS Drive      | 1200                         | SAS3 <-> SAS3 12Gbit                 |
| SAS-3  | 4     | HBA <-> SAS/SATA Fanout       | 4800                         | 4-lane HBA to breakout (6 SSD) [2]   |
| SAS-3  | 8     | HBA <-> SAS/SATA Fanout       | 8400                         | 8-lane HBA to breakout (12 SSD) [1]  |
| PCIe-3 | 1     | N/A                           | 1000                         | Single-lane PCIe3                    |
| PCIe-3 | 4     | PCIe <-> SAS HBA or NVMe      | 4000                         | Enough for a single NVMe device      |
| PCIe-3 | 8     | PCIe <-> SAS HBA or NVMe      | 8000                         | Enough for 4 lanes of SAS-3          |
| PCIe-3 | 40    | PCIe Bus <-> Processor Socket | 40000                        | Xeon direct connect to PCIe bus      |

 

Notes


All figures here are the theoretical maximums for the busses, using rough/easy conversions between bits/s and bytes/s.  That is enough to figure out where the throughput bottlenecks are likely to be in a storage system.

  1. SATA devices contain a single SAS/SATA port (connection), and even when they are connected to a SAS3 HBA, the SATA protocol limits each SSD device to ~600MB/s (single port, 6Gbit)
  2. SAS devices may be dual ported (two connections to the device from the HBA(s)) – each with a 12Gbit connection giving a potential bandwidth of 2x12Gbit == 2.4Gbyte/s (roughly) per SSD device.
  3. An NVMe device directly attached to the PCIe bus has access to a bandwidth of 4GB/s by using 4 PCIe lanes – or 8GB/s using 8 PCIe lanes.  On current Xeon processors, a single socket attaches to 40 PCIe lanes directly (see diagram below) for a total bandwidth of 40GB/s per socket.
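
As a quick sanity check, the rough arithmetic behind the table (treating 10 bits per byte to absorb protocol/encoding overhead) works out as follows:

  • SAS-3 lane: 12 Gbit/s ÷ 10 ≈ 1200 MB/s, so 4 lanes ≈ 4800 MB/s; a SATA device at 6 Gbit/s ≈ 600 MB/s
  • PCIe 3.0 lane: roughly 1000 MB/s, so x4 ≈ 4000 MB/s (a single NVMe device), x8 ≈ 8000 MB/s, and 40 lanes per Xeon socket ≈ 40,000 MB/s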

  • I first started down the road of finally coming to grips with all the different busses and lane types after reading this excellent LSI paper.  I omitted the SAS-2 figures from this article since modern systems use SAS-3 exclusively.
LSI SAS PCI Bottlenecks

 

Intel Processor & PCI connections