How scalable is my Nutanix cluster really?

In a previous post I showed a chart which plots concurrency [X-axis] against throughput (IOPS) on the Y-Axis.  Here is that plot again:

Experienced performance chart ogglers will notice the familiar pattern of Littles Law, whereby throughput (X) rises quickly as concurrency (N) is increased.  As we follow the chart to the right, the slope flattens out and we achieve a lower increase in throughput, even as we increase concurrency by the same amount at each stage.  The flattening of the curve is best understood as Amdahls Law.

Anyone who follows Dr. Neil Gunther and his Universal Scalability Law, will also recognize this curve.

The USL states that taking the values of concurrency and throughput as inputs, we can in fact calculate the scalability of the system.  Specifically we are able to calculate the key factors of contention and crosstalk – which limit absolute linear scalability and eventually result in less throughput as additional load is submitted even as the capacity of the system is saturated.

I was fortunate to find both a very useful tool, and an easy-to-read summary of the USL from the Vivid Cortex site.  Both were written by Baron Schwartz.  I encourage anyone interested in scalability to check out his paper.

Using his Excel spreadsheet, I was able to input the numbers from my test and derive values that determine scalability.

Taking the largest number (0.074%)  the “contention value” (i.e the impact we expect due to Amdahls law) as the limit to linear scaling – we can say that for this particular cluster, running this particular (simplistic/synthetic) workload the Nutanix cluster scales 99.926% linear.  Although I did not crank up the concurrency beyond 576, the model shows us that this cluster will start to degrade performance if we try to push concurrency beyond 600 or so.  Again, the USL model is for this particular workload – on this particular cluster.  Doubling the concurrency of the offered load to 1200 will only net us 500,000 IOPS according to the model.

The high linearity (99.926%) is expected. The workload is 100% read, and with the data-locality feature of Nutanix filesystem – we expect close to 100% scalability.

We will return to these measures of scalability in the future to look at more realistic workloads.

Here is the Excel Sheet with my data : VividCortex_USL_Worksheet_v1 You are here

Working with fio “distribution /pereto” parameter

The fio Pareto parameter allows us to create a workload, which references a very large dataset, but specify a hotspot for the access pattern.  Here’s an example using the same setup as the ILM experiment, but using a Pareto value of 0:8.  My fio file looks like this..

[global]
ioengine=libaio
direct=1
time_based
norandommap
random_distribution=pareto:0.8
The experiment shows that with the access pattern as a Pareto ratio 0:8, meaning 20% of the overall dataset is “hot” the ILM process happens much faster as the hotspot is smaller, and is identified faster than a 100% uniform random access pattern.  We would expect a similar shape for any sort of caching mechanism.

You are here. The art of HCI performance testing

At some point potential Hyper-converged infrastructure (HCI) users want to know – “How fast does this thing go?”.  The real question is “how do we measure that?”.

The simplest test is to run a single VM, with a single disk and issue a single IO at a time.  We see often see this sort of test in bake-offs, and such a test does answer an important question – “what’s the lowest possible response time I can expect from the storage”.

However, this test only gives a single data point.  Since nobody purchases a HCI cluster to run a single VM,  we also need to know what happens when multiple VMs are run at the same time.  This is a much more difficult test to conduct, and many end-users lack access and experience with tools that can give the full picture.

In the example below, the single VM, single vdisk, single IO result is at the very far left of the chart.  Since it’s impossible to read I will tell you that the result is about 2,500 IOPS at ~400 microseconds.  (in fact we know that if the IOPS are 2,500 the response time MUST be 400 microseconds 1/2,500 == .0004 seconds)

However with a single VM, the cluster is mostly idle, and has capacity to do much more work.   In this X-Ray test I add another worker VM doing the exact same workload pattern to every node in the cluster every 5 minutes.

By the time we reach the end of the test, the total IOPS have increased to around 600,000 and the response time only increased by an additional 400 microseconds.

In other words the cluster was able to achieve 240X the amount of work measured by the single VM on a single node with only a 2X increase in response time, which is still less than 1ms.

The overall result is counter-intuitive, because the rate of change in IOPS (240X) is way out of line with the increase in response time (2X).  The single VM test is using only a fraction of the cluster capacity.

When comparing  HCI clusters to traditional storage arrays – you should expect the traditional array to outperform the cluster at the far left of the chart, but as work scales up the latent capacity of the HCI cluster is able to provide huge amounts of IO with very low response times.

You can run this test yourself by adding this custom workload to X-Ray