How scalable is my Nutanix cluster really?

In a previous post I showed a chart that plots concurrency on the X-axis against throughput (IOPS) on the Y-axis.  Here is that plot again:

Experienced performance chart oglers will notice the familiar pattern of Little's Law, whereby throughput (X) rises quickly as concurrency (N) is increased.  As we follow the chart to the right, the slope flattens out and each equal step in concurrency buys a smaller increase in throughput.  The flattening of the curve is best understood as Amdahl's Law.

Anyone who follows Dr. Neil Gunther and his Universal Scalability Law (USL) will also recognize this curve.

The USL states that, taking the values of concurrency and throughput as inputs, we can in fact calculate the scalability of the system.  Specifically, we are able to calculate the key factors of contention and crosstalk.  Contention limits linear scalability, and crosstalk eventually results in less throughput as additional load is submitted once the capacity of the system is saturated.
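As a minimal sketch of the model itself, the USL is usually written as X(N) = λN / (1 + σ(N−1) + κN(N−1)), where σ is the contention coefficient and κ is the crosstalk (coherency) coefficient. The function names and the coefficient values below are hypothetical, chosen only to show the shape of the curve, and are not taken from my test.

```python
import math


def usl_throughput(n, lam, sigma, kappa):
    """Predicted throughput X(N) under the Universal Scalability Law.

    lam   : throughput of a single unit of concurrency (N = 1)
    sigma : contention coefficient
    kappa : crosstalk (coherency) coefficient
    """
    return lam * n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_concurrency(sigma, kappa):
    """Concurrency at which the USL curve peaks (only defined for kappa > 0)."""
    return math.sqrt((1.0 - sigma) / kappa)


if __name__ == "__main__":
    # Hypothetical coefficients, for illustration only.
    lam, sigma, kappa = 1000.0, 0.01, 0.0001
    for n in (1, 8, 32, 64, 128, 256):
        print(n, round(usl_throughput(n, lam, sigma, kappa)))
    print("peak near N =", round(usl_peak_concurrency(sigma, kappa)))
```

With κ = 0 the formula reduces to Amdahl's Law: the curve flattens but never turns downward. It is the κ term that makes throughput fall once concurrency passes the peak.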

I was fortunate to find both a very useful tool, and an easy-to-read summary of the USL from the Vivid Cortex site.  Both were written by Baron Schwartz.  I encourage anyone interested in scalability to check out his paper.

Using his Excel spreadsheet, I was able to input the numbers from my test and derive values that determine scalability.
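For readers who prefer code to a spreadsheet, here is a rough sketch of one way to estimate the USL coefficients from (concurrency, throughput) pairs. It is not necessarily what the worksheet does internally. It assumes the data includes a measurement at N = 1, uses that as λ, and fits y = κx² + (σ + κ)x by least squares, where x = N − 1 and y = N·λ/X(N) − 1. The function name `fit_usl` is my own.

```python
def fit_usl(points):
    """Estimate (lam, sigma, kappa) from a list of (concurrency, throughput) pairs.

    Assumption: the list contains a measurement at N = 1, which is used as lam.
    At least two further points with distinct N > 1 are needed for the fit.
    """
    by_n = dict(points)
    if 1 not in by_n:
        raise ValueError("need a measurement at concurrency 1")
    lam = by_n[1]

    # Accumulate the sums needed for a quadratic-through-origin regression.
    s_x2 = s_x3 = s_x4 = s_xy = s_x2y = 0.0
    for n, throughput in points:
        if n == 1:
            continue
        x = n - 1
        y = n * lam / throughput - 1.0
        s_x2 += x ** 2
        s_x3 += x ** 3
        s_x4 += x ** 4
        s_xy += x * y
        s_x2y += x ** 2 * y

    # Solve the 2x2 normal equations for y = a*x^2 + b*x.
    det = s_x4 * s_x2 - s_x3 ** 2
    if det == 0:
        raise ValueError("need at least two distinct concurrency levels above 1")
    a = (s_x2y * s_x2 - s_x3 * s_xy) / det
    b = (s_x4 * s_xy - s_x3 * s_x2y) / det

    kappa = a
    sigma = b - a
    return lam, sigma, kappa
```

Real measurements are noisy, so the fitted values are estimates that depend on how many concurrency levels were sampled.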

Taking the largest number (0.074%), the “contention value” (i.e. the impact we expect due to Amdahl's Law), as the limit to linear scaling, we can say that for this particular cluster, running this particular (simplistic/synthetic) workload, the Nutanix cluster scales 99.926% linearly.  Although I did not crank up the concurrency beyond 576, the model shows that this cluster will start to degrade in performance if we try to push concurrency beyond 600 or so.  Again, the USL model is for this particular workload on this particular cluster.  According to the model, doubling the concurrency of the offered load to 1200 will only net us 500,000 IOPS.
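The arithmetic behind those figures can be sketched as follows. The linearity figure is simply 100% minus the contention value. The crosstalk value is not quoted above, so the second function only back-calculates what κ would be implied by a peak near 600, using the standard USL peak formula N_peak = sqrt((1 − σ)/κ). Treat that result as a rough illustration rather than the worksheet's fitted value.

```python
def linearity_percent(contention_percent):
    """Linear-scaling figure implied by a contention value given in percent."""
    return 100.0 - contention_percent


def implied_kappa(contention_percent, peak_concurrency):
    """Crosstalk coefficient implied by a given peak, from N_peak = sqrt((1 - sigma) / kappa)."""
    sigma = contention_percent / 100.0
    return (1.0 - sigma) / peak_concurrency ** 2


if __name__ == "__main__":
    print(linearity_percent(0.074))   # contention value quoted above
    print(implied_kappa(0.074, 600))  # 600 is the approximate peak quoted above
```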

The high linearity (99.926%) is expected. The workload is 100% read, and with the data-locality feature of Nutanix filesystem – we expect close to 100% scalability.

We will return to these measures of scalability in the future to look at more realistic workloads.

Here is the Excel Sheet with my data : VividCortex_USL_Worksheet_v1


X-Ray scenario to demonstrate Nutanix ILM behavior.

Specifically, a customer wanted to see how performance changes (and how quickly) as data moves automatically from HDD to SSD when it is accessed.  The access pattern is 100% random across the entire disk.

In a hybrid Flash/HDD system, “cold” data (i.e. data that has not been accessed for a long time) is moved from SSD to HDD when the SSD capacity is exhausted.  At some point in the future that same data may become “hot” again, so we want to make sure that the “newly hot” data is quickly moved back to the SSD tier.  The duration of the above chart is around 5 minutes, and we see that by around the 3 minute mark the entire dataset is resident on the SSD tier.

This X-ray test uses a couple of neat tricks to demonstrate ILM behavior.

  • Edit container preferences to send sequential data immediately to HDD
  • Overwriting data with NUL/Zero bytes frees the underlying data on Nutanix filesystem

To demonstrate ILM from HDD to SSD (and ultimately into the DRAM cache on the CVM) we first have to ensure that we have data on the HDD in the first place.  By default Nutanix OS will always try to write new data to SSD.  To circumvent that behavior we can edit the container preferences.  We use the fact that the “prefill” will be a sequential workload, while the measured workload will be a random workload. To make the change, use “ncli” to change the “Sequential I/O Pri Order” to be HDD only.

In my case I happened to call my container “xray” since I didn’t want to change the default container. Now, when X-Ray executes the prefill stage, the data will land on HDD.

As a second requirement, we want to see what happens when IO with different block sizes is issued, so that we can get a chart similar to this:

To achieve the desired behavior, we need to make sure that at the beginning of each test the data again resides on HDD.  The problem is that the data is up-migrated during the test. To reset it, we first overwrite the entire disk with “NULL” bytes using the fio parameter “zero_buffers”.  This causes the data to be freed on the Nutanix filesystem.  Then we issue a normal profile with random data. Once the data is freed, we know that the new initial writes will go to HDD, because we edited the container to do so.

The overall test pattern looks like this:

  • Create and clone VMs
  • Prefill with random data (Data will reside on HDD due to container edit)
  • Read disk with 16KB block size
  • Zero out the disks – to remove/free the up-migrated data
  • Prefill the disks with random data
  • Read disk with 32KB block size
  • Zero out the disks
  • Prefill with random data
  • Read disk with 64KB block size
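The repeating zero/prefill/read cycle above can be sketched as a small generator. This is only an illustration of the sequence, not the X-Ray scenario definition itself. The function names, the device path and the 1MB write size in the fio argument list are hypothetical, although `--zero_buffers` is the fio option mentioned above.

```python
def build_test_plan(block_sizes_kb):
    """Return the ordered step descriptions for the given read block sizes (in KB)."""
    steps = ["Create and clone VMs"]
    for index, block_size in enumerate(block_sizes_kb):
        if index > 0:
            # Free the data that the previous read phase up-migrated to SSD.
            steps.append("Zero out the disks")
        steps.append("Prefill with random data")
        steps.append("Read disk with %dKB block size" % block_size)
    return steps


def fio_zero_pass_args(device):
    """Hypothetical fio command line for the zeroing pass over one device."""
    return [
        "fio",
        "--name=zero",
        "--filename=" + device,
        "--rw=write",       # sequential overwrite of the whole device
        "--bs=1m",          # assumed write size, not taken from the scenario
        "--zero_buffers",   # fill write buffers with zeros instead of random data
    ]


if __name__ == "__main__":
    for step in build_test_plan([16, 32, 64]):
        print(step)
    print(" ".join(fio_zero_pass_args("/dev/sdX")))
```

Adding another block size to the list simply appends one more zero/prefill/read cycle.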

I have uploaded this X-Ray test to GitHub : X-Ray Up-Migration test



Detecting and correcting hardware errors using Nutanix Filesystem.

It’s good to detect corrupted data.  It’s even better to transparently repair that data and return the correct data to the user.  Here we will demonstrate how Nutanix filesystem detects and corrects corruption.  Not all systems are created equal in this regard.  The topic of corruption detection and remedy was the focus of this excellent Usenix paper: Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions. The authors find that many systems that should in theory be able to recover corrupted data do not in fact do so.

Within the guest Virtual Machine

  • Start with a Linux VM and write a specific pattern (0xdeadbeef) to /dev/sdg using fio.
  • Check that the expected data is written to the virtual disk and generate a SHA1 checksum of the entire disk.
[root@gary]# od -x /dev/sdg

0000000 adde efbe adde efbe adde efbe adde efbe
[root@gary]# sha1sum /dev/sdg

1c763488abb6e1573aa011fd36e5a3e2a09d24b0  /dev/sdg
  • The “od” command shows us that the entire 1GB disk contains the pattern 0xdeadbeef
  • The “sha1sum” command creates a checksum (digest) based on the content of the entire disk.
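One detail that can look odd in the output above is that the pattern 0xdeadbeef is displayed as “adde efbe”. That is because “od -x” prints 16-bit words, and on a little-endian machine (which I am assuming for the guest) each pair of bytes appears swapped. The sketch below reproduces that display and computes a SHA1 digest of a file or device in chunks, roughly what “sha1sum” does. The function names are my own.

```python
import hashlib
import struct

PATTERN = bytes.fromhex("deadbeef")


def od_x_words(data):
    """Render bytes as little-endian 16-bit hex words, like the data columns of `od -x`."""
    count = len(data) // 2
    words = struct.unpack("<%dH" % count, data[: count * 2])
    return " ".join("%04x" % word for word in words)


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA1 digest of a file or block device, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Four repetitions of the pattern correspond to one 16-byte line of od output.
    print(od_x_words(PATTERN * 4))
```

Because the digest covers every byte of the disk, any change to the data returned by a read would produce a different value.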

Within the Nutanix CVM

  • Connect to the Nutanix CVM
    • Locate one of the 4MB egroups that back this virtual disk on the node.
    • The virtual disk which belongs to the guest vm (/dev/sdg) is represented in the Nutanix cluster as a series of “Egroups” within the Nutanix filesystem.
    • Using some knowledge of the internals I can locate the Egroups which make up the vDisk seen by the guest.
    • Double check that this is indeed an Egroup belonging to my vDisk by checking that it contains the expected pattern (0xdeadbeef)
nutanix@NTNX $ od -x 10808705.egroup

0000000 adde efbe adde efbe adde efbe adde efbe
  • Now simulate a hardware failure and overwrite the egroup with null data
    • I do this by reaching underneath the cluster filesystem and deliberately creating corruption, simulating a mis-directed write somewhere in the system.
    • If the system does not correct this situation, the user VM will not read 0xdeadbeef as it expects – remember the corruption happened outside of the user VM itself.
nutanix@ $ dd if=/dev/zero of=10846352.egroup bs=1024k count=4
  • Use the “dd” command to overwrite the entire 4MB Egroup with /dev/zero (the NULL character).

Back to the client VM

  • We can tell if the correct results are returned by checking the checksums match the pre-corrupted values.
[root@gary-tpc tmp]# sha1sum /dev/sdg

1c763488abb6e1573aa011fd36e5a3e2a09d24b0  /dev/sdg   <-- Same SHA1 digest as the "pre corruption" state.
  • The checksum matches the original value, showing that the data in the entire vdisk is unchanged
  • However we did change the vdisk by overwriting one of the Egroups.
  • The system has somehow detected and repaired the corruption which I induced.
  • How?

Magic revealed

  • Nutanix keeps the checksums at an 8KB granularity as part of our distributed metadata.  The system performs the following actions
    • Detects that the checksums stored in metadata no longer match the data on disk.
      • The stored checksums were generated against “0xdeadbeef”
      • The checksums generated during the read are computed against <NULL>
      • The checksums will not match and corrective action is taken.
    • Nutanix OS
      • Finds the corresponding  un-corrupted Egroup on another node
      • Copies the uncorrupted Egroup to a new egroup on the local node
      • Fixes the metadata to point to the new fixed copy
      • Removes the corrupted Egroup
      • Returns the uncorrupted data to the user
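The detect-and-repair sequence above can be illustrated with a toy model. This is emphatically not the Nutanix implementation. The class name, the two-replica layout and the use of CRC32 as the checksum are all assumptions made for the sketch. Only the 8KB checksum granularity comes from the description above.

```python
import zlib

SLICE_SIZE = 8 * 1024  # checksum granularity described above


def slice_checksums(data):
    """Checksum each 8KB slice (CRC32 is an assumption, used for illustration only)."""
    return [zlib.crc32(data[i:i + SLICE_SIZE]) for i in range(0, len(data), SLICE_SIZE)]


class ToyStore:
    """Toy model: checksums kept as metadata, data kept as a local and a remote replica."""

    def __init__(self, data):
        self.checksums = slice_checksums(data)
        self.replicas = {"local": bytearray(data), "remote": bytearray(data)}
        self.repairs = 0

    def _is_intact(self, name):
        return slice_checksums(bytes(self.replicas[name])) == self.checksums

    def read(self):
        """Return the data, repairing the local replica first if its checksums mismatch."""
        if not self._is_intact("local"):
            for name in self.replicas:
                if name != "local" and self._is_intact(name):
                    # Replace the corrupt local copy with a copy of the good replica.
                    self.replicas["local"] = bytearray(self.replicas[name])
                    self.repairs += 1
                    break
            else:
                raise IOError("no intact replica available")
        return bytes(self.replicas["local"])


if __name__ == "__main__":
    original = bytes.fromhex("deadbeef") * (64 * 1024 // 4)
    store = ToyStore(original)
    store.replicas["local"] = bytearray(len(original))  # simulate the dd from /dev/zero
    print(store.read() == original, store.repairs)
```

The toy overwrites the local copy in place. The real system, as described above, writes a new Egroup, repoints the metadata and then removes the corrupt one.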

Logs from the Nutanix VM

Here are the logs from Nutanix:  notice group 10846352 is the one that we deliberately corrupted earlier

E0315 13:22:37.826596 12085] Marking extent group 10846352 as corrupt reason: kSliceChecksumMismatch

I0315 13:22:37.826755 12083] vdisk_id=10808407 operation_id=387450 Starting fixer op on extent group 10846352 reason -1 reconstruction mode 0 (gflag 0)  corrupt replica autofix mode  (gflag auto)  consistency checks 0 start erasure overwrite fixer 0

I0315 13:22:37.829532 12086] vdisk_id=10808407 operation_id=387449 Not considering corrupt replica 38 of egroup 10846352
  • Data corruption can and does happen (see the above Usenix paper for some of the causes).  When designing enterprise storage we have to deal with it.
  • Nutanix not only detects the corruption, it corrects it.
  • In fact Nutanix OS continually scans the data stored on the cluster and makes sure that the stored data matches the expected checksums.