You are here. The art of HCI performance testing

At some point potential Hyper-converged infrastructure (HCI) users want to know – “How fast does this thing go?”.  The real question is “how do we measure that?”.

The simplest test is to run a single VM, with a single disk and issue a single IO at a time.  We see often see this sort of test in bake-offs, and such a test does answer an important question – “what’s the lowest possible response time I can expect from the storage”.

However, this test only gives a single data point.  Since nobody purchases a HCI cluster to run a single VM,  we also need to know what happens when multiple VMs are run at the same time.  This is a much more difficult test to conduct, and many end-users lack access and experience with tools that can give the full picture.

In the example below, the single VM, single vdisk, single IO result is at the very far left of the chart.  Since it’s impossible to read I will tell you that the result is about 2,500 IOPS at ~400 microseconds.  (in fact we know that if the IOPS are 2,500 the response time MUST be 400 microseconds 1/2,500 == .0004 seconds)

However with a single VM, the cluster is mostly idle, and has capacity to do much more work.   In this X-Ray test I add another worker VM doing the exact same workload pattern to every node in the cluster every 5 minutes.

By the time we reach the end of the test, the total IOPS have increased to around 600,000 and the response time only increased by an additional 400 microseconds.

In other words the cluster was able to achieve 240X the amount of work measured by the single VM on a single node with only a 2X increase in response time, which is still less than 1ms.


The overall result is counter-intuitive, because the rate of change in IOPS (240X) is way out of line with the increase in response time (2X).  The single VM test is using only a fraction of the cluster capacity.

When comparing  HCI clusters to traditional storage arrays – you should expect the traditional array to outperform the cluster at the far left of the chart, but as work scales up the latent capacity of the HCI cluster is able to provide huge amounts of IO with very low response times.

You can run this test yourself by adding this custom workload to X-Ray

X-Ray scenario to demonstrate Nutanix ILM behavior.

Specifically a customer wanted to see how performance changes (and how quickly) as data moves from HDD to SSD automatically as data is accessed.  The access pattern is 100% random across the entire disk.

In a hybrid Flash/HDD system – “cold” data (i.e. data that has not been accessed for a long time) is moved from SSD to HDD when the SSD capacity is exhausted.  At some point in the future – that same data may become “hot” again, and so we want to make sure that the “newly hot” data is quickly moved back to the SSD tier.  The duration of the above chart is around 5 minutes – and we see that by, around the 3 minute mark the entire dataset is resident on the SSD tier.

This X-ray test uses a couple of neat tricks to demonstrate ILM behavior.

  • Edit container preferences to send sequential data immediately to HDD
  • Overwriting data with NUL/Zero bytes frees the underlying data on Nutanix filesystem

To demonstrate ILM from HDD to SSD (ad ultimately into the DRAM cache on the CVM) we first have to ensure that we have data on the HDD in the first place.  By default Nutanix OS will always try to write new data to SSD.  To circumvent that behavior we can edit the container preferences.  We use the fact that the “prefill” will be a sequential workload, while the measured workload will be a random workload. To make the change, use “ncli” to change the ” Sequential I/O Pri Order” to be HDD only.



In my case I happened to call my container “xray” since I didn’t want to change the default container. Now, when X-Ray executes the prefill stage, the data will land on HDD. As a second requirement, we want to see what happens when IO with different size blocks are issued so that we can get a chart similar to this: To achieve the desired behavior, we need to make sure that, at the beginning of each test, the data, again resides on HDD.  The problem is that the data is up-migrated during the test. To do this we do an initial overwrite of the entire disk with “NULL” bytes using  a parameter in fio “zero_buffers”.  This causes the data to be freed on the Nutanix filesystem.  Then we issue a normal profile with random data. Once the data is freed, then we know that the new initial writes will go to HDD – because we edited the container to do so. The overall test pattern looks like this

  • Create and clone VMs
  • Prefill with random data (Data will reside on HDD due to container edit)
  • Read disk with 16KB block size
  • Zero out the disks – to remove/free the up-migrated data
  • Prefill the disks with Radom data
  • Read disk with 32KB block size
  • Zero out the disks
  • Prefill with random daa
  • Read disk with 64KB block size

I have uploaded this x-ray test to GitHub : X-Ray Up-Migration test

 

 

HCI Performance testing made easy (Part 4)

What happens when power is lost to all nodes of a HCI Cluster?

Ever wondered what happens when all power is simultaneously lost on a HCI cluster?  One of the core principles of cloud design is that components are expected to fail, but the cluster as a whole should stay “up”.   We wanted to see what happens when all components fail at once, so we designed an X-Ray test to do exactly that.

We start an OLTP workload on every node in the cluster, then X-Ray connects to the IPMI port on each node, and powers off all the hosts while the cluster is under load.  In particular, the cluster is under read/write load (we need write workload, because we want to force the cluster to recover in-flight writes).

After power-off, we wait 10 seconds for everything to spin down, then immediately re-apply the power by connecting to the IPMI ports.

The nodes power up, and immediately start their POST (Power On Self Test) and boot the hypervisor.  The CVM will auto-start, discover the available nodes and form the cluster.

X-Ray polls the cluster manager (either Prism or vCenter) to determine that the cluster is “up” and then restarts the OLTP workload.

Our testing showed that our Nutanix cluster completed POST, and was ready to restart work in around 10 minutes.  Moreover, the time to achieve the recovery had very little variability. The chart below shows three separate runs on the same cluster.

This is the YAML file which defines the workload.  The full specification is on github.  The key part of the YAML is the nodes.PowerOff which connects to the IMPI ports of each node and vm_group.WaitForPowerOn which connects to either Nutanix Prism or vmware vCenter and determines that the cluster is formed, and ready to accept new work.

HCI Performance testing made easy (Part 3)

Creating a HCI benchmark to simulate multi-tennent workloads

 

 

HCI deployments are typically multi-tennant and often different nodes will support different types of workloads. It is very common to have large resource-hungry databases separated across nodes using anti-affinity rules.  As with traditional storage, applications are writing to a shared storage environment which is necessary to support VM movement.  It is the shared storage that often causes performance issues for data bases which are otherwise separated across nodes.  We call this the noisy neighbor problem.  A particular problem occurs when a reporting / analytical workload shares storage with a transactional workload.

In such a case we have a Bandwidth heavy workload profile (reporting) sharing with a Latency Sensitive workload (transactional)

In the past it has been difficult to measure the noisy neighbor impact without going to the trouble of configuring the entire DB stack, and finding some way to drive it.  However in X-Ray we can do exactly this sort of workload.  We supply a pre-configured scenario which we call the  DB Colocation test.

The DB Colocation test utilizes two properties of X-Ray not found in other benchmarking tools

  • Time based benchmark actions
  • Distinct per-VM workload patterns
  • Ability to provision particular workloads, to particular hosts

In our example scenario X-Ray begins by starting a workload modeled after a transactional DB (we call this the OLTP workload) on one of the nodes.  This workload runs for 60 minutes.  Then after 30 minutes X-Ray starts workloads modeled after reporting/analytical workloads on two other nodes (we call this the DSS workload).

After 30 minutes we have three independent workloads running on three independent nodes, but sharing the same storage.  The key thing to observe is the impact on the latency sensitive (OLTP) workload.  In this experiment it is the DSS workloads which are the noisy neighbor, since they will tend to utilize a lot of the storage bandwidth.  An ideal result is one where there is very little interference with the running OLTP workload, even though we expect latency to increase.  We can compare the impact on the OLTP workload by comparing the IOPS/response time during the first 30 minutes (no interference) with the remaining 60 minutes (after the DSS workloads are started).  We should expect to see some increase in response time from the OLTP application because the other nodes in the cluster have gone from idle to under-load.  The key thing to observe is whether the OLTP IOP target rate (4,000 IOPS) is achieved when the reporting workload is applied.

 

X-Ray Scenario configuration

We specify the timing rules and workloads in the test.yml file.  You can modify this to contain whichever values suit your model.  I covered editing an existing workload in Part 1.

The overall scenario begins with the OLTP workload, which will run for 3600 seconds (1 hour).  The stagger_secs value is used if there are multiple OLTP sub-workloads.  In the simple case we do use a single OLTP workload.

The scenario pauses for 1800 seconds using the test.wait specification then immediately starts the DSS workload

Finally the scenario uses the workload.Wait specification to wait for the OLTP workload to finish (approx 1 hour) before the test is deemed completed.

X-Ray Workload specification

The DB Co-Location test uses two workload profiles that aim to simulate transactional (OLTP) and reporting/analytical (DSS) workloads.  The specifications for those workloads are contained in the two .fio files (oltp.fio and dss.fio)

OLTP


The OLTP workload (oltp.fio) that we ship as  has the following characteristcs based on typical configurations that we see in the field (of course you can change these to whatever you like).

  • Target IOP rate of 4,000 IOPS
  • 4 “Data” Disks
    • 50/50 read/write ratio.
    • 90% 8KB, 10% 32KB bloc-ksize
    • 8 outstanding IO per disk
  • 2 “Log” Disks
    • 100% write
    • 90% sequential
    • 32k block-size
    • 1 outstanding IO per disk

The idea here is to simulate the two main storage workloads of a DB.  The “data” portion and the “log” portion.  Log writes are just used to commit transactions and so are 100% write.  The only time the logs are read is during DB recovery, which is not part of this scenario.  The “Data” disks are doing both reads (from DB cache misses) and writes committed transactions.  A 50/50 read/write mix might be considered too write intensive – but we wanted to stress the storage in this scenario.

DSS


The DSS workload is configured to have the following characteristics

  • Target IOP rate of 1400 IOPS
  • 4 “Data” Disks
    • 100% Read workload with 1MB blocksize
    • 10 Outstanding IOs
  • 2 “Log” Disk
    • 100% Write workload
    • 90% sequential
    •  32K block-size
    • 1 outstanding IO per disk

The idea here is to simulate a large database doing a lot of reads across a large workingset size.  The IO to the data disks is entirely read, and uses large blocks to simulate a database scanning a lot of records.  The “Log” disks have a very light workload, purely to simulate an active database which is probably updating a few tables used for housekeeping.

 

 

Detecting and correcting hardware errors using Nutanix Filesystem.

It’s good to detect corrupted data.  It’s even better to transparently repair that data and return the correct data to the user.  Here we will demonstrate how Nutanix filesystem detects and corrects corruption.  Not all systems are made equally in this regard.  The topic of corruption detection and remedy was the focus of this excellent Usenix paper Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions. The authors find that many systems that should in theory be able to recover corrupted data do not in fact do so.

Within the guest Virtual Machine

  • Start with a Linux VM and write a specific pattern (0xdeadbeef) to /dev/sdg using fio.
  • Check that the expected data is written to the virtual disk and generate a SHA1 checksum of the entire disk.
[root@gary]# od -x /dev/sdg

0000000 adde efbe adde efbe adde efbe adde efbe
*
10000000000
[root@gary]# sha1sum /dev/sdg

1c763488abb6e1573aa011fd36e5a3e2a09d24b0  /dev/sdg
  • The “od” command shows us that the entire 1GB disk contains the pattern 0xdeadbeef
  • The “sha1sum” command creates a checksum (digest) based on the content of the entire disk.

Within the Nutanix CVM

  • Connect to the Nutanix CVM
    • Locate one of the 4MB egroups that back this virtual disk on the node.
    • The virtual disk which belongs to the guest vm (/dev/sdg) is represented in the Nutanix cluster as a series of “Egroups” within the Nutanix filesystem.
    • Using some knowledge of the internals I can locate the Egroups which make up the vDisk seen by the guest.
    • Double check that this is indeed an Egroup belonging to my vDisk by checking that it contains the expected pattern (0xdeadbeef)
nutanix@NTNX $ od -x 10808705.egroup

0000000 adde efbe adde efbe adde efbe adde efbe
*
20000000
  • Now simulate a hardware failure and overwrite the egroup with null data
    • I do this by reaching underneath the cluster filesystem and deliberately creating corruption, simulating a mis-directed write somewhere in the system.
    • If the system does not correct this situation, the user VM will not read 0xdeadbeef as it expects – remember the corruption happened outside of the user VM itself.
nutanix@ $ dd if=/dev/zero of=10846352.egroup bs=1024k count=4
  • Use the “dd” command to overwrite the entire 4MB Egroup with /dev/zero (the NULL character).

Back to the client VM

  • We can tell if the correct results are returned by checking the checksums match the pre-corrupted values.
[root@gary-tpc tmp]# sha1sum /dev/sdg

1c763488abb6e1573aa011fd36e5a3e2a09d24b0  /dev/sdg. <— Same SHA1 digest as the "pre corruption" state.
  • The checksum matches the original value – showing that the data in entire the vdisk is unchanged
  • However we did change the vdisk by overwriting one of the. Egroups.
  • The system has somehow detected and repaired the corruption which I induced.
  • How?

Magic revealed

  • Nutanix keeps the checksums at an 8KB granularity as part of our distributed metadata.  The system performs the following actions
    • Detects that the checksums stored in metadata no longer match the data on disk.
      • The stored checksums match were generated against “0xdeadbeef”
      • The checksums generated during read be generated against <NULL>
      • The checksums will not match and corrective action is taken.
    • Nutanix OS
      • Finds the corresponding  un-corrupted Egroup on another node
      • Copies the uncorrupted Egroup to a new egroup on the local node
      • Fixes the metadata to point to the new fixed copy
      • Removed corrupted egroup
      • Returns the uncorrupted data to the user

Logs from the Nutanix VM

Here are the logs from Nutanix:  notice group 10846352 is the one that we deliberately corrupted earlier

E0315 13:22:37.826596 12085 disk_WAL.cc:1684] Marking extent group 10846352 as corrupt reason: kSliceChecksumMismatch

I0315 13:22:37.826755 12083 vdisk_micro_egroup_fixer_op.cc:156] vdisk_id=10808407 operation_id=387450 Starting fixer op on extent group 10846352 reason -1 reconstruction mode 0 (gflag 0)  corrupt replica autofix mode  (gflag auto)  consistency checks 0 start erasure overwrite fixer 0

I0315 13:22:37.829532 12086 vdisk_micro_egroup_fixer_op.cc:801] vdisk_id=10808407 operation_id=387449 Not considering corrupt replica 38 of egroup 10846352
  • Data corruption can and does happen (see the above Usenix paper for some of the causes).  When designing enterprise storage we have to deal with it
  • Nutanix not only detects the corruption, it corrects it.
  • In fact Nutanix OS continually scans the data stored on the cluster and makes sure that the stored data matches the expected checksums.

SQL*Server on Nutanix. Force backups to HDD.

As an experiment, I wanted to (a) Create a HDD only container, and (b) measure the bandwidth I could achieve when backing up the SQL DB.  This was performed on a standard hybrid platform with only 4 HDD’s in the node.

First create a container, but add the special options “sequential-io-priority-order=DAS-SATA random-io-priority-order=DAS-SATA” which means that all IO will be directed to the HDD only. This also means that data on this container will never be migrated up. This is just fine for a backup that will hopefully never be read, and if it is – only once, sequentially.

ncli> ctr create name=cold-only sequential-io-priority-order=DAS-SATA random-io-priority-order=DAS-SATA sp-name=all
ncli> datastore create name=cold ctr-name=cold-only

Next create a vDisk in that container – this disk will contain the SQL Server backup data

Add vdisk to the cold-only container.
Add vdisk to the cold-only container.

Format and initialize the drive.

Format the drive to hold SQL backup.
Format the drive to hold SQL backup.

Add backup targets to the drive. Adding multiple targets increases throughput because SQL Server will generate 1-2 outstanding IO’s per target. I created 16 total targets (these are just files).

SQL Backup targets

The first backup is a little slow (~64MB/s), because we’re creating the files. A second (and subsequent) backups go faster, around  120 MB/s writing directly to the HDD spindles on a single node with 4 HDDs.

Overwrite old backups

This backup stream drives around 25MB/s per HDD spindle on the Nutanix node.  On a larger platform with more spindles – we could easily drive 500MB/s, and still skip SSD by writing directly to HDD.

25MB/s per spindle

120 MB/s Each way
Backup just started. About 115MB/s read, 115MB/s write on same node.

Completed backup:

Backup complete

Lord Kelvin Vs the IO blender

One of the characteristics of a  successful storage system for virtualized environments is that it must handle the IO blender.  Put simply, when lots of regular looking workloads are virtualized and presented to the storage, their regularity is lost, and the resulting IO stream starts to look more and more random.

 This is very similar to the way that synthesisers work – they take multiple regular sine waves of varying frequencies and add them together to get a much more complex sound.

 http://msp.ucsd.edu/techniques/v0.11/book-html/node14.html#fig01.08

That’s all pretty awesome for making cool space noises, but not so much when presented to the storage OS.  Without the ability to detect regularity, things like caching, pre-fetching and any kind of predictive algorithm break down.

That pre-fetch is never going to happen.

In Nutanix NOS we treat each of these sine waves (workloads) individually, never letting them get mixed together.  NDFS knows about vmdk’s or vhdx disks – and so by keeping the regular workloads separate we can still apply all the usual techniques to keep the bits flowing, even at high loads and disparate workload mixes that cause normal storage systems to fall over in a steaming heap of cache misses and metadata chaos.

 

Designing a scaleout storage platform.

I was speaking to one of our developers the other day, and he pointed me to the following paper:  SEDA: An Architecture for Well-Conditioned, Scalable Internet Services as an example of the general philosophy behind the design of the Nutanix Distributed File System (NDFS).

Although the paper uses examples of both a webserver and a gnutella client, the philosophies are relevant to a large scale distributed filesystem.  In the case of NDFS we are serving disk blocks to clients who happen to be virtual machines.  One trade-off that is true in both cases is that scalability is traded for low latency in the single-stream case.  However at load, the response time is generally better than a system that is designed to low-latency, and then attempted to scale-up to achive high throughput.

At Nutanix we often talk about web-scale architectures, and this paper gives a pretty solid idea of what that might mean in concrete terms.

FWIW., according to google scholar, the paper has been cited 937 times, including Cassandra which is how we store filesystem meta-data in a distributed fashion.