HCI Performance testing made easy (Part 1)

In this short series I will describe how to perform performance an resiliency tests on a HCI cluster using X-ray.

X-Ray can do the following for the performance tester.

  • Model IO workloads using standard fio format
  • Create VMs based on user-specified criteria (CPUs, Memory, Number & Size of disks)
  • Provision the VMs  to a HCI cluster (Nutanix AHV, ESXi, Hyper-V)
  • Execute the workloads
  • Display and store the results

In particular X-Ray give the additional benefits that most workload generators do not

  • Specify and deploy workloads with different IO patterns and characteristics
    • Most workload generators create a uniform workload on all workers
  • Execute and terminate sub-workloads on a user-specified timeline
    • e.g. Begin workload 1 then introduce workload 2 and measure the interference
  • Introduce failure scenarios and measure the impact to performance

Here’s a video of X-Ray in action, I export an existing X-Ray test, edit it to create a new test, upload and execute the test.

 

The files are in my X-ray GitHub

Storage Bus Speeds 2018

Storage bus speeds with example storage endpoints.

Bus Lanes End-Point Theoretical Bandwidth (MB/s) Note
SAS-3 1 HBA <-> Single SATA Drive 600 SAS3<->SATA 6Gbit
SAS-3 1 HBA <-> Single SAS Drive 1200 SAS3<->SAS3 12Gbit
SAS-3 4 HBA <-> SAS/SATA Fanout 4800 4 Lane HBA to Breakout (6 SSD)[2]
SAS-3 8 HBA <-> SAS/SATA Fanout 8400 8 Lane HBA to Breakout (12 SSD)[1]
PCIe-3 1 N/A 1000 Single Lane PCIe3
PCIe-3 4 PCIe <-> SAS HBA or NVMe 4000 Enough for Single NVMe
PCIe-3 8 PICe <-> SAS HBA or NVMe 8000 Enough for SAS-3 4 Lanes
PCIe-3 40 PCIe Bus <-> Processor Socket 40000 Xeon Direct conect to PCIe Bus

 

Notes


All figures here are the theoretical maximums for the busses using rough/easy calculations for bits/s<->bytes/s. Enough to figure out where the throughput bottlenecks are likely to be in a storage system.

  1. SATA devices contain a single SAS/SATA port (connection), and even when they are connected to a SAS3 HBA, the SATA protocol limits each SSD device to ~600MB/s (single port, 6Gbit)
  2. SAS devices may be dual ported (two connections to the device from the HBA(s)) – each with a 12Gbit connection giving a potential bandwidth of 2x12Gbit == 2.4Gbyte/s (roughly) per SSD device.
  3. An NVMe device directly attached to the PCIe bus has access to a bandwidth of 4GB/s by using 4 PCIe lanes – or 8GB/s using 8 PCIe lanes.  On current Xeon processors, a single socket attaches to 40 PCIe lanes directly (see diagram below) for a total bandwidth of 40GB/s per socket.

  • I first started down the road of finally coming to grips with all the different busses and lane types after reading this excellent LSI paper.  I omitted the SAS-2 figures from this article since modern systems use SAS-3 exclusively.
LSI SAS PCI Bottlenecks

 

Intel Processor & PCI connections



Detecting and correcting hardware errors using Nutanix Filesystem.

It’s good to detect corrupted data.  It’s even better to transparently repair that data and return the correct data to the user.  Here we will demonstrate how Nutanix filesystem detects and corrects corruption.  Not all systems are made equally in this regard.  The topic of corruption detection and remedy was the focus of this excellent Usenix paper Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions. The authors find that many systems that should in theory be able to recover corrupted data do not in fact do so.

Within the guest Virtual Machine

  • Start with a Linux VM and write a specific pattern (0xdeadbeef) to /dev/sdg using fio.
  • Check that the expected data is written to the virtual disk and generate a SHA1 checksum of the entire disk.
[root@gary]# od -x /dev/sdg

0000000 adde efbe adde efbe adde efbe adde efbe
*
10000000000
[root@gary]# sha1sum /dev/sdg

1c763488abb6e1573aa011fd36e5a3e2a09d24b0  /dev/sdg
  • The “od” command shows us that the entire 1GB disk contains the pattern 0xdeadbeef
  • The “sha1sum” command creates a checksum (digest) based on the content of the entire disk.

Within the Nutanix CVM

  • Connect to the Nutanix CVM
    • Locate one of the 4MB egroups that back this virtual disk on the node.
    • The virtual disk which belongs to the guest vm (/dev/sdg) is represented in the Nutanix cluster as a series of “Egroups” within the Nutanix filesystem.
    • Using some knowledge of the internals I can locate the Egroups which make up the vDisk seen by the guest.
    • Double check that this is indeed an Egroup belonging to my vDisk by checking that it contains the expected pattern (0xdeadbeef)
nutanix@NTNX $ od -x 10808705.egroup

0000000 adde efbe adde efbe adde efbe adde efbe
*
20000000
  • Now simulate a hardware failure and overwrite the egroup with null data
    • I do this by reaching underneath the cluster filesystem and deliberately creating corruption, simulating a mis-directed write somewhere in the system.
    • If the system does not correct this situation, the user VM will not read 0xdeadbeef as it expects – remember the corruption happened outside of the user VM itself.
nutanix@ $ dd if=/dev/zero of=10846352.egroup bs=1024k count=4
  • Use the “dd” command to overwrite the entire 4MB Egroup with /dev/zero (the NULL character).

Back to the client VM

  • We can tell if the correct results are returned by checking the checksums match the pre-corrupted values.
[root@gary-tpc tmp]# sha1sum /dev/sdg

1c763488abb6e1573aa011fd36e5a3e2a09d24b0  /dev/sdg. <— Same SHA1 digest as the "pre corruption" state.
  • The checksum matches the original value – showing that the data in entire the vdisk is unchanged
  • However we did change the vdisk by overwriting one of the. Egroups.
  • The system has somehow detected and repaired the corruption which I induced.
  • How?

Magic revealed

  • Nutanix keeps the checksums at an 8KB granularity as part of our distributed metadata.  The system performs the following actions
    • Detects that the checksums stored in metadata no longer match the data on disk.
      • The stored checksums match were generated against “0xdeadbeef”
      • The checksums generated during read be generated against <NULL>
      • The checksums will not match and corrective action is taken.
    • Nutanix OS
      • Finds the corresponding  un-corrupted Egroup on another node
      • Copies the uncorrupted Egroup to a new egroup on the local node
      • Fixes the metadata to point to the new fixed copy
      • Removed corrupted egroup
      • Returns the uncorrupted data to the user

Logs from the Nutanix VM

Here are the logs from Nutanix:  notice group 10846352 is the one that we deliberately corrupted earlier

E0315 13:22:37.826596 12085 disk_WAL.cc:1684] Marking extent group 10846352 as corrupt reason: kSliceChecksumMismatch

I0315 13:22:37.826755 12083 vdisk_micro_egroup_fixer_op.cc:156] vdisk_id=10808407 operation_id=387450 Starting fixer op on extent group 10846352 reason -1 reconstruction mode 0 (gflag 0)  corrupt replica autofix mode  (gflag auto)  consistency checks 0 start erasure overwrite fixer 0

I0315 13:22:37.829532 12086 vdisk_micro_egroup_fixer_op.cc:801] vdisk_id=10808407 operation_id=387449 Not considering corrupt replica 38 of egroup 10846352
  • Data corruption can and does happen (see the above Usenix paper for some of the causes).  When designing enterprise storage we have to deal with it
  • Nutanix not only detects the corruption, it corrects it.
  • In fact Nutanix OS continually scans the data stored on the cluster and makes sure that the stored data matches the expected checksums.