Hyperconverged File Systems, Part 1: Taxonomy

One way of categorizing hyperconverged filesystems (or any filesystem, really) is by how data is distributed across the nodes and the method used to track and retrieve that data. The following is based on knowledge of the internals of Nutanix and publicly available information for the other systems.

Metadata            Characteristics                                               Implemented by
Distributed         Distributed data & metadata                                   Nutanix
Hash                Random data distribution, hash lookup (object store)          VSAN
Dedupe              Data stored in HA pairs, lookup by fingerprint                Simplivity
Dedupe              Random data distribution, lookup by fingerprint               Springpath/Hyperflex
Pseudo Distributed  Data stored in HA pairs, unified namespace via redirection    NetApp C-Mode

Nutanix uses a fully distributed metadata layer that allows the cluster to decide where to place data depending on the location of the VM accessing it, and the data can move around to follow the VM. The Nutanix filesystem draws many of its ideas from distributed-systems research and practice, rather than taking a classic filesystem approach and applying it to HCI.

Creating compressible data with fio.


Today I used fio to create some compressible data to test on my Nutanix nodes.  I ended up using the following fio params to get what I wanted.


buffer_compress_percentage=50
refill_buffers
buffer_pattern=0xdeadbeef
  • buffer_compress_percentage does what you'd expect and specifies how compressible the data should be
  • refill_buffers is required to make the compress percentage do what you'd expect in aggregate. In other words, I want the entire file to be compressible by the buffer_compress_percentage amount
  • buffer_pattern is the big one. Without setting a pattern, fio uses null bytes to achieve compressibility, and Nutanix, like many other storage vendors, suppresses runs of zeros, so the data reduction would come mostly from zero suppression rather than from compression. A full job-file sketch follows this list.
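
Putting those options together, here is a minimal job-file sketch for generating a compressible test file. Only the three buffer options above come from my testing; the block size, total size, and filename are placeholder values, so adjust them for your environment.

# Sketch only: bs, size and filename below are placeholders.
[global]
ioengine=libaio
direct=1
bs=1m
size=10g
refill_buffers
buffer_compress_percentage=50
buffer_pattern=0xdeadbeef

[write-compressible]
rw=write
filename=/path/to/testfile

A quick sanity check is to gzip a slice of the resulting file and confirm it shrinks by roughly the expected amount before relying on what the storage layer reports.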

Much of this is well explained in the README for the latest version of fio.

Also note: older versions of fio do not support many of these data-creation flags, but will not alert you to the fact that they are being ignored. I spent quite a bit of time wondering why my data was not compressed, until I downloaded and compiled the latest fio.


SATA on Nutanix. Some experimental data.

The question of why Nutanix uses SATA drives comes up sometimes, especially from customers who have experienced very poor performance using SATA on traditional arrays.

I can understand this anxiety. In my time at NetApp we exclusively used SAS or FC-AL drives for performance test work. At the time there was a huge difference in performance between SCSI and SATA. Even a few short years ago, FC typically spun at 15K RPM whereas SATA was stuck at around 5K RPM, so it suffered roughly 3X the rotational delay.

These days SAS and SATA are both available in 7,200 RPM configurations, and these are the type we use in standard Nutanix nodes. In fact, the SATA drives that we use are marketed by Seagate as "Nearline SAS", or NL-SAS, mainly to differentiate them from the consumer-grade SATA drives found in cheap laptops. There are hundreds of SAS vs. SATA articles on the web, so I won't go over the theoretical and historical arguments.

SATA in Hybrid/Tiered Storage

In a Nutanix cluster the "heavy lifting" of IO is mainly done by the SSDs, leaving the SATA drives to service the few remaining IOs that miss the SSD tier. Under moderate load the SATA spindles do pretty well, and since SATA $/GB is only about 60% that of SAS, SATA seems like a good choice for mostly-cold data.

Let’s Experiment.

From a performance perspective, I decided to run a few experiments to see just how well SATA performs. In the test, the SATA drives are Nutanix standard drives, the Seagate ST91000640NS (priced around $150, 7,200 RPM). The comparable SAS drives are the same form factor (2.5 inch), the Toshiba AL13SEB900 (priced at about $250 USD), which spin at 10K RPM. Both drives hold around 1TB.

There are three experiments per drive type to reveal the impact of seek times. This is achieved using the "filesize" parameter of fio, which determines the LBA range to read. One thing to note is that I use a queue depth of one, so IOPS can be calculated as simply 1/response-time (converted to seconds); for example, a 4 ms average response time corresponds to 1/0.004 = 250 IOPS.

[global]
bs=8k
[randread]
rw=randread
iodepth=1
ioengine=libaio
time_based
runtime=10
direct=1
filesize=1g
filename=/dev/sdf1
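
To cover the three working-set sizes I simply varied filesize (1g, 100g, 1000g). The same job can also be run straight from the command line, which makes the sweep easy to script; the invocation below is just a sketch of that approach, shown here for the 100g case.

fio --name=randread --rw=randread --bs=8k --iodepth=1 --ioengine=libaio \
    --direct=1 --time_based --runtime=10 --filesize=100g --filename=/dev/sdf1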

Random Distribution: SATA vs. SAS

Working Set Size    7.2K RPM SATA Response Time (ms)    10K RPM SAS Response Time (ms)
1G                  5.5                                 4
100G                7.5                                 4.5
1000G               12.5                                7

Zipf Distribution: SATA Only

Working Set Size    Response Time (ms)
1000G               8.5

Somewhat intuitively, as the working set (and therefore the seek distance) gets larger, the gap between "real SAS" and NL-SAS/SATA widens. This makes sense because with a 1G working set the seek time is close to zero, so only the rotational delay (determined by RPM) is a factor. In fact, the ratio of response times (5.5 ms to 4 ms) closely matches the ratio of rotational speeds (10K to 7.2K RPM), roughly 1.4:1 in both cases.

Also (just for fun) I used fio's "random_distribution=zipf" option to test the response time when reading across the entire range of the disk, but with a "hotspot" (zipf) access pattern rather than a uniform random read, since fully uniform random reads across a whole disk are pretty unrealistic.
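
For reference, here is a sketch of the zipf variant of that job. The theta value of 1.2 is an assumption on my part (it is the example value used in the fio documentation) rather than a record of the exact skew used in this test.

[randread-zipf]
rw=randread
bs=8k
iodepth=1
ioengine=libaio
direct=1
time_based
runtime=10
filesize=1000g
# theta of 1.2 is an assumed/example value, not the measured configuration
random_distribution=zipf:1.2
filename=/dev/sdf1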

In the more realistic case, reading across the entire disk with a zipf hotspot, the SATA drives shipped with Nutanix nodes deliver about 8.5 ms response time, around 125 IOPS per spindle.

 Conclusion

The performance difference between SAS and SATA is often overstated. At moderate loads SATA performs well enough for most use cases. Even when delivering fully random 8K reads over the entire disk, SATA comes in at less than 15 ms. Using a more realistic (not 100% uniformly random) access pattern, the response time is under 10 ms.

For a properly sized Nutanix implementation, the intent is to service most IO from flash. It's OK to push some work to HDD from time to time, even on SATA.