Work around for bios.hddOrder when creating an OVF/OVA template.

When changing SCSI devices in an ESX based VM, it’s easy to screw up the ability to boot.  The simple fix is to add

bios.hddOrder = “scsi0:0”

to the end of the .vmx file.  This has always worked for me.  The problem with this solution is that any OVF/OVA that is created from the VM will not include the .vmx file hack, and of course VM’s created from the template will not boot until their .vmx file is hand edited.

The solution that worked for me was to simply make the “boot drive” the first .vmdk file that is listed in the .vmx file.  In my case, the Linux OS is stored on the VMDK named “disk.vmdk”

In the before case, this disk is listed last (even though it has SCSI ID 0:0:0) and the VM does not boot.

I simply change the filename from disk_6.vmdk to disk.vmdk (and change the last item from disk.vmdk to disk_6.vmdk).

The beauty of this method is that the ordering is maintained when creating an OVF/OVA.

When the VM boots, the /dev/sd devices may change since the vmdk’s are now attached to different SCSI devices – so mounting using UUID’s in Linux helps keep things sane.

Before

:floppy0.fileName = "Floppy 0"
ide1:0.startConnected = "FALSE"
ide1:0.deviceType = "atapi-cdrom"
ide1:0.clientDevice = "TRUE"
ide1:0.fileName = "CD/DVD drive 0"
ide1:0.present = "TRUE"
scsi3:0.deviceType = "scsi-hardDisk"
scsi3:0.fileName = "disk_6.vmdk"
scsi3:0.present = "TRUE"
scsi3:1.deviceType = "scsi-hardDisk"
scsi3:1.fileName = "disk_1.vmdk"
scsi3:1.present = "TRUE"
scsi2:0.deviceType = "scsi-hardDisk"
scsi2:0.fileName = "disk_2.vmdk"
scsi2:0.present = "TRUE"
scsi2:1.deviceType = "scsi-hardDisk"
scsi2:1.fileName = "disk_3.vmdk"
scsi2:1.present = "TRUE"
scsi1:0.deviceType = "scsi-hardDisk"
scsi1:0.fileName = "disk_4.vmdk"
scsi1:0.present = "TRUE"
scsi1:1.deviceType = "scsi-hardDisk"
scsi1:1.fileName = "disk_5.vmdk"
scsi1:1.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "disk.vmdk"
scsi0:0.present = "TRUE"
vmci0.pciSlotNumber = "32"
1Gethernet0.virtualDev = "vmxnet3

After

:floppy0.fileName = "Floppy 0"
ide1:0.startConnected = "FALSE"
ide1:0.deviceType = "atapi-cdrom"
ide1:0.clientDevice = "TRUE"
ide1:0.fileName = "CD/DVD drive 0"
ide1:0.present = "TRUE"
scsi3:0.deviceType = "scsi-hardDisk"
scsi3:0.fileName = "disk.vmdk"
scsi3:0.present = "TRUE"
scsi3:1.deviceType = "scsi-hardDisk"
scsi3:1.fileName = "disk_1.vmdk"
scsi3:1.present = "TRUE"
scsi2:0.deviceType = "scsi-hardDisk"
scsi2:0.fileName = "disk_2.vmdk"
scsi2:0.present = "TRUE"
scsi2:1.deviceType = "scsi-hardDisk"
scsi2:1.fileName = "disk_3.vmdk"
scsi2:1.present = "TRUE"
scsi1:0.deviceType = "scsi-hardDisk"
scsi1:0.fileName = "disk_4.vmdk"
scsi1:0.present = "TRUE"
scsi1:1.deviceType = "scsi-hardDisk"
scsi1:1.fileName = "disk_5.vmdk"
scsi1:1.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "disk_6.vmdk"
scsi0:0.present = "TRUE"
vmci0.pciSlotNumber = "32"
1Gethernet0.virtualDev = "vmxnet3

Note: I tried editing the .ovf file and adding a key:value pair to the file, and regenerating the SHA1 and stashing the SHA1 in the .mf file.  The process worked, but the VM still did not boot, and the bios.hddOrder param was not in the .vmx file of the VM that was created from the template.

SATA on Nutanix. Some experimental data.

The question of  why  Nutanix uses SATA drive comes up sometimes, especially from customers who have experienced very poor performance using SATA on traditional arrays.

I can understand this anxiety.  In my time at NetApp we exclusively used SAS or FC-AL drives in performance test work.  At the time there was a huge difference in performance between SCSI and SATA.  Even a few short years ago, FC typically spun at 15K RPM whereas SATA was stuck at about a 5K RPM, so experiencing 3X the rotational delay.

These days SAS and SATA are both available in 7200 RPM configurations, and these are the type we use in standard Nutanix nodes.  In fact the SATA drives that we use are marketed by Seagate as “Nearline SAS”  or NL-SAS.   Mainly to differentiate them from the consumer grade SATA drives that are found in cheap laptops.  There are hundreds of SAS Vs SATA articles on the web, so I won’t go over the theoretical/historical arguments.

SATA in Hybrid/Tiered Storage

In a Nutanix cluster the “heavy lifting” of IO is mainly done by the SSD’s – leaving the SATA drives to service the few remaining IO’s that miss the SSD tier.  Under moderate load, the SATA spindles do pretty well, and since the SATA  $/GB is only 60% of SAS.  SATA seems like a good choice for mostly-cold data.

Let’s Experiment.

From a performance perspective,  I decided to run a few experiments to see just how well SATA performs.  In the test, the  SATA drives are Nutanix standard drives “ST91000640NS” (Seagate, priced around $150).  The comparable SAS drives are the same form-factor (2.5 Inch)  “AL13SEB900” (Toshiba, priced at about $250 USD).  These drives spin at 10K RPM.  Both drives hold around 1TB.

There are three experiments per drive type to reveal the impact of seek-times.  This is achieved using the “filesize” parameter of fio – which determines the LBA range to read.  One thing to note, is that I use a queue-depth of one.  Therefore IOPs can be calculated as simply 1/Response-Time (converted to seconds).

[global]
bs=8k
[randread]
rw=randread
iodepth=1
ioengine=libaio
time_based
runtime=10
direct=1
filesize=1g
filename=/dev/sdf1

Random Distribution. SATA Vs SAS

Working Set Size 7.2K RPM SATA Response Time (ms) 10K RPM SAS Response Time (ms)
1G 5.5 4
100G 7.5 4.5
1000G 12.5 7

Zipf Distribution SATA Only.

Working Set Size Response Time (ms)
1000G 8.5

Somewhat intuitively as the workingset (seek) gets larger, the difference between “Real SAS” and “NL-SAS/SATA” gets wider.  This is intuitive because with a 1G working-set,  the seek-time is close to zero, and so only the rotational delay (based on RPM) is a factor.  In fact the difference in response time is the same as the difference in rotational speed (1:1.3).

Also  (just for fun) I used the “random_distribution=zipf” function in fio to test the response time when reading across the entire range of the disk – but with a “hotspot” (zipf) rather than a uniform random read – which is pretty unrealistic.

In the “realistic” case – reading across the entire disk on the SATA drives shipped with Nutanix nodes is capable of 8.5 ms response time at 125 IOPS per spindle.

 Conclusion

The performance difference between SAS and SATA is often over-stated.  At moderate loads SATA performs well enough for most use-cases.  Even when delivering fully random IO over the entirety of the disk – SATA can deliver 8K in less than 15ms.  Using a more realistic (not 100% random) access pattern the response time is  < 10ms.

For a properly sized Nutanix implementation, the intent is to service most IO from Flash. It’s OK to generate some work on HDD from time-to-time even on SATA.

Impact of Paravirtual SCSI driver VS LSI Emulation with Data.

TL;DR  Comparison of Paravirtual SCSI Vs Emulated SCSI in with measurements.  PVSCSI gives measurably better response times at high load. 

During a performance debugging session, I noticed that the response time on two of the SCSI devices was much higher than the others (Linux host under vmware ESX).  The difference was unexpected since all the devices were part of the same stripe doing a uniform synthetic workload.

iostat output from the system under investigation.

 

The immediate observation is that queue length is higher, as is wait time.  All these devices reside on the same back-end storage so I am looking for something else.  When I traced back the devices it turned out that the “slow devices” were attached to LSI emulated controllers in ESX.  Whereas the “fast devices” are attached to para-virtual controllers.

I was surprised to see how much difference using para virtual (PV) SCSI drivers made to the guest response time once IOPS started to ramp up.  In these plots the y-axis is iostat “await” time.  The x-axis is time (each point is a 3 second average).

PVSCSI = Gey Dots
LSI Emulated SCSI = Red Dots
Lower is better.

 

Each plot is from a workload which  uses a different offered IO rate.  The offered rates are   8000,9000 and 10,000 the storage is able to meet the rates even though latency increases because there is a lot of outstanding IO.  The workload is mixed read/write with bursts.

8K IOPS9K IOPS10K IOPS

 

After converting sdh and sdi to PV SCSI the response time is again uniform across all devices.

10K IOPS PV

SuperScalin’: How I learned to stop worrying and love SQL Server on Nutanix.

TL;DR  It’s pretty easy to get 1M SQL TPM running a TPC-C like workload on a single Nutanix node.  Use 1 vDisk for Log files, and 6 vDisks for data files.  SQL Server  needs enough CPU and RAM to drive it.  I used 16 vCPU’s  and 64G of RAM.

Running database servers on Nutanix is an increasing trend and DBA’s are naturally skeptical about moving their DB’s to new platforms.  I recently had the chance to run some DB benchmarks on a couple of nodes in our lab.  My goal was to achieve 1M SQL transactions per node, and have that be linearly scalable across multiple nodes.

Screen Shot 2014-11-26 at 5.50.58 PM

It turned out to be ridiculously easy to generate decent numbers using SQL Server.  As a Unix and Oracle old-timer it was a shock to me, just how simple it is to throw up a SQL server instance.  In this experiment, I am using Windows Server 2012 and SQL-Server 2012.

For the test DB I provision 1 Disk for the SQL log files, and 6 disks for the data files.  Temp and the other system DB files are left unchanged.  Nothing is tuned or tweaked on the Nutanix side, everything is setup as per standard best practices – no “benchmark specials”.

SQL Server TPCC Scaling

Load is being generated by HammerDB configured to run the OLTP database workload.  I get a little over 1Million SQL transactions per minute (TPM) on a single Nutanix node.  The scaling is more-or-less linear, yielding 4.2 Million TPM  with 4 Nutanix nodes, which fit in a single 2U chassis . Each node is running both the DB itself, and the shared storage using NDFS.  I stopped at 6 nodes, because that’s all I had access to at the time.

The thing that blew me away in this was just how simple it had been.  Prior to using SQL server, I had been trying to set up Oracle to do the same workload.  It was a huge effort that took me back to the 1990’s, configuring kernel parameters by hand – just to stand up the DB.  I’ll come back to Oracle at a later date.

My SQL Server is configured with 16 vCPU’s and 64GB of RAM, so that the SQL server VM itself has as many resources as possible, so as not to be the bottleneck.

I use the following flags on SQL server.  In SQL terminology these are known as traceflags which are set in the SQL console (I used “DBCC trace status” to display the following.  These are fairly standard and are mentioned in our best practice guide.

Screen Shot 2014-11-30 at 8.38.45 PM

One thing I did change from the norm was to set the target recovery time to 240 seconds, rather than let SQL server determine the recovery time dynamically.  I found that in the benchmarking scenario, SQL server would not do any background flushing at all,  and then suddenly would checkpoint a huge amount of data which caused the TPM to fluctuate wildly.  With the recovery time hard coded to 240 seconds, the background page flusher keeps up with the incoming workload, and does not need to issue huge checkpoints.  My guess is that in real (non benchmark conditions) SQL server waits for the incoming work to drop-off and issues the checkpoint at that time.  Since my benchmark never backs off, SQL server eventually has to issue the checkpoint.

Screen Shot 2014-11-26 at 5.39.07 PM

 

Lord Kelvin Vs the IO blender

One of the characteristics of a  successful storage system for virtualized environments is that it must handle the IO blender.  Put simply, when lots of regular looking workloads are virtualized and presented to the storage, their regularity is lost, and the resulting IO stream starts to look more and more random.

 This is very similar to the way that synthesisers work – they take multiple regular sine waves of varying frequencies and add them together to get a much more complex sound.

 http://msp.ucsd.edu/techniques/v0.11/book-html/node14.html#fig01.08

That’s all pretty awesome for making cool space noises, but not so much when presented to the storage OS.  Without the ability to detect regularity, things like caching, pre-fetching and any kind of predictive algorithm break down.

That pre-fetch is never going to happen.

In Nutanix NOS we treat each of these sine waves (workloads) individually, never letting them get mixed together.  NDFS knows about vmdk’s or vhdx disks – and so by keeping the regular workloads separate we can still apply all the usual techniques to keep the bits flowing, even at high loads and disparate workload mixes that cause normal storage systems to fall over in a steaming heap of cache misses and metadata chaos.