SQL*Server on Nutanix. Force backups to HDD.

As an experiment, I wanted to (a) Create a HDD only container, and (b) measure the bandwidth I could achieve when backing up the SQL DB.  This was performed on a standard hybrid platform with only 4 HDD’s in the node.

First create a container, but add the special options “sequential-io-priority-order=DAS-SATA random-io-priority-order=DAS-SATA” which means that all IO will be directed to the HDD only. This also means that data on this container will never be migrated up. This is just fine for a backup that will hopefully never be read, and if it is – only once, sequentially.

ncli> ctr create name=cold-only sequential-io-priority-order=DAS-SATA random-io-priority-order=DAS-SATA sp-name=all
ncli> datastore create name=cold ctr-name=cold-only

Next create a vDisk in that container – this disk will contain the SQL Server backup data

Add vdisk to the cold-only container.
Add vdisk to the cold-only container.

Format and initialize the drive.

Format the drive to hold SQL backup.
Format the drive to hold SQL backup.

Add backup targets to the drive. Adding multiple targets increases throughput because SQL Server will generate 1-2 outstanding IO’s per target. I created 16 total targets (these are just files).

SQL Backup targets

The first backup is a little slow (~64MB/s), because we’re creating the files. A second (and subsequent) backups go faster, around  120 MB/s writing directly to the HDD spindles on a single node with 4 HDDs.

Overwrite old backups

This backup stream drives around 25MB/s per HDD spindle on the Nutanix node.  On a larger platform with more spindles – we could easily drive 500MB/s, and still skip SSD by writing directly to HDD.

25MB/s per spindle

120 MB/s Each way
Backup just started. About 115MB/s read, 115MB/s write on same node.

Completed backup:

Backup complete

SuperScalin’: How I learned to stop worrying and love SQL Server on Nutanix.

TL;DR  It’s pretty easy to get 1M SQL TPM running a TPC-C like workload on a single Nutanix node.  Use 1 vDisk for Log files, and 6 vDisks for data files.  SQL Server  needs enough CPU and RAM to drive it.  I used 16 vCPU’s  and 64G of RAM.

Running database servers on Nutanix is an increasing trend and DBA’s are naturally skeptical about moving their DB’s to new platforms.  I recently had the chance to run some DB benchmarks on a couple of nodes in our lab.  My goal was to achieve 1M SQL transactions per node, and have that be linearly scalable across multiple nodes.

Screen Shot 2014-11-26 at 5.50.58 PM

It turned out to be ridiculously easy to generate decent numbers using SQL Server.  As a Unix and Oracle old-timer it was a shock to me, just how simple it is to throw up a SQL server instance.  In this experiment, I am using Windows Server 2012 and SQL-Server 2012.

For the test DB I provision 1 Disk for the SQL log files, and 6 disks for the data files.  Temp and the other system DB files are left unchanged.  Nothing is tuned or tweaked on the Nutanix side, everything is setup as per standard best practices – no “benchmark specials”.

SQL Server TPCC Scaling

Load is being generated by HammerDB configured to run the OLTP database workload.  I get a little over 1Million SQL transactions per minute (TPM) on a single Nutanix node.  The scaling is more-or-less linear, yielding 4.2 Million TPM  with 4 Nutanix nodes, which fit in a single 2U chassis . Each node is running both the DB itself, and the shared storage using NDFS.  I stopped at 6 nodes, because that’s all I had access to at the time.

The thing that blew me away in this was just how simple it had been.  Prior to using SQL server, I had been trying to set up Oracle to do the same workload.  It was a huge effort that took me back to the 1990’s, configuring kernel parameters by hand – just to stand up the DB.  I’ll come back to Oracle at a later date.

My SQL Server is configured with 16 vCPU’s and 64GB of RAM, so that the SQL server VM itself has as many resources as possible, so as not to be the bottleneck.

I use the following flags on SQL server.  In SQL terminology these are known as traceflags which are set in the SQL console (I used “DBCC trace status” to display the following.  These are fairly standard and are mentioned in our best practice guide.

Screen Shot 2014-11-30 at 8.38.45 PM

One thing I did change from the norm was to set the target recovery time to 240 seconds, rather than let SQL server determine the recovery time dynamically.  I found that in the benchmarking scenario, SQL server would not do any background flushing at all,  and then suddenly would checkpoint a huge amount of data which caused the TPM to fluctuate wildly.  With the recovery time hard coded to 240 seconds, the background page flusher keeps up with the incoming workload, and does not need to issue huge checkpoints.  My guess is that in real (non benchmark conditions) SQL server waits for the incoming work to drop-off and issues the checkpoint at that time.  Since my benchmark never backs off, SQL server eventually has to issue the checkpoint.

Screen Shot 2014-11-26 at 5.39.07 PM


Lord Kelvin Vs the IO blender

One of the characteristics of a  successful storage system for virtualized environments is that it must handle the IO blender.  Put simply, when lots of regular looking workloads are virtualized and presented to the storage, their regularity is lost, and the resulting IO stream starts to look more and more random.

 This is very similar to the way that synthesisers work – they take multiple regular sine waves of varying frequencies and add them together to get a much more complex sound.


That’s all pretty awesome for making cool space noises, but not so much when presented to the storage OS.  Without the ability to detect regularity, things like caching, pre-fetching and any kind of predictive algorithm break down.

That pre-fetch is never going to happen.

In Nutanix NOS we treat each of these sine waves (workloads) individually, never letting them get mixed together.  NDFS knows about vmdk’s or vhdx disks – and so by keeping the regular workloads separate we can still apply all the usual techniques to keep the bits flowing, even at high loads and disparate workload mixes that cause normal storage systems to fall over in a steaming heap of cache misses and metadata chaos.


Designing a scaleout storage platform.

I was speaking to one of our developers the other day, and he pointed me to the following paper:  SEDA: An Architecture for Well-Conditioned, Scalable Internet Services as an example of the general philosophy behind the design of the Nutanix Distributed File System (NDFS).

Although the paper uses examples of both a webserver and a gnutella client, the philosophies are relevant to a large scale distributed filesystem.  In the case of NDFS we are serving disk blocks to clients who happen to be virtual machines.  One trade-off that is true in both cases is that scalability is traded for low latency in the single-stream case.  However at load, the response time is generally better than a system that is designed to low-latency, and then attempted to scale-up to achive high throughput.

At Nutanix we often talk about web-scale architectures, and this paper gives a pretty solid idea of what that might mean in concrete terms.

FWIW., according to google scholar, the paper has been cited 937 times, including Cassandra which is how we store filesystem meta-data in a distributed fashion.