At some point potential Hyper-converged infrastructure (HCI) users want to know – “How fast does this thing go?”. The real question is “how do we measure that?”.
The simplest test is to run a single VM with a single disk, issuing a single IO at a time. We often see this sort of test in bake-offs, and such a test does answer an important question – “What is the lowest possible response time I can expect from the storage?”
However, this test only gives a single data point. Since nobody purchases an HCI cluster to run a single VM, we also need to know what happens when multiple VMs are run at the same time. This is a much more difficult test to conduct, and many end-users lack access to, and experience with, tools that can give the full picture.
In the example below, the single-VM, single-vdisk, single-IO result is at the very far left of the chart. Since it’s impossible to read, I will tell you that the result is about 2,500 IOPS at ~400 microseconds. (In fact, with one outstanding IO we know that if the IOPS are 2,500 the response time MUST be 400 microseconds, since 1/2,500 = 0.0004 seconds.)
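If you want to check that arithmetic, a couple of lines of Python show the relationship between IOPS and response time at a queue depth of one, and the scaling factor quoted below (the helper name is mine, just for illustration):

```python
# At a queue depth of 1, average response time and IOPS are reciprocals.
def response_time_us(iops: float) -> float:
    """Average response time in microseconds for a serial (QD=1) workload."""
    return 1.0 / iops * 1_000_000

print(response_time_us(2_500))   # 400.0 us -> the single-VM data point
print(600_000 / 2_500)           # 240.0   -> cluster IOPS vs single-VM IOPS
```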
However with a single VM, the cluster is mostly idle, and has capacity to do much more work. In this X-Ray test I add another worker VM doing the exact same workload pattern to every node in the cluster every 5 minutes.
By the time we reach the end of the test, the total IOPS have increased to around 600,000 and the response time only increased by an additional 400 microseconds.
In other words the cluster was able to achieve 240X the amount of work measured by the single VM on a single node with only a 2X increase in response time, which is still less than 1ms.
The overall result is counter-intuitive, because the increase in IOPS (240X) is way out of line with the increase in response time (2X). The reason is that the single-VM test uses only a fraction of the cluster’s capacity.
When comparing HCI clusters to traditional storage arrays – you should expect the traditional array to outperform the cluster at the far left of the chart, but as work scales up the latent capacity of the HCI cluster is able to provide huge amounts of IO with very low response times.
You can run this test yourself by adding this custom workload to X-Ray
A customer specifically wanted to see how performance changes (and how quickly) as data moves automatically from HDD to SSD as it is accessed. The access pattern is 100% random across the entire disk.
In a hybrid Flash/HDD system, “cold” data (i.e. data that has not been accessed for a long time) is moved from SSD to HDD when the SSD capacity is exhausted. At some point in the future that same data may become “hot” again, so we want to make sure that the “newly hot” data is quickly moved back to the SSD tier. The duration of the above chart is around 5 minutes, and we see that by around the 3-minute mark the entire dataset is resident on the SSD tier.
This X-Ray test uses a couple of neat tricks to demonstrate ILM behavior.
Edit container preferences to send sequential data immediately to HDD
Overwriting data with NUL/zero bytes frees the underlying data on the Nutanix filesystem
To demonstrate ILM from HDD to SSD (and ultimately into the DRAM cache on the CVM) we first have to ensure that we have data on the HDD in the first place. By default Nutanix OS will always try to write new data to SSD. To circumvent that behavior we can edit the container preferences. We use the fact that the “prefill” will be a sequential workload, while the measured workload will be a random workload. To make the change, use “ncli” to change the “Sequential I/O Pri Order” to be HDD only.
In my case I happened to call my container “xray” since I didn’t want to change the default container. Now, when X-Ray executes the prefill stage, the data will land on HDD.
As a second requirement, we want to see what happens when IO with different block sizes is issued, so that we can get a chart similar to this. To achieve the desired behavior, we need to make sure that at the beginning of each test the data again resides on HDD. The problem is that the data is up-migrated during the test. So we do an initial overwrite of the entire disk with “NULL” bytes, using the fio parameter “zero_buffers”. This causes the data to be freed on the Nutanix filesystem. Then we issue a normal profile with random data. Once the data is freed, we know that the new initial writes will go to HDD – because we edited the container to do so. The overall test pattern looks like this:
Create and clone VMs
Prefill with random data (Data will reside on HDD due to container edit)
Read disk with 16KB block size
Zero out the disks – to remove/free the up-migrated data
What happens when power is lost to all nodes of an HCI cluster?
Ever wondered what happens when all power is simultaneously lost on an HCI cluster? One of the core principles of cloud design is that components are expected to fail, but the cluster as a whole should stay “up”. We wanted to see what happens when all components fail at once, so we designed an X-Ray test to do exactly that.
We start an OLTP workload on every node in the cluster, then X-Ray connects to the IPMI port on each node, and powers off all the hosts while the cluster is under load. In particular, the cluster is under read/write load (we need write workload, because we want to force the cluster to recover in-flight writes).
After power-off, we wait 10 seconds for everything to spin down, then immediately re-apply the power by connecting to the IPMI ports.
The nodes power up, and immediately start their POST (Power On Self Test) and boot the hypervisor. The CVM will auto-start, discover the available nodes and form the cluster.
X-Ray polls the cluster manager (either Prism or vCenter) to determine that the cluster is “up” and then restarts the OLTP workload.
Our testing showed that our Nutanix cluster completed POST, and was ready to restart work in around 10 minutes. Moreover, the time to achieve the recovery had very little variability. The chart below shows three separate runs on the same cluster.
This is the YAML file which defines the workload. The full specification is on github. The key parts of the YAML are nodes.PowerOff, which connects to the IPMI ports of each node, and vm_group.WaitForPowerOn, which connects to either Nutanix Prism or VMware vCenter and determines that the cluster is formed and ready to accept new work.
Creating an HCI benchmark to simulate multi-tenant workloads
HCI deployments are typically multi-tenant, and often different nodes will support different types of workloads. It is very common to have large, resource-hungry databases separated across nodes using anti-affinity rules. As with traditional storage, applications are writing to a shared storage environment, which is necessary to support VM movement. It is the shared storage that often causes performance issues for databases which are otherwise separated across nodes. We call this the noisy neighbor problem. A particular problem occurs when a reporting/analytical workload shares storage with a transactional workload.
In such a case we have a bandwidth-heavy workload profile (reporting) sharing storage with a latency-sensitive workload (transactional).
In the past it has been difficult to measure the noisy neighbor impact without going to the trouble of configuring the entire DB stack, and finding some way to drive it. However in X-Ray we can do exactly this sort of workload. We supply a pre-configured scenario which we call the DB Colocation test.
The DB Colocation test utilizes several properties of X-Ray not found in other benchmarking tools:
Time-based benchmark actions
Distinct per-VM workload patterns
Ability to provision particular workloads to particular hosts
In our example scenario, X-Ray begins by starting a workload modeled after a transactional DB (we call this the OLTP workload) on one of the nodes. This workload runs for 60 minutes. Then, 30 minutes into the OLTP run, X-Ray starts workloads modeled after reporting/analytical workloads on two other nodes (we call this the DSS workload).
At that point we have three independent workloads running on three independent nodes, but sharing the same storage. The key thing to observe is the impact on the latency-sensitive (OLTP) workload. In this experiment it is the DSS workloads which are the noisy neighbor, since they will tend to use a lot of the storage bandwidth. An ideal result is one where there is very little interference with the running OLTP workload, even though we expect latency to increase. We can compare the impact on the OLTP workload by comparing the IOPS/response time during the first 30 minutes (no interference) with the remainder of the run (after the DSS workloads are started). We should expect to see some increase in response time from the OLTP application because the other nodes in the cluster have gone from idle to under load. The key thing to observe is whether the OLTP IOPS target rate (4,000 IOPS) is still achieved when the reporting workload is applied.
X-Ray Scenario configuration
We specify the timing rules and workloads in the test.yml file. You can modify this to contain whichever values suit your model. I covered editing an existing workload in Part 1.
The overall scenario begins with the OLTP workload, which runs for 3,600 seconds (1 hour). The stagger_secs value is used if there are multiple OLTP sub-workloads; in this simple case we use a single OLTP workload.
The scenario pauses for 1,800 seconds using the test.wait specification, then immediately starts the DSS workload.
Finally the scenario uses the workload.Wait specification to wait for the OLTP workload to finish its (approximately) 1-hour run before the test is deemed complete.
X-Ray Workload specification
The DB Colocation test uses two workload profiles that aim to simulate transactional (OLTP) and reporting/analytical (DSS) workloads. The specifications for those workloads are contained in the two .fio files (oltp.fio and dss.fio).
OLTP
The OLTP workload (oltp.fio) that we ship has the following characteristics, based on typical configurations that we see in the field (of course you can change these to whatever you like):
Target IOP rate of 4,000 IOPS
4 “Data” Disks
50/50 read/write ratio.
90% 8KB, 10% 32KB block-size
8 outstanding IO per disk
2 “Log” Disks
100% write
90% sequential
32k block-size
1 outstanding IO per disk
The idea here is to simulate the two main storage workloads of a DB: the “data” portion and the “log” portion. Log writes are just used to commit transactions and so are 100% write; the only time the logs are read is during DB recovery, which is not part of this scenario. The “data” disks are doing both reads (from DB cache misses) and writes (committed transactions). A 50/50 read/write mix might be considered too write intensive – but we wanted to stress the storage in this scenario.
DSS
The DSS workload is configured to have the following characteristics
Target IOP rate of 1400 IOPS
4 “Data” Disks
100% Read workload with 1MB blocksize
10 Outstanding IOs
2 “Log” Disks
100% Write workload
90% sequential
32K block-size
1 outstanding IO per disk
The idea here is to simulate a large database doing a lot of reads across a large working-set size. The IO to the data disks is entirely read, and uses large blocks to simulate a database scanning a lot of records. The “Log” disks have a very light workload, purely to simulate an active database which is probably updating a few tables used for housekeeping.
Storage bus speeds with example storage endpoints.
| Bus | Lanes | End-Point | Theoretical Bandwidth (MB/s) | Note |
|---|---|---|---|---|
| SAS-3 | 1 | HBA <-> Single SATA Drive | 600 | SAS3 <-> SATA 6Gbit |
| SAS-3 | 1 | HBA <-> Single SAS Drive | 1200 | SAS3 <-> SAS3 12Gbit |
| SAS-3 | 4 | HBA <-> SAS/SATA Fanout | 4800 | 4 Lane HBA to Breakout (6 SSD)[2] |
| SAS-3 | 8 | HBA <-> SAS/SATA Fanout | 8400 | 8 Lane HBA to Breakout (12 SSD)[1] |
| PCIe-3 | 1 | N/A | 1000 | Single Lane PCIe3 |
| PCIe-3 | 4 | PCIe <-> SAS HBA or NVMe | 4000 | Enough for Single NVMe |
| PCIe-3 | 8 | PCIe <-> SAS HBA or NVMe | 8000 | Enough for SAS-3 4 Lanes |
| PCIe-3 | 40 | PCIe Bus <-> Processor Socket | 40000 | Xeon direct connect to PCIe bus |
Notes
All figures here are the theoretical maximums for the buses, using rough/easy calculations to convert bits/s to bytes/s – enough to figure out where the throughput bottlenecks are likely to be in a storage system.
SATA devices contain a single SAS/SATA port (connection), and even when they are connected to a SAS3 HBA, the SATA protocol limits each SSD device to ~600MB/s (single port, 6Gbit)
SAS devices may be dual ported (two connections to the device from the HBA(s)) – each with a 12Gbit connection giving a potential bandwidth of 2x12Gbit == 2.4Gbyte/s (roughly) per SSD device.
An NVMe device directly attached to the PCIe bus has access to a bandwidth of 4GB/s by using 4 PCIe lanes – or 8GB/s using 8 PCIe lanes. On current Xeon processors, a single socket attaches to 40 PCIe lanes directly (see diagram below) for a total bandwidth of 40GB/s per socket.
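If you want to reproduce the table’s numbers, the back-of-envelope conversion is simply to divide the line rate in Gbit/s by 10 (absorbing encoding overhead) and to treat a PCIe 3.0 lane as roughly 1 GB/s. A small sketch:

```python
# Rough bits/s -> bytes/s conversion used in the table: divide by 10 to
# absorb line-encoding overhead, and call a PCIe 3.0 lane ~1 GB/s.
def lane_mb_per_s(gbit_per_s: float) -> float:
    return gbit_per_s * 1000 / 10

SATA_PORT  = lane_mb_per_s(6)     # 600  MB/s - one 6 Gbit SATA port
SAS3_LANE  = lane_mb_per_s(12)    # 1200 MB/s - one 12 Gbit SAS-3 lane
PCIE3_LANE = 1000                 # ~1 GB/s   - one PCIe 3.0 lane

print(4 * SAS3_LANE)              # 4800  - 4-lane SAS-3 HBA to breakout
print(2 * SAS3_LANE)              # 2400  - dual-ported SAS SSD
print(4 * PCIE3_LANE)             # 4000  - PCIe 3.0 x4, enough for one NVMe
print(40 * PCIE3_LANE)            # 40000 - 40 lanes per Xeon socket
```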
I first started down the road of finally coming to grips with all the different busses and lane types after reading this excellent LSI paper. I omitted the SAS-2 figures from this article since modern systems use SAS-3 exclusively.
LSI SAS PCI Bottlenecks (PDF): https://www.n0derunner.com/wp-content/uploads/2018/04/LSI-SAS-PCI-Bottlenecks.pdf
It’s good to detect corrupted data. It’s even better to transparently repair that data and return the correct data to the user. Here we will demonstrate how the Nutanix filesystem detects and corrects corruption. Not all systems are made equally in this regard. The topic of corruption detection and remedy was the focus of this excellent Usenix paper, Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions. The authors find that many systems that should in theory be able to recover corrupted data do not in fact do so.
Within the guest Virtual Machine
Start with a Linux VM and write a specific pattern (0xdeadbeef) to /dev/sdg using fio.
Check that the expected data is written to the virtual disk and generate a SHA1 checksum of the entire disk.
The “od” command shows us that the entire 1GB disk contains the pattern 0xdeadbeef
The “sha1sum” command creates a checksum (digest) based on the content of the entire disk.
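The same check can be scripted. Here is a rough Python equivalent of the od / sha1sum step; it assumes, as above, that the guest sees the test disk as /dev/sdg and that the whole device was filled with the 0xdeadbeef pattern.

```python
import hashlib

PATTERN = bytes.fromhex("deadbeef")
sha1 = hashlib.sha1()

# Read the virtual disk in 1 MiB chunks, confirming the 0xdeadbeef pattern
# while building the same whole-disk digest that sha1sum reports.
with open("/dev/sdg", "rb") as disk:
    while chunk := disk.read(1 << 20):
        assert chunk == PATTERN * (len(chunk) // len(PATTERN)), "pattern mismatch"
        sha1.update(chunk)

print(sha1.hexdigest())
```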
Within the Nutanix CVM
Connect to the Nutanix CVM
Locate one of the 4MB egroups that back this virtual disk on the node.
The virtual disk which belongs to the guest vm (/dev/sdg) is represented in the Nutanix cluster as a series of “Egroups” within the Nutanix filesystem.
Using some knowledge of the internals I can locate the Egroups which make up the vDisk seen by the guest.
Double check that this is indeed an Egroup belonging to my vDisk by checking that it contains the expected pattern (0xdeadbeef)
Now simulate a hardware failure and overwrite the egroup with null data
I do this by reaching underneath the cluster filesystem and deliberately creating corruption, simulating a mis-directed write somewhere in the system.
If the system does not correct this situation, the user VM will not read 0xdeadbeef as it expects – remember the corruption happened outside of the user VM itself.
There are a lot of explanations of the current Meltdown/Spectre crisis, but many do not do a good job of explaining the core issue of how information is leaked from the secret side to the attacker’s side. This is my attempt to explain it (mostly to myself, to make sure I got it right).
What is going on here generally?
Generally speaking an adversary would like to read pieces of memory he is not allowed to. This can be either reading from kernel memory, or reading memory in the same address space that should not be allowed. e.g. javascript from a random web page should not be able to read the passwords stored in your browser.
Users and the kernel are normally protected from bad-actors via privileged modes, address page tables and the MMU.
It turns out that code executed speculatively can read any mapped memory – even addresses that would not be readable in the normal program flow.
Thankfully, values read illegally by speculatively executed code are not directly accessible to the attacker.
So, although the speculatively executed code reads an illegal value in micro-code, that value is not visible to user-written code (e.g. the JavaScript in the browser).
However, it turns out that we can execute a LOT of code in speculative mode if the pre-conditions are right.
In fact modern instruction pipelines (and slow memory) allow >100 instructions to be executed while memory reads are resolved.
How does it work?
The attacker reads the illegal memory using speculative execution, then uses the values read to set data in cache lines that ARE LEGITIMATELY VISIBLE to the attacker – thus creating a side channel between the speculatively executed code and the normal user-written code.
The values in the cache lines are not readable (by user code) – but the fact that the cache lines were loaded (or not) *IS* detectable (via timing) since the L3 cache is shared across address-space.
First I ensure the cache lines I want to use in this process are empty.
Then I set up some code that reads an illegal value (using the speculative execution technique) and, depending on whether that value is 0 or !=0, reads some other, specific address in the attacker’s own address space that I know will land in cache-line 1. Pretend I execute the second read only if the illegal value is !=0.
Finally back in normal user code I attempt to read that same address in my “real” user space. And if I get a quick response – I know that the illegal value was !=0, because the only way I get a quick response is if the cache line was loaded during the speculative execution phase.
It turns out we can encode an entire byte using this method. See below.
The attacker reads a byte – then by using bit shifting etc. – the attacker encodes all 8 bits in 8 separate cache lines that can then be subsequently read.
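A toy model in plain Python (nothing speculative actually happens here) may make the encode/decode step clearer. The “cache” is just a set, and checking membership stands in for the timing measurement an attacker would really use; the probe address and page size are arbitrary choices for illustration.

```python
CACHE = set()   # stands in for the shared L3 cache: addresses that are "cached"

def speculative_encode(secret_byte: int, probe_base: int = 0x100000) -> None:
    # In a real attack this part runs speculatively: for every bit of the
    # illegally-read byte that is set, touch one legitimately readable
    # cache line so that it becomes resident in the cache.
    for bit in range(8):
        if (secret_byte >> bit) & 1:
            CACHE.add(probe_base + bit * 4096)

def decode_byte(probe_base: int = 0x100000) -> int:
    # Back in normal user code: a "fast" read (cache hit) marks the bit as 1.
    value = 0
    for bit in range(8):
        if probe_base + bit * 4096 in CACHE:   # membership ~ a fast timed read
            value |= 1 << bit
    return value

speculative_encode(0x42)
print(hex(decode_byte()))   # 0x42 - recovered purely from cache-line state
```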
At this point an attacker has read a memory address he was not allowed to, encoded that value in shared cache lines, and then tested the existence or not of values in the cache lines via timing – thus reconstructing the value encoded in them during the speculative phase.
This is known as “leakage”.
Broadly there are two phases in this technique
The reading of illegal memory in speculative execution phase then encoding the byte in shared cache lines.
Using the timing of reads to those same cache lines to determine whether each was “set” (loaded, i.e. 1) or unset (empty, i.e. 0), and thereby decoding the byte from the set/unset (1/0) cache lines.
Side channels have been a known phenomenon for years (at least since the 1990s). What’s different now is how easily, and with how little error, attackers are able to read arbitrary memory addresses.
I found these papers to be informative and readable.
As performance analysts we often have to summarize large amounts of data in order to make engineering decisions or understand existing behavior. This paper will help you do exactly that! Many analysts know that using statistics can help, but statistical analysis is a huge field in itself and has its own complexity. The article below distills the essential techniques that can help you with typical performance analysis tasks.
Statistics for the performance analyst (PDF): https://www.n0derunner.com/wp-content/uploads/2018/01/Statistics-for-the-performance-analyst.pdf
We have started seeing misaligned partitions on Linux guests running certain HDFS distributions. How these partitions became misaligned is a bit of a mystery, because the only way I know how to do this on Linux is to create a partition using the old DOS format (using -c=dos and -u=cylinders).
Often we are presented with a vCenter screenshot, and an observation that there are “high latency spikes”. In the example, the response time is indeed quite high – around 80ms.
buffer_compress_percentage does what you’d expect and specifies how compressible the data should be.
refill_buffers is required to make the above compress percentage apply to the file as a whole. In other words, I want the entire file to be compressible by the buffer_compress_percentage amount.
buffer_pattern This is a big one. Without setting this pattern, fio will use null bytes to achieve compressibility, and Nutanix, like many other storage vendors, will suppress runs of zeros – so the data reduction will mostly come from zero suppression rather than from compression.
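To see why patterned (rather than zero-filled) buffers matter, here is a small Python sketch that builds a buffer roughly the way buffer_compress_percentage intends – a given fraction of trivially compressible filler plus random data – and checks what a real compressor makes of it. (The buffer construction is my own approximation, not fio’s actual algorithm.)

```python
import os
import zlib

def test_buffer(compress_pct: int, size: int = 1 << 20) -> bytes:
    """A rough stand-in for buffer_compress_percentage: compress_pct of the
    buffer is trivially compressible filler, the remainder is random data."""
    compressible = size * compress_pct // 100
    return bytes(compressible) + os.urandom(size - compressible)

for pct in (0, 50, 100):
    buf = test_buffer(pct)
    ratio = len(zlib.compress(buf)) / len(buf)
    print(f"target {pct:3d}% compressible -> zlib output is {ratio:.0%} of the input")
```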
Much of this is well explained in the README for latest version of fio.
Also note: older versions of fio do not support many of the fancy data-creation flags, but will not alert you to the fact that they are being ignored. I spent quite a bit of time wondering why my data was not compressed, until I downloaded and compiled the latest fio.
A downtime classic: for several months in 2013 the troubles of one very particular website were front-page news across the US. Full story from Time Magazine (PDF).
When benchmarking filesystems or storage, we need to understand the caching effects. Most often this involves filling the cache and reaching steady state. But how long will it take to fill a cache of a given size? The answer depends, of course, on the size of the cache, the IO size and the IO rate. So, to simplify, let’s just say that a cache consists of some number of entries. For instance, a 4GB cache would have 1 million 4KB entries. In my example this is simply a 1M entry cache.
In terms of time to fill the cache, it’s simpler to think about how many entries will need to be read before the cache is filled.
For a random workload, it will be more than 1M “reads”. Let’s see why.
The first read will be inserted into the cache, the second read will probably be inserted into the cache, but there is a small (1/1000000) chance that the second read will actually be already in the cache since it’s random. As the cache gets fuller – the chances of a given read already being present in cache increases. As a result it will take a lot more than 1 million reads to populate the entire cache with a random read workload.
The question is this: is it possible to predict how many “reads” it will take to fill the cache?
The experiment.
In this experiment, we create an array to represent the cache. It has 1M entries. Then, using a random number generator, we simulate the read workload and measure how many reads it takes to populate the entire cache.
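Here is a minimal version of that simulation in Python (my own reconstruction of the experiment described above, not the original code). Analytically, after k random reads into n slots you expect n·(1 − (1 − 1/n)^k) filled entries, which is about n·(1 − 1/e) ≈ 632,000 after the first million reads.

```python
import random

ENTRIES = 1_000_000
cache = [False] * ENTRIES            # False = empty slot, True = populated

filled, iteration = 0, 0
while filled < ENTRIES:
    # One "iteration" is one million uniformly random reads, as in the table below.
    for _ in range(ENTRIES):
        slot = random.randrange(ENTRIES)
        if not cache[slot]:
            cache[slot] = True
            filled += 1
    iteration += 1
    print(f"iteration {iteration:2d}: {filled:>9,} positive entries, "
          f"{ENTRIES - filled:>7,} empty slots")
```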
Results
After 1,000,000 “reads” there are 633,000 positive entries (entries that have data in them). So what happened to the other 367,000? The 367,000 represent cache “hits” on an existing entry. Since the read “workload” is 100% random, there is some chance that a subsequent read will be for an entry that is already cached. Over the life of 1,000,000 reads around 37% are for an entry that is already cached.
After 2,000,000 reads the cache contains 864,000 entries. Another 1,000,000 reads yields 950,000.
The fuller that the cache becomes, the fewer new entries are added. Intuitively this makes sense because as the cache becomes more full, more of the “random reads” are satisfied by an existing cache entry.
In my experiments it takes about 17,000,000 “reads” to ensure that every cache entry is filled in a 1M entry cache. Here are the data for 19 runs.
| Iteration | Positive Entries | Empty Slots |
|---|---|---|
| 1 | 631998 | 368002 |
| 2 | 864334 | 135666 |
| 3 | 950184 | 49816 |
| 4 | 981630 | 18370 |
| 5 | 993266 | 6734 |
| 6 | 997577 | 2423 |
| 7 | 999080 | 920 |
| 8 | 999660 | 340 |
| 9 | 999879 | 121 |
| 10 | 999951 | 49 |
| 11 | 999985 | 15 |
| 12 | 999996 | 4 |
| 13 | 999998 | 2 |
| 14 | 999998 | 2 |
| 15 | 999999 | 1 |
| 16 | 999999 | 1 |
| 17 | 1000000 | 0 |
| 18 | 1000000 | 0 |
For 500,000 Entries it takes 15 iterations to fill all the entries.
For 2,000,000 Entries it takes 19 iterations to fill all the entries.
Interestingly, the ratio of positive to empty entries after one iteration is always about 0.632:0.368 – that is, (1 − 1/e) : 1/e, which is exactly what you expect when n random reads are thrown at n empty slots.
As an experiment, I wanted to (a) Create a HDD only container, and (b) measure the bandwidth I could achieve when backing up the SQL DB. This was performed on a standard hybrid platform with only 4 HDD’s in the node.
First create a container, but add the special options “sequential-io-priority-order=DAS-SATA random-io-priority-order=DAS-SATA” which means that all IO will be directed to the HDD only. This also means that data on this container will never be migrated up. This is just fine for a backup that will hopefully never be read, and if it is – only once, sequentially.
Next create a vDisk in that container – this disk will contain the SQL Server backup data
Add vdisk to the cold-only container.
Format and initialize the drive.
Format the drive to hold SQL backup.
Add backup targets to the drive. Adding multiple targets increases throughput because SQL Server will generate 1-2 outstanding IO’s per target. I created 16 total targets (these are just files).
The first backup is a little slow (~64MB/s), because we’re creating the files. Second (and subsequent) backups go faster, around 120 MB/s, writing directly to the HDD spindles on a single node with 4 HDDs.
This backup stream drives around 25MB/s per HDD spindle on the Nutanix node. On a larger platform with more spindles – we could easily drive 500MB/s, and still skip SSD by writing directly to HDD.
Backup just started. About 115MB/s read, 115MB/s write on same node.
Recently I found that vdbench was not giving me the amount of outstanding IO that I had intended to configure by using the “threads=N” parameter. It turned out that with Linux, most of the filesystems (ext2, ext3 and ext4) do not support concurrent directIO, although they do support directIO. This was a bit of a shock coming from Solaris which had concurrent directIO since 2001.
All the Linux filesystems I tested allow multiple outstanding IO’s if the IO is submitted using asynchronous IO (AKA asyncIO or AIO) but not when using multiple writer threads (except XFS). Unfortunately vdbench does not allow AIO since it tries to be platform agnostic.
fio however does allow either threads or AIO to be used and so that’s what I used in the experiments below.
The column fio QD is the amount of outstanding IO, or Queue Depth that is intended to be passed to the storage device. The column iostat QD is the actual Queue Depth seen by the device. The iostat QD is not “8” because the response time is so low that fio cannot issue the IO’s quickly enough to maintain the intended queue depth.
| Device | fio QD | fio QD Type | direct | iostat QD | ps -efT \| grep fio \| wc -l |
|---|---|---|---|---|---|
| /dev/sd | 8 | libaio | Yes | 7 | 5 |
| /dev/sd | 8 | Threads | Yes | 7 | 12 |
| ext2 fs (mke2fs) | 8 | Threads | Yes | 1 | 12 |
| ext2 fs (mke2fs) | 8 | libaio | Yes | 7 | 5 |
| ext3 (mkfs -t ext3) | 8 | Threads | Yes | 1 | 12 |
| ext3 (mkfs -t ext3) | 8 | libaio | Yes | 7 | 5 |
| ext4 (mkfs -t ext4) | 8 | Threads | Yes | 1 | 12 |
| ext4 (mkfs -t ext4) | 8 | libaio | Yes | 7 | 5 |
| xfs (mkfs -t xfs) | 8 | Threads | Yes | 7 | 12 |
| xfs (mkfs -t xfs) | 8 | libaio | Yes | 7 | 5 |
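The gap between the requested and observed queue depth follows from Little’s law: the device only sees outstanding IO for the fraction of each IO’s lifetime actually spent at the device. A small sketch, using hypothetical latencies (the real figures were not recorded here):

```python
# Little's law: outstanding IO at a station = arrival rate x time spent there.
# fio keeps 8 IOs in flight end to end, but part of each IO's lifetime is
# spent in fio / kernel submission code rather than at the device, so iostat
# reports slightly fewer than 8 outstanding.
def device_queue_depth(intended_qd: int, device_latency_us: float,
                       host_overhead_us: float) -> float:
    return intended_qd * device_latency_us / (device_latency_us + host_overhead_us)

# Hypothetical figures: ~100us at the device plus ~14us of host-side work per
# IO leaves about 7 of the 8 requested IOs visible to the device.
print(round(device_queue_depth(8, 100, 14), 1))   # ~7.0
```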
At any rate, all is not lost – using raw devices (/dev/sdX) will give concurrent directIO, as will XFS. These issues are well known by Linux DB guys, and I found interesting articles from Percona and Kevin Closson after I finally figured out what was going on with vdbench.
The question of why Nutanix uses SATA drives comes up sometimes, especially from customers who have experienced very poor performance using SATA on traditional arrays.
I can understand this anxiety. In my time at NetApp we exclusively used SAS or FC-AL drives in performance test work. At the time there was a huge difference in performance between SCSI and SATA. Even a few short years ago, FC typically spun at 15K RPM whereas SATA was stuck at about 5K RPM, so it experienced 3X the rotational delay.
These days SAS and SATA are both available in 7200 RPM configurations, and these are the type we use in standard Nutanix nodes. In fact, the SATA drives that we use are marketed by Seagate as “Nearline SAS” or NL-SAS, mainly to differentiate them from the consumer-grade SATA drives that are found in cheap laptops. There are hundreds of SAS vs SATA articles on the web, so I won’t go over the theoretical/historical arguments.
SATA in Hybrid/Tiered Storage
In a Nutanix cluster the “heavy lifting” of IO is mainly done by the SSDs – leaving the SATA drives to service the few remaining IOs that miss the SSD tier. Under moderate load, the SATA spindles do pretty well, and since the SATA $/GB is only 60% of SAS, SATA seems like a good choice for mostly-cold data.
Let’s Experiment.
From a performance perspective, I decided to run a few experiments to see just how well SATA performs. In the test, the SATA drives are Nutanix standard drives “ST91000640NS” (Seagate, priced around $150). The comparable SAS drives are the same form-factor (2.5 Inch) “AL13SEB900” (Toshiba, priced at about $250 USD). These drives spin at 10K RPM. Both drives hold around 1TB.
There are three experiments per drive type, to reveal the impact of seek times. This is achieved using the “filesize” parameter of fio, which determines the LBA range to read. One thing to note is that I use a queue depth of one; therefore IOPS can be calculated as simply 1/response-time (converted to seconds).
As the working-set (and hence the seek distance) gets larger, the difference between “real SAS” and “NL-SAS/SATA” gets wider. This is intuitive because with a 1GB working-set the seek time is close to zero, and so only the rotational delay (based on RPM) is a factor. In fact, the difference in response time is about the same as the difference in rotational speed (1:1.3).
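For reference, average rotational latency is simply half a revolution, which is easy to compute for the two spindle speeds being compared here:

```python
# Average rotational latency = half a revolution = 0.5 / (RPM / 60) seconds.
def avg_rotational_latency_ms(rpm: int) -> float:
    return 0.5 / (rpm / 60) * 1000

print(avg_rotational_latency_ms(10_000))   # 3.0 ms  - 10K RPM SAS
print(avg_rotational_latency_ms(7_200))    # ~4.2 ms - 7200 RPM NL-SAS / SATA
```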
Also (just for fun) I used the “random_distribution=zipf” function in fio to test the response time when reading across the entire range of the disk – but with a “hotspot” (zipf) rather than a uniform random read – which is pretty unrealistic.
In the “realistic” case – reading across the entire disk – the SATA drives shipped with Nutanix nodes are capable of an 8.5 ms response time at 125 IOPS per spindle.
Conclusion
The performance difference between SAS and SATA is often over-stated. At moderate loads SATA performs well enough for most use-cases. Even when delivering fully random IO over the entirety of the disk – SATA can deliver 8K in less than 15ms. Using a more realistic (not 100% random) access pattern the response time is < 10ms.
For a properly sized Nutanix implementation, the intent is to service most IO from Flash. It’s OK to generate some work on HDD from time-to-time even on SATA.
TL;DR Comparison of Paravirtual SCSI vs emulated SCSI, with measurements. PVSCSI gives measurably better response times at high load.
During a performance debugging session, I noticed that the response time on two of the SCSI devices was much higher than the others (Linux host under vmware ESX). The difference was unexpected since all the devices were part of the same stripe doing a uniform synthetic workload.
iostat output from the system under investigation.
The immediate observation is that queue length is higher, as is wait time. All these devices reside on the same back-end storage so I am looking for something else. When I traced back the devices it turned out that the “slow devices” were attached to LSI emulated controllers in ESX. Whereas the “fast devices” are attached to para-virtual controllers.
I was surprised to see how much difference using para virtual (PV) SCSI drivers made to the guest response time once IOPS started to ramp up. In these plots the y-axis is iostat “await” time. The x-axis is time (each point is a 3 second average).
PVSCSI = grey dots; LSI emulated SCSI = red dots. Lower is better.
Each plot is from a workload which uses a different offered IO rate: 8,000, 9,000 and 10,000. The storage is able to meet these rates even though latency increases, because there is a lot of outstanding IO. The workload is mixed read/write with bursts.
After converting sdh and sdi to PV SCSI the response time is again uniform across all devices.
TL;DR It’s pretty easy to get 1M SQL TPM running a TPC-C like workload on a single Nutanix node. Use 1 vDisk for log files and 6 vDisks for data files. SQL Server needs enough CPU and RAM to drive it: I used 16 vCPUs and 64GB of RAM.
Running database servers on Nutanix is an increasing trend and DBA’s are naturally skeptical about moving their DB’s to new platforms. I recently had the chance to run some DB benchmarks on a couple of nodes in our lab. My goal was to achieve 1M SQL transactions per node, and have that be linearly scalable across multiple nodes.
It turned out to be ridiculously easy to generate decent numbers using SQL Server. As a Unix and Oracle old-timer it was a shock to me, just how simple it is to throw up a SQL server instance. In this experiment, I am using Windows Server 2012 and SQL-Server 2012.
For the test DB I provision 1 Disk for the SQL log files, and 6 disks for the data files. Temp and the other system DB files are left unchanged. Nothing is tuned or tweaked on the Nutanix side, everything is setup as per standard best practices – no “benchmark specials”.
Load is being generated by HammerDB configured to run the OLTP database workload. I get a little over 1 million SQL transactions per minute (TPM) on a single Nutanix node. The scaling is more-or-less linear, yielding 4.2 million TPM with 4 Nutanix nodes, which fit in a single 2U chassis. Each node is running both the DB itself and the shared storage using NDFS. I stopped at 6 nodes, because that’s all I had access to at the time.
The thing that blew me away in this was just how simple it had been. Prior to using SQL server, I had been trying to set up Oracle to do the same workload. It was a huge effort that took me back to the 1990’s, configuring kernel parameters by hand – just to stand up the DB. I’ll come back to Oracle at a later date.
My SQL Server is configured with 16 vCPU’s and 64GB of RAM, so that the SQL server VM itself has as many resources as possible, so as not to be the bottleneck.
I use the following flags on SQL Server. In SQL terminology these are known as trace flags, which are set in the SQL console (I used “DBCC TRACESTATUS” to display the following). These are fairly standard and are mentioned in our best practice guide.
One thing I did change from the norm was to set the target recovery time to 240 seconds, rather than let SQL server determine the recovery time dynamically. I found that in the benchmarking scenario, SQL server would not do any background flushing at all, and then suddenly would checkpoint a huge amount of data which caused the TPM to fluctuate wildly. With the recovery time hard coded to 240 seconds, the background page flusher keeps up with the incoming workload, and does not need to issue huge checkpoints. My guess is that in real (non benchmark conditions) SQL server waits for the incoming work to drop-off and issues the checkpoint at that time. Since my benchmark never backs off, SQL server eventually has to issue the checkpoint.