In a previous post I showed a chart which plots concurrency [X-axis] against throughput (IOPS) on the Y-Axis. Here is that plot again:
Experienced performance chart ogglers will notice the familiar pattern of Littles Law, whereby throughput (X) rises quickly as concurrency (N) is increased. As we follow the chart to the right, the slope flattens out and we achieve a lower increase in throughput, even as we increase concurrency by the same amount at each stage. The flattening of the curve is best understood as Amdahls Law.
Anyone who follows Dr. Neil Gunther and his Universal Scalability Law, will also recognize this curve.
The USL states that taking the values of concurrency and throughput as inputs, we can in fact calculate the scalability of the system. Specifically we are able to calculate the key factors of contention and crosstalk – which limit absolute linear scalability and eventually result in less throughput as additional load is submitted even as the capacity of the system is saturated.
Using his Excel spreadsheet, I was able to input the numbers from my test and derive values that determine scalability.
Taking the largest number (0.074%) the “contention value” (i.e the impact we expect due to Amdahls law) as the limit to linear scaling – we can say that for this particular cluster, running this particular (simplistic/synthetic) workload the Nutanix cluster scales 99.926% linear. Although I did not crank up the concurrency beyond 576, the model shows us that this cluster will start to degrade performance if we try to push concurrency beyond 600 or so. Again, the USL model is for this particular workload – on this particular cluster. Doubling the concurrency of the offered load to 1200 will only net us 500,000 IOPS according to the model.
The high linearity (99.926%) is expected. The workload is 100% read, and with the data-locality feature of Nutanix filesystem – we expect close to 100% scalability.
We will return to these measures of scalability in the future to look at more realistic workloads.
Creating a HCI benchmark to simulate multi-tennent workloads
HCI deployments are typically multi-tennant and often different nodes will support different types of workloads. It is very common to have large resource-hungry databases separated across nodes using anti-affinity rules. As with traditional storage, applications are writing to a shared storage environment which is necessary to support VM movement. It is the shared storage that often causes performance issues for data bases which are otherwise separated across nodes. We call this the noisy neighbor problem. A particular problem occurs when a reporting / analytical workload shares storage with a transactional workload.
In such a case we have a Bandwidth heavy workload profile (reporting) sharing with a Latency Sensitive workload (transactional)
In the past it has been difficult to measure the noisy neighbor impact without going to the trouble of configuring the entire DB stack, and finding some way to drive it. However in X-Ray we can do exactly this sort of workload. We supply a pre-configured scenario which we call the DB Colocation test.
The DB Colocation test utilizes two properties of X-Ray not found in other benchmarking tools
Time based benchmark actions
Distinct per-VM workload patterns
Ability to provision particular workloads, to particular hosts
In our example scenarioX-Ray begins by starting a workload modeled after a transactional DB (we call this the OLTP workload) on one of the nodes. This workload runs for 60 minutes. Then after 30 minutes X-Ray starts workloads modeled after reporting/analytical workloads on two other nodes (we call this the DSS workload).
After 30 minutes we have three independent workloads running on three independent nodes, but sharing the same storage. The key thing to observe is the impact on the latency sensitive (OLTP) workload. In this experiment it is the DSS workloads which are the noisy neighbor, since they will tend to utilize a lot of the storage bandwidth. An ideal result is one where there is very little interference with the running OLTP workload, even though we expect latency to increase. We can compare the impact on the OLTP workload by comparing the IOPS/response time during the first 30 minutes (no interference) with the remaining 60 minutes (after the DSS workloads are started). We should expect to see some increase in response time from the OLTP application because the other nodes in the cluster have gone from idle to under-load. The key thing to observe is whether the OLTP IOP target rate (4,000 IOPS) is achieved when the reporting workload is applied.
X-Ray Scenario configuration
We specify the timing rules and workloads in the test.yml file. You can modify this to contain whichever values suit your model. I covered editing an existing workload in Part 1.
The overall scenario begins with the OLTP workload, which will run for 3600 seconds (1 hour). The stagger_secs value is used if there are multiple OLTP sub-workloads. In the simple case we do use a single OLTP workload.
The scenario pauses for 1800 seconds using the test.wait specification then immediately starts the DSS workload
Finally the scenario uses the workload.Wait specification to wait for the OLTP workload to finish (approx 1 hour) before the test is deemed completed.
X-Ray Workload specification
The DB Co-Location test uses two workload profiles that aim to simulate transactional (OLTP) and reporting/analytical (DSS) workloads. The specifications for those workloads are contained in the two .fio files (oltp.fio and dss.fio)
The OLTP workload (oltp.fio) that we ship as has the following characteristcs based on typical configurations that we see in the field (of course you can change these to whatever you like).
Target IOP rate of 4,000 IOPS
4 “Data” Disks
50/50 read/write ratio.
90% 8KB, 10% 32KB bloc-ksize
8 outstanding IO per disk
2 “Log” Disks
1 outstanding IO per disk
The idea here is to simulate the two main storage workloads of a DB. The “data” portion and the “log” portion. Log writes are just used to commit transactions and so are 100% write. The only time the logs are read is during DB recovery, which is not part of this scenario. The “Data” disks are doing both reads (from DB cache misses) and writes committed transactions. A 50/50 read/write mix might be considered too write intensive – but we wanted to stress the storage in this scenario.
The DSS workload is configured to have the following characteristics
Target IOP rate of 1400 IOPS
4 “Data” Disks
100% Read workload with 1MB blocksize
10 Outstanding IOs
2 “Log” Disk
100% Write workload
1 outstanding IO per disk
The idea here is to simulate a large database doing a lot of reads across a large workingset size. The IO to the data disks is entirely read, and uses large blocks to simulate a database scanning a lot of records. The “Log” disks have a very light workload, purely to simulate an active database which is probably updating a few tables used for housekeeping.
When benchmarking filesystems or storage, we need to understand the caching effects. Most often this involves filling the cache and reaching steady state. But how long will it take to fill a cache of a given size? The answer depends of course on the size of the cache, the IO size and the IO rate. So, to simpify let’s just say that a cache consists of some number of entries. For instance a 4GB cache would have 1 million 4KB entries. In my example this is simply a 1M entry cache.
In terms of time to fill the cache, it’s simpler to think about how many entries will need to be read before the cache is filled.
For a random workload, it will be more than 1M “reads”. Let’s see why.
The first read will be inserted into the cache, the second read will probably be inserted into the cache, but there is a small (1/1000000) chance that the second read will actually be already in the cache since it’s random. As the cache gets fuller – the chances of a given read already being present in cache increases. As a result it will take a lot more than 1 million reads to populate the entire cache with a random read workload.
The question is this. Is is possible to predict, how many “reads” it will take to fill the cahe?
In this experiment, we create an array to represent the cache. It has 1M entries. Then using a random number generator, simulate the workload and measure how long it takes to populate the cahche.
After 1,000,000 “reads” there are 633,000 positive entries (entries that have data in them). So what happened to the other 367,000? The 367,000 represent cache “hits” on an existing entry. Since the read “workload” is 100% random, there is some chance that a subsequent read will be for an entry that is already cached. Over the life of 1,000,000 reads around 37% are for an entry that is already cached.
After 2,0000,000 reads the cache contains 864,000 entries. Another 1,000,000 reads yields 950,000.
The fuller that the cache becomes, the fewer new entries are added. Intuitively this makes sense because as the cache becomes more full, more of the “random reads” are satisfied by an existing cache entry.
In my experiments it takes about 17,000,000 “reads” to ensure that every cache entry is filled in a 1M entry cache. Here are the data for 19 runs.
For 500,000 Entries it takes 15 iterations to fill all the entries.
For 2,000,000 Entries it takes 19 iterations to fill all the entries.
Interestingly, the ratio of positive to empty entries after one iteration is always about 0.632:0.368