Following on from the previous experiments with Postgres and pgbench, here is a quick look at how the workload is seen from the Nutanix CVM.
The Linux VM running postgres has two virtual disks:
One is taking transaction log writes.
The other is doing reads and writes from the main datafiles.
Since the database size is small (50% the size of the Linux RAM) – the data is mostly cached inside the guest, and so most reads do not hit storage. As a result we only see writes going to the DB files.
Additionally, we see that database datafile writes arrive in a bursty fashion, and that these write bursts are more intense (~10x) than the log file writes.
Despite the database flushes occurring in bursts with a decent amount of concurrency, the Nutanix CVM provides an average write response time of 1.5ms.
From the Nutanix CVM port 2009 handler, we can access the individual vdisk statistics. In this particular case vDisk 45269 is the data file disk, and 40043 is the database transaction log disk.
The vdisk categorizer correctly identifies the database datafile write pattern as highly random.
As a result, the writes are passed into the replicated oplog.
Meanwhile the log writes are categorized as mostly sequential, which is expected for a database log file workload.
Even though the log writes are sequential, they are low-concurrency and small size (looks like mostly 16K-32K). This write pattern is also a good candidate for oplog.
In this example we run pgbench with a scale factor of 1000, which equates to a database size of around 15GB. The Linux VM has 32GB of RAM, so we don’t expect to see many reads.
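A rough sizing check makes the caching argument concrete. The sketch below assumes about 15MB of data per unit of scale factor, which is just the ratio implied by the figures above (scale factor 1000 is around 15GB). The helper names are made up for illustration and the result is an estimate, not a measurement.

```python
# Rough sizing sketch: how big is a pgbench database relative to guest RAM?
# ASSUMPTION: ~15 MB per scale-factor unit, derived from "scale factor 1000 ~= 15GB".
MB_PER_SCALE_UNIT = 15.0


def estimated_db_size_gb(scale_factor: int) -> float:
    """Approximate database size in GB (1 GB = 1000 MB) for a pgbench scale factor."""
    return scale_factor * MB_PER_SCALE_UNIT / 1000.0


def fraction_of_ram(scale_factor: int, ram_gb: float) -> float:
    """Estimated database size as a fraction of the guest RAM."""
    return estimated_db_size_gb(scale_factor) / ram_gb


if __name__ == "__main__":
    size = estimated_db_size_gb(1000)
    share = fraction_of_ram(1000, 32)
    print(f"scale factor 1000: ~{size:.0f} GB, {share:.0%} of a 32 GB guest")
```

When the fraction is well below 1, most reads should be served from memory inside the guest, which is why only writes show up at the storage layer.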
Using Prometheus with the Linux node exporter we can see the disk IO pattern from pgbench. As expected the write pattern to the log disk (sda) is quite constant, while the write pattern to the database files (sdb) is bursty.
I had to tune the parameter checkpoint_completion_target from 0.5 to 0.9 otherwise the SCSI stack became overwhelmed during checkpoints, and caused log-writes to stall.
In this example, we use Postgres and the pgbench workload generator to drive some load in a virtual machine. Assume a Linux virtual machine that has Postgres installed; specifically, I am using a Bitnami virtual appliance.
Once the VM has been started, connect to the console.
Allow access to port 5432 (the Postgres DB port), or allow ssh.
$ sudo ufw allow 5432
Note the postgres user password (cat ./bitnami_credentials)
Login to psql from the console or ssh
psql -U postgres
Optionally change password (the password prompted is the one from bitnami_credentials for the postgres database user).
psql -U postgres
postgres=# alter user postgres with password 'NEW_PASSWORD';
Create a DB to run the pgbench workload. In this case I name the db pgbench-sf10 for “Scale Factor 10”. Scale Factors are how the size of the database is determined.
$ sudo -u postgres createdb pgbench-sf10
Initialise the DB with data ready to run the benchmark. The “createdb” step just creates an empty schema.
-i means “initialize”
-s means “scale factor” e.g. 10
pgbench-sf10 is the database schema to use. We use the one we just created.
$ sudo -u postgres pgbench -i -s 10 pgbench-sf10
Now run a workload against the DB schema called pgbench-sf10
$ sudo -u postgres pgbench pgbench-sf10
The workload pattern, and load on the system will vary greatly depending on the scale factor.
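Since the scale factor is the main knob, it can be handy to generate the three commands above for any scale factor. The following is a minimal sketch that only builds the command lines, using the same flags (-i, -s) and the same pgbench-sfN naming as above. The function name is hypothetical, and nothing is executed unless you pass each command to something like subprocess.run yourself.

```python
import shlex
from typing import List


def pgbench_commands(scale_factor: int, os_user: str = "postgres") -> List[List[str]]:
    """Build the createdb / initialise / run commands for one scale factor."""
    db_name = f"pgbench-sf{scale_factor}"
    sudo = ["sudo", "-u", os_user]
    return [
        sudo + ["createdb", db_name],                               # empty schema
        sudo + ["pgbench", "-i", "-s", str(scale_factor), db_name],  # load data
        sudo + ["pgbench", db_name],                                # run workload
    ]


if __name__ == "__main__":
    # Print the commands rather than running them.
    for cmd in pgbench_commands(10):
        print("$", shlex.join(cmd))
```

Printing first and executing second keeps the sketch safe to try on a machine that does not have Postgres installed.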
This is a 2007 paper that still has lots to say on the subject of benchmarking storage and filesystems. It is primarily aimed at researchers and developers, but is relevant to anyone about to embark on a benchmarking effort.
Understand what you are testing: cached results are fine, as long as that is what you intended.
The authors are clear on why benchmarks remain important:
“Ideally, users could test performance in their own settings using real workloads. This transfers the responsibility of benchmarking from author to user. However, this is usually impractical because testing multiple systems is time consuming, especially in that exposing the system to real workloads implies learning how to configure the system properly, possibly migrating data and other settings to the new systems, as well as dealing with their respective bugs.”
We cannot expect end-users to be experts in benchmarking. It is our duty as experts to provide the tools (benchmarks) that enable users to make purchasing decisions without requiring years of benchmarking expertise.
For this experiment I am using Postgres v11 on Linux 3.10 kernel. The goal was to see what gains can be made from using hugepages. I use the “built in” benchmark pgbench to run a simple set of queries.
Since I am interested in only the gains from hugepages I chose to use the “-S” parameter to pgbench which means perform only the “select” statements. Obviously this masks any costs that might be seen when dirtying hugepages – but it kept the experiment from having to be concerned with writing to the filesystem.
The workstation has 32GB of memory. Postgres is given 16GB of memory using the parameter
shared_buffers = 16384MB
pgbench creates a ~7.4GB database using a scale factor of 500
Nutanix AOS 5.10 ships with a feature called Autonomous Extent Store (AES). AES effectively provides Metadata Locality to complement the existing data locality that has always existed. For large datasets (e.g. a 10TB database with 20% hot data) we observe a 2X improvement in throughput for random access across the 2TB hot dataset.
In our experiment we deliberately size the active working-set to NOT fit into the metadata cache. We uniformly access 2TB with a 100% random access pattern and record the time to access all 2TB. On the same hardware with AES enabled – the time is cut in half. As can be seen in the chart – the throughput is double, as expected.
It is the localization of metadata from AES that contributes to the 2X improvement. AES keeps most of the metadata local to the node – so there is no need to fetch data across-the-wire. Additionally AES reduces the need to cache metadata in DRAM since local access is so fast. For very large datasets, retrieving metadata can contribute a large proportion of the access time. This is true for all storage, so speeding up metadata resolution can make a dramatic improvement to overall throughput as we demonstrate.
During .Next 2018 in London, Nutanix announced performance improvements in the core-datapath said to give up to 2X performance improvements. Here’s a real-world example of that improvement in practice.
I am using X-Ray to simulate a 1TB data restore into an existing database. Specifically, the IO sizes are large (an even split of 64K, 128K, 256K and 1MB) and the pattern is 100% random across the entire 1TB dataset.
Normally storage benchmarks using large IO sizes are performed serially, because it’s easier on the storage back-end. That may be realistic for an initial load, but in this case we want to simulate a restore where the pattern is 100% random.
In this case the time to ingest 1TB drops by half when using Nutanix AOS 5.10 with Autonomous Extent Store (AES) enabled, compared with the previous traditional extent store.
This improvement is possible because with AES, inserting directly into the extent store is much faster.
For throughput sensitive, random workloads, AES can detect that it will be faster to skip the oplog. Skipping oplog allows AES to eliminate a network round trip to a remote oplog – and instead only make an RF2 copy for the Extent Store. By contrast, when sustained, large random IO is funneled into oplog, the 10Gbit network can become the bottleneck. Even with faster networks, AES will still be a benefit because the CPU and SSD resource usage is also lower. Unfortunately I only have 10Gbit networking in my lab!
The X-Ray files needed to run this test are on github
In a previous post I showed a chart which plots concurrency on the X-axis against throughput (IOPS) on the Y-axis. Here is that plot again:
Experienced performance chart oglers will notice the familiar pattern of Little's Law, whereby throughput (X) rises quickly as concurrency (N) is increased. As we follow the chart to the right, the slope flattens out and we achieve a lower increase in throughput, even as we increase concurrency by the same amount at each stage. The flattening of the curve is best understood as Amdahl's Law.
Anyone who follows Dr. Neil Gunther and his Universal Scalability Law, will also recognize this curve.
The USL states that taking the values of concurrency and throughput as inputs, we can in fact calculate the scalability of the system. Specifically we are able to calculate the key factors of contention and crosstalk – which limit absolute linear scalability and eventually result in less throughput as additional load is submitted even as the capacity of the system is saturated.
Using his Excel spreadsheet, I was able to input the numbers from my test and derive values that determine scalability.
Taking the largest number (0.074%), the “contention value” (i.e. the impact we expect due to Amdahl's Law), as the limit to linear scaling, we can say that for this particular cluster, running this particular (simplistic/synthetic) workload, the Nutanix cluster scales 99.926% linearly. Although I did not crank up the concurrency beyond 576, the model shows us that this cluster will start to degrade performance if we try to push concurrency beyond 600 or so. Again, the USL model is for this particular workload on this particular cluster. Doubling the concurrency of the offered load to 1200 will only net us 500,000 IOPS according to the model.
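For readers who want to play with the model outside the spreadsheet, here is a minimal sketch of the USL formula, X(N) = λN / (1 + σ(N−1) + κN(N−1)), where σ is contention and κ is crosstalk. The contention value is the 0.074% quoted above. The crosstalk value is an assumption picked purely so that the peak lands near a concurrency of 600 (it is not the fitted value from my spreadsheet), and λ is left at 1 so the output reads as relative capacity.

```python
import math


def usl_throughput(n: float, sigma: float, kappa: float, lam: float = 1.0) -> float:
    """Universal Scalability Law: throughput at concurrency n.

    sigma = contention (Amdahl-style serialisation), kappa = crosstalk,
    lam = throughput of a single worker (1.0 gives relative capacity).
    """
    return lam * n / (1.0 + sigma * (n - 1.0) + kappa * n * (n - 1.0))


def peak_concurrency(sigma: float, kappa: float) -> float:
    """Concurrency at which the USL curve peaks: sqrt((1 - sigma) / kappa)."""
    return math.sqrt((1.0 - sigma) / kappa)


SIGMA = 0.00074   # 0.074% contention, from the text
KAPPA = 2.78e-6   # ASSUMPTION: illustrative crosstalk, not the fitted value

if __name__ == "__main__":
    print(f"peak at N ~= {peak_concurrency(SIGMA, KAPPA):.0f}")
    for n in (1, 144, 288, 576, 1200):
        print(f"N={n:5d}  relative throughput={usl_throughput(n, SIGMA, KAPPA):7.1f}")
```

With any non-zero crosstalk the curve eventually bends downward, which is the behaviour the model predicts when the offered load is pushed past the peak.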
The high linearity (99.926%) is expected. The workload is 100% read, and with the data-locality feature of Nutanix filesystem – we expect close to 100% scalability.
We will return to these measures of scalability in the future to look at more realistic workloads.
The fio Pareto parameter allows us to create a workload which references a very large dataset but specifies a hotspot for the access pattern. Here’s an example using the same setup as the ILM experiment, but using a Pareto value of 0.8. My fio file looks like this:
random_distribution=pareto:0.8
The experiment shows that with a Pareto value of 0.8, meaning 20% of the overall dataset is “hot”, the ILM process happens much faster, because the hotspot is smaller and is identified faster than with a 100% uniform random access pattern. We would expect a similar shape for any sort of caching mechanism.
At some point potential Hyper-converged infrastructure (HCI) users want to know – “How fast does this thing go?”. The real question is “how do we measure that?”.
The simplest test is to run a single VM, with a single disk, and issue a single IO at a time. We often see this sort of test in bake-offs, and such a test does answer an important question: “what’s the lowest possible response time I can expect from the storage?”
However, this test only gives a single data point. Since nobody purchases a HCI cluster to run a single VM, we also need to know what happens when multiple VMs are run at the same time. This is a much more difficult test to conduct, and many end-users lack access and experience with tools that can give the full picture.
In the example below, the single VM, single vdisk, single IO result is at the very far left of the chart. Since it’s impossible to read, I will tell you that the result is about 2,500 IOPS at ~400 microseconds. (In fact, with only one IO outstanding, we know that if the IOPS are 2,500 the response time MUST be 400 microseconds: 1/2,500 == 0.0004 seconds.)
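The relationship used here is Little's Law: concurrency = throughput x response time. A small sketch of the arithmetic follows; the function names are mine and are only for illustration.

```python
def response_time_s(iops: float, outstanding_io: float = 1.0) -> float:
    """Little's Law rearranged: mean response time = concurrency / throughput."""
    return outstanding_io / iops


def implied_concurrency(iops: float, response_time: float) -> float:
    """Little's Law: mean number of IOs in flight = throughput * response time."""
    return iops * response_time


if __name__ == "__main__":
    rt = response_time_s(2500, outstanding_io=1)
    print(f"2,500 IOPS with one outstanding IO -> {rt * 1e6:.0f} microseconds")
```

The same two functions work in the other direction: given a measured IOPS and response time, implied_concurrency tells you how many IOs must have been in flight on average.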
However with a single VM, the cluster is mostly idle, and has capacity to do much more work. In this X-Ray test I add another worker VM doing the exact same workload pattern to every node in the cluster every 5 minutes.
By the time we reach the end of the test, the total IOPS have increased to around 600,000 and the response time only increased by an additional 400 microseconds.
In other words the cluster was able to achieve 240X the amount of work measured by the single VM on a single node with only a 2X increase in response time, which is still less than 1ms.
The overall result is counter-intuitive, because the rate of change in IOPS (240X) is way out of line with the increase in response time (2X). The single VM test is using only a fraction of the cluster capacity.
When comparing HCI clusters to traditional storage arrays – you should expect the traditional array to outperform the cluster at the far left of the chart, but as work scales up the latent capacity of the HCI cluster is able to provide huge amounts of IO with very low response times.
Specifically a customer wanted to see how performance changes (and how quickly) as data moves from HDD to SSD automatically as data is accessed. The access pattern is 100% random across the entire disk.
In a hybrid Flash/HDD system, “cold” data (i.e. data that has not been accessed for a long time) is moved from SSD to HDD when the SSD capacity is exhausted. At some point in the future that same data may become “hot” again, and so we want to make sure that the “newly hot” data is quickly moved back to the SSD tier. The duration of the above chart is around 5 minutes, and we see that by around the 3 minute mark the entire dataset is resident on the SSD tier.
This X-ray test uses a couple of neat tricks to demonstrate ILM behavior.
Edit container preferences to send sequential data immediately to HDD
Overwriting data with NUL/Zero bytes frees the underlying data on Nutanix filesystem
To demonstrate ILM from HDD to SSD (and ultimately into the DRAM cache on the CVM) we first have to ensure that we have data on the HDD in the first place. By default Nutanix OS will always try to write new data to SSD. To circumvent that behavior we can edit the container preferences. We use the fact that the “prefill” will be a sequential workload, while the measured workload will be a random workload. To make the change, use “ncli” to change the “Sequential I/O Pri Order” to be HDD only.
In my case I happened to call my container “xray” since I didn’t want to change the default container. Now, when X-Ray executes the prefill stage, the data will land on HDD.
As a second requirement, we want to see what happens when IO with different block sizes is issued, so that we can get a chart similar to this:
To achieve the desired behavior, we need to make sure that, at the beginning of each test, the data again resides on HDD. The problem is that the data is up-migrated during the test. To work around this we do an initial overwrite of the entire disk with NUL bytes using the fio parameter “zero_buffers”. This causes the data to be freed on the Nutanix filesystem. Then we issue a normal profile with random data. Once the data is freed, we know that the new initial writes will go to HDD, because we edited the container to do so.
The overall test pattern looks like this:
Create and clone VMs
Prefill with random data (Data will reside on HDD due to container edit)
Read disk with 16KB block size
Zero out the disks – to remove/free the up-migrated data
What happens when power is lost to all nodes of a HCI Cluster?
Ever wondered what happens when all power is simultaneously lost on a HCI cluster? One of the core principles of cloud design is that components are expected to fail, but the cluster as a whole should stay “up”. We wanted to see what happens when all components fail at once, so we designed an X-Ray test to do exactly that.
We start an OLTP workload on every node in the cluster, then X-Ray connects to the IPMI port on each node, and powers off all the hosts while the cluster is under load. In particular, the cluster is under read/write load (we need write workload, because we want to force the cluster to recover in-flight writes).
After power-off, we wait 10 seconds for everything to spin down, then immediately re-apply the power by connecting to the IPMI ports.
The nodes power up, and immediately start their POST (Power On Self Test) and boot the hypervisor. The CVM will auto-start, discover the available nodes and form the cluster.
X-Ray polls the cluster manager (either Prism or vCenter) to determine that the cluster is “up” and then restarts the OLTP workload.
Our testing showed that our Nutanix cluster completed POST, and was ready to restart work in around 10 minutes. Moreover, the time to achieve the recovery had very little variability. The chart below shows three separate runs on the same cluster.
This is the YAML file which defines the workload. The full specification is on GitHub. The key parts of the YAML are nodes.PowerOff, which connects to the IPMI ports of each node, and vm_group.WaitForPowerOn, which connects to either Nutanix Prism or VMware vCenter and determines that the cluster is formed and ready to accept new work.
Creating a HCI benchmark to simulate multi-tenant workloads
HCI deployments are typically multi-tenant, and often different nodes will support different types of workloads. It is very common to have large resource-hungry databases separated across nodes using anti-affinity rules. As with traditional storage, applications are writing to a shared storage environment, which is necessary to support VM movement. It is the shared storage that often causes performance issues for databases which are otherwise separated across nodes. We call this the noisy neighbor problem. A particular problem occurs when a reporting / analytical workload shares storage with a transactional workload.
In such a case we have a bandwidth-heavy workload profile (reporting) sharing with a latency-sensitive workload (transactional).
In the past it has been difficult to measure the noisy neighbor impact without going to the trouble of configuring the entire DB stack, and finding some way to drive it. However in X-Ray we can do exactly this sort of workload. We supply a pre-configured scenario which we call the DB Colocation test.
The DB Colocation test utilizes three properties of X-Ray not found in other benchmarking tools:
Time based benchmark actions
Distinct per-VM workload patterns
Ability to provision particular workloads, to particular hosts
In our example scenario X-Ray begins by starting a workload modeled after a transactional DB (we call this the OLTP workload) on one of the nodes. This workload runs for 60 minutes. Then after 30 minutes X-Ray starts workloads modeled after reporting/analytical workloads on two other nodes (we call this the DSS workload).
After 30 minutes we have three independent workloads running on three independent nodes, but sharing the same storage. The key thing to observe is the impact on the latency sensitive (OLTP) workload. In this experiment it is the DSS workloads which are the noisy neighbor, since they will tend to utilize a lot of the storage bandwidth. An ideal result is one where there is very little interference with the running OLTP workload, even though we expect latency to increase.
We can measure the impact on the OLTP workload by comparing the IOPS/response time during the first 30 minutes (no interference) with the remaining 30 minutes (after the DSS workloads are started). We should expect to see some increase in response time from the OLTP application because the other nodes in the cluster have gone from idle to under load. The key thing to observe is whether the OLTP IOP target rate (4,000 IOPS) is achieved when the reporting workload is applied.
X-Ray Scenario configuration
We specify the timing rules and workloads in the test.yml file. You can modify this to contain whichever values suit your model. I covered editing an existing workload in Part 1.
The overall scenario begins with the OLTP workload, which will run for 3600 seconds (1 hour). The stagger_secs value is used if there are multiple OLTP sub-workloads. In the simple case we use a single OLTP workload.
The scenario pauses for 1800 seconds using the test.wait specification, then immediately starts the DSS workload.
Finally the scenario uses the workload.Wait specification to wait for the OLTP workload to finish (approx 1 hour) before the test is deemed completed.
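The timing rules boil down to two measurement windows for the OLTP workload. The sketch below is not X-Ray code; it is a small Python illustration (with hypothetical names) of how the 3600 second OLTP run and the 1800 second wait split the results into a baseline period and a contended period.

```python
from typing import Dict, Tuple

OLTP_DURATION_SECS = 3600    # OLTP workload runtime, from the scenario
DSS_START_DELAY_SECS = 1800  # pause before the DSS workloads start


def measurement_windows(oltp_duration: int, dss_delay: int) -> Dict[str, Tuple[int, int]]:
    """Split the OLTP run into (start, end) windows, in seconds from test start."""
    if not 0 < dss_delay < oltp_duration:
        raise ValueError("DSS must start while the OLTP workload is still running")
    return {
        "baseline": (0, dss_delay),               # OLTP alone
        "contended": (dss_delay, oltp_duration),  # OLTP plus DSS neighbours
    }


if __name__ == "__main__":
    windows = measurement_windows(OLTP_DURATION_SECS, DSS_START_DELAY_SECS)
    for name, (start, end) in windows.items():
        print(f"{name:10s} {start:5d}s -> {end:5d}s ({(end - start) // 60} min)")
```

Comparing OLTP IOPS and response time between these two windows is how the noisy neighbor impact is read off the results.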
X-Ray Workload specification
The DB Co-Location test uses two workload profiles that aim to simulate transactional (OLTP) and reporting/analytical (DSS) workloads. The specifications for those workloads are contained in the two .fio files (oltp.fio and dss.fio)
The OLTP workload (oltp.fio) that we ship has the following characteristics, based on typical configurations that we see in the field (of course you can change these to whatever you like).
Target IOP rate of 4,000 IOPS
4 “Data” Disks
50/50 read/write ratio.
90% 8KB, 10% 32KB block size
8 outstanding IO per disk
2 “Log” Disks
1 outstanding IO per disk
The idea here is to simulate the two main storage workloads of a DB: the “data” portion and the “log” portion. Log writes are just used to commit transactions and so are 100% write. The only time the logs are read is during DB recovery, which is not part of this scenario. The “Data” disks are doing both reads (from DB cache misses) and writes (from committed transactions). A 50/50 read/write mix might be considered too write intensive, but we wanted to stress the storage in this scenario.
The DSS workload is configured to have the following characteristics
Target IOP rate of 1400 IOPS
4 “Data” Disks
100% Read workload with 1MB blocksize
10 Outstanding IOs
2 “Log” Disks
100% Write workload
1 outstanding IO per disk
The idea here is to simulate a large database doing a lot of reads across a large workingset size. The IO to the data disks is entirely read, and uses large blocks to simulate a database scanning a lot of records. The “Log” disks have a very light workload, purely to simulate an active database which is probably updating a few tables used for housekeeping.
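To see why the DSS profile is the bandwidth-heavy neighbor, it helps to convert the two target rates into approximate bandwidth. This is a back-of-the-envelope sketch: it assumes the target IOP rates apply to the data disks, ignores the small amount of log traffic, and treats KB and MB as powers of two. The names are illustrative only.

```python
def mean_block_kb(mix):
    """Weighted mean block size in KB for a list of (fraction, size_kb) pairs."""
    return sum(fraction * size_kb for fraction, size_kb in mix)


def bandwidth_mb_s(iops: float, block_kb: float) -> float:
    """Approximate bandwidth in MB/s (1 MB = 1024 KB) for a given IOP rate."""
    return iops * block_kb / 1024.0


# OLTP: 4,000 IOPS target, 90% 8KB and 10% 32KB blocks
OLTP_MB_S = bandwidth_mb_s(4000, mean_block_kb([(0.9, 8), (0.1, 32)]))
# DSS: 1,400 IOPS target, 1MB blocks (per DSS workload)
DSS_MB_S = bandwidth_mb_s(1400, 1024)

if __name__ == "__main__":
    print(f"OLTP ~{OLTP_MB_S:.0f} MB/s, each DSS workload ~{DSS_MB_S:.0f} MB/s")
```

Under these assumptions each DSS workload asks for well over an order of magnitude more bandwidth than the OLTP workload, even though its IOP rate is lower.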
Storage bus speeds with example storage endpoints.
Theoretical Bandwidth (MB/s)
HBA <-> Single SATA Drive
HBA <-> Single SAS Drive
HBA <-> SAS/SATA Fanout
4 Lane HBA to Breakout (6 SSD)
HBA <-> SAS/SATA Fanout
8 Lane HBA to Breakout (12 SSD)
Single Lane PCIe3
PCIe <-> SAS HBA or NVMe
Enough for Single NVMe
PCIe <-> SAS HBA or NVMe
Enough for SAS-3 4 Lanes
PCIe Bus <-> Processor Socket
Xeon Direct connect to PCIe Bus
All figures here are the theoretical maximums for the busses using rough/easy calculations for bits/s<->bytes/s. Enough to figure out where the throughput bottlenecks are likely to be in a storage system.
SATA devices contain a single SAS/SATA port (connection), and even when they are connected to a SAS3 HBA, the SATA protocol limits each SSD device to ~600MB/s (single port, 6Gbit)
SAS devices may be dual ported (two connections to the device from the HBA(s)) – each with a 12Gbit connection giving a potential bandwidth of 2x12Gbit == 2.4Gbyte/s (roughly) per SSD device.
An NVMe device directly attached to the PCIe bus has access to a bandwidth of 4GB/s by using 4 PCIe lanes – or 8GB/s using 8 PCIe lanes. On current Xeon processors, a single socket attaches to 40 PCIe lanes directly (see diagram below) for a total bandwidth of 40GB/s per socket.
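The rough arithmetic behind these figures can be written down in a few lines. The sketch assumes the “easy” conversion is to divide the line rate in bits by ten (which matches 6Gbit giving ~600MB/s) and assumes roughly 1GB/s per PCIe3 lane, as implied by the 4-lane and 8-lane figures above. The function names are mine.

```python
def link_mb_per_s(gbit_per_s: float, ports: int = 1) -> float:
    """Rough bytes/s for a serial link: divide the bit rate by 10.

    ASSUMPTION: ten bits on the wire per byte, the easy rule of thumb.
    """
    return gbit_per_s * ports * 1000.0 / 10.0


def pcie3_mb_per_s(lanes: int, mb_per_lane: float = 1000.0) -> float:
    """Rough PCIe3 bandwidth, assuming ~1 GB/s per lane."""
    return lanes * mb_per_lane


if __name__ == "__main__":
    print("SATA SSD (1 x 6Gbit):    ", link_mb_per_s(6), "MB/s")
    print("SAS SSD (2 x 12Gbit):    ", link_mb_per_s(12, ports=2), "MB/s")
    print("NVMe on 4 PCIe3 lanes:   ", pcie3_mb_per_s(4), "MB/s")
    print("Xeon socket, 40 lanes:   ", pcie3_mb_per_s(40), "MB/s")
```

Comparing the device-side numbers with the bus-side numbers is a quick way to spot which hop is likely to saturate first.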
I first started down the road of finally coming to grips with all the different busses and lane types after reading this excellent LSI paper. I omitted the SAS-2 figures from this article since modern systems use SAS-3 exclusively.
LSI SAS PCI Bottlenecks: https://www.n0derunner.com/wp-content/uploads/2018/04/LSI-SAS-PCI-Bottlenecks.pdf
There are a lot of explanations for the current Meltdown/Spectre crisis, but many did not do a good job of explaining the core issue of how information is leaked from the secret side to the attacker's side. This is my attempt to explain it (mostly to myself to make sure I got it right).
What is going on here generally?
Users and the kernel are normally protected from bad-actors via privileged modes, address page tables and the MMU.
It turns out that code executed speculatively can read any mapped memory, even addresses that would not be readable in the normal program flow.
Thankfully the values from illegal reads in speculatively executed code are not directly accessible to the attacker.
However, it turns out that we can execute a LOT of code in speculative mode if the pre-conditions are right.
In fact modern instruction pipelines (and slow memory) allow >100 instructions to be executed while memory reads are resolved.
How does it work?
The attacker reads the illegal memory using speculative execution, then uses the values read – to set data in cache lines that ARE LEGITIMATELY VISIBLE to the attacker. Thus creating a side channel between the speculatively executed code and the normal user written code.
The values in the cache lines are not readable (by user code), but the fact that the cache lines were loaded (or not) *IS* detectable (via timing), since the L3 cache is shared across address spaces.
First I ensure the cache lines I want to use in this process are empty.
Then I set up some code that reads an illegal value (using the speculative execution technique), and depending on whether that value is 0 or !=0 I read some other specific address in the attacker's address space that I know will be cached in cache-line 1. Pretend I execute the second read only if the illegal value is !=0.
Finally back in normal user code I attempt to read that same address in my “real” user space. And if I get a quick response – I know that the illegal value was !=0, because the only way I get a quick response is if the cache line was loaded during the speculative execution phase.
It turns out we can encode an entire byte using this method. See below.
The attacker reads a byte – then by using bit shifting etc. – the attacker encodes all 8 bits in 8 separate cache lines that can then be subsequently read.
At this point an attacker has read a memory address he was not allowed to, encoded that value in shared cache-lines and then tested the existence or not of values in the cache lines via timing, and thus re-constructs the value encoded in them during the speculative phase.
This is known as “leakage”.
Broadly there are two phases in this technique
The reading of illegal memory in speculative execution phase then encoding the byte in shared cache lines.
Using timing of reads to those same cache lines to determine if they were “set” (loaded, e.g. “1”) or unset (empty, “0”) by the attacker, to decode the byte from the (set/unset, 1/0) cache lines.
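The two phases can be mimicked with a toy model. To be clear, the sketch below performs no speculative execution and no real timing; the “cache” is just a Python set, and a “fast read” is simply a membership test. It only shows how one byte can be encoded as eight loaded/not-loaded cache lines and then decoded again. All names are made up for illustration.

```python
class ToyCache:
    """Stand-in for a shared cache: remembers which line numbers are loaded."""

    def __init__(self):
        self.lines = set()

    def flush(self):
        self.lines.clear()

    def load(self, line: int):
        self.lines.add(line)

    def read_is_fast(self, line: int) -> bool:
        # In a real attack this would be a timed memory read.
        return line in self.lines


def speculative_encode(cache: ToyCache, secret_byte: int) -> None:
    """Phase 1: touch cache line i only if bit i of the secret is 1."""
    for bit in range(8):
        if (secret_byte >> bit) & 1:
            cache.load(bit)


def timing_decode(cache: ToyCache) -> int:
    """Phase 2: rebuild the byte from which lines respond 'fast'."""
    value = 0
    for bit in range(8):
        if cache.read_is_fast(bit):
            value |= 1 << bit
    return value


def leak_byte(secret_byte: int) -> int:
    cache = ToyCache()
    cache.flush()                       # start with empty cache lines
    speculative_encode(cache, secret_byte)
    return timing_decode(cache)


if __name__ == "__main__":
    print(hex(leak_byte(0xA5)))
```

The secret value itself never crosses from the encode step to the decode step; only the loaded or not-loaded state of the eight lines does, which is the side channel.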
Side channels have been a known phenomenon for years (at least since the 1990s). What’s different now is how easily, and with how low an error rate, attackers are able to read arbitrary memory addresses.
I found these papers to be informative and readable.