Quick & Dirty Prometheus on OS-X

How to install Prometheus on OS-X

Install prometheus

  • Download the compiled prometheus binaries from prometheus.io
  • Unzip the binary and cd into the directory.
  • Run the prometheus binary from the command line; it will listen on port 9090
$ cd /Users/gary.little/Downloads/prometheus-2.16.0-rc.0.darwin-amd64
$ ./prometheus
  • From a local browser, point to localhost:9090
prometheus web-ui

Add a collector/scraper to monitor the OS

Prometheus itself does not do much apart from monitoring itself; to do anything useful we have to add a scraper/exporter module. The easiest place to start is the exporter that monitors OS-X itself. As in Linux, the OS exporter is simply called “node exporter”.

Start by downloading the pre-compiled darwin node exporter from prometheus.io

  • Unzip the tar.gz
  • cd into the directory
  • run the node exporter
$ cd /Users/gary.little/Downloads/node_exporter-0.18.1.darwin-amd64
$ ./node_exporter
 INFO[0000] Starting node_exporter (version=0.18.1, branch=HEAD, revision=3db77732e925c08f675d7404a8c46466b2ece83e)  source="node_exporter.go:156"
 INFO[0000] Build context (go=go1.11.10, user=root@4a30727bb68c, date=20190604-16:47:36)  source="node_exporter.go:157"
 INFO[0000] Enabled collectors:                           source="node_exporter.go:97"
 INFO[0000]  - boottime                                   source="node_exporter.go:104"
 INFO[0000]  - cpu                                        source="node_exporter.go:104"
 INFO[0000]  - diskstats                                  source="node_exporter.go:104"
 INFO[0000]  - filesystem                                 source="node_exporter.go:104"
 INFO[0000]  - loadavg                                    source="node_exporter.go:104"
 INFO[0000]  - meminfo                                    source="node_exporter.go:104"
 INFO[0000]  - netdev                                     source="node_exporter.go:104"
 INFO[0000]  - textfile                                   source="node_exporter.go:104"
 INFO[0000]  - time                                       source="node_exporter.go:104"
 INFO[0000] Listening on :9100                            source="node_exporter.go:170"
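
Prometheus will not scrape the node exporter until it is added as a target. A minimal change is to add a job for it under the existing scrape_configs section of the prometheus.yml that ships alongside the prometheus binary (the job name “node” is arbitrary), then restart prometheus or send it a SIGHUP:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']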

SQL Server uses only one NUMA Node with HammerDB

Some versions of HammerDB (e.g. 3.2) may induce imbalanced NUMA utilization with SQL Server.

This is easy to observe with Resource Monitor. When NUMA imbalance occurs, one of the NUMA nodes will show much higher utilization than the other. E.g.

Imbalanced NUMA usage by SQL Server.

The cause and fix are well documented on this blog. In short, HammerDB issues a short-lived connection for every persistent connection. This causes SQL Server's round-robin allocation to send all the persistent worker threads to a single NUMA node! To resolve the issue, simply comment out line #212 in the driver script.

Comment out this line to work-around the HammerDB NUMA imbalance problem.

If successful you will immediately see that the NUMA nodes are more balanced. Whether this results in more/better performance will depend on exactly where the bottleneck is.

Balanced NUMA usage by SQL Server

How to run vertica vioperf tool

The vertica vioperf tool is used to determine whether the storage you are planning on using is fast enough to feed the vertica database. When I initially ran the tool, the IO performance reported by the tool and confirmed by iostat was much lower than I expected for the storage device (a 6Gbit SATA device capable of around 500MB/s read and write).

The vioperf tool runs on a Linux host or VM and can be pointed at any filesystem, just like fio or vdbench.

Simple execution of vioperf writing to the location /vertica

vioperf --thread-count=8 --duration=120s  /vertica

Working Set Size

Unlike traditional IO generators, vioperf does not allow you to specify the working-set size. The amount of data written is simply the achieved throughput (1MB IO size × achieved IO rate) multiplied by the runtime. So fast storage with long run-times needs a lot of capacity, otherwise the tool simply fills the partition and crashes!
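
As a rough worked example (the numbers here are illustrative, not measured): a device sustaining 500MB/s over a 600 second run needs about 300GB free on the target filesystem.

data written ≈ achieved throughput × runtime
500 MB/s × 600 s ≈ 300,000 MB ≈ 300 GB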

Measurement and goodness

The primary metric is MB/s per core. The idea is that you give one thread per core in the system, though there is nothing stopping you from using whatever --thread-count value you like.

Although the measure is throughput, the primary metric (throughput per core) does not improve just by adding more concurrency. Concurrency is generated purely by the number of threads, and since the measure of goodness is throughput per core (or per thread), it is not possible to create throughput from concurrency alone.

Throughput compared to fio

Compared to fio, the reported throughput is lower for the same device and the same degree of concurrency. Vertica continually writes to and extends the files, so there is some filesystem work going on, whereas fio is typically overwriting an existing file. If you observe iostat during the vioperf run you will see that the IO size to disk is different from what an fio run generates. Again, this is because vioperf is continually extending the file(s) being written and so needs to update filesystem metadata quite frequently. These small metadata updates skew the average IO size lower.
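
To watch this for yourself, run iostat in extended mode against the device while vioperf (or fio) is running. The device name sdb below is specific to my setup:

iostat -x sdb 2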

fio with 1MB IO and 1 thread

Notice the avgrq size is 1024 blocks (512KB) which is the maximum transfer size that this drive supports.

 fio --filename=/samsung/vertica/file --size=5g --bs=1m --ioengine=libaio --iodepth=1 --rw=write --direct=1 --name=samsung --create_on_open=0


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.16    0.00    3.40    0.00    0.00   92.43

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  920.00     0.00 471040.00  1024.00     1.40    1.53    0.00    1.53   1.02  93.80

Vertica IOstat 1 thread

Firstly, we see that iostat reports much lower disk throughput than we achieved with fio for the same offered workload (1MB IO size with 1 outstanding IO, i.e. 1 thread).

Also notice that although vioperf issues 1MB IOs (which we can see from strace), iostat does not report the same 1024-block transfers that we see when running iostat during an fio run (as above).

In the vioperf case, the small metadata writes needed to continually extend the file result in a smaller average IO size than overwriting an existing file. Perhaps that is the cause of the lower performance?

./vioperf --duration=300s --thread-count=1 /samsung/vertica

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.77    0.13    2.38    5.26    0.00   83.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  627.00     0.00 223232.00   712.06     1.02    1.63    0.00    1.63   0.69  43.20
strace -f ./vioperf --duration=300s --thread-count=1 --disable-crc /samsung/vertica
...
[pid  1350] write(6, "v\230\242Q\357\250|\212\256+}\224\270\256\273\\\366k\210\320\\\330z[\26[\6&\351W%D"..., 1048576) = 1048576
[pid  1350] write(6, "B\2\224\36\250\"\346\241\0\241\361\220\242,\207\231.\244\330\3453\206'\320$Y7\327|5\204b"..., 1048576) = 1048576
[pid  1350] write(6, "\346r\341{u\37N\254.\325M'\255?\302Q?T_X\230Q\301\311\5\236\242\33\1)4'"..., 1048576) = 1048576
[pid  1350] write(6, "\5\314\335\264\364L\254x\27\346\3251\236\312\2075d\16\300\245>\256mU\343\346\373\17'\232\250n"..., 1048576) = 1048576
[pid  1350] write(6, "\272NKs\360\243\332@/\333\276\2648\255\v\243\332\235\275&\261\37\371\302<\275\266\331\357\203|\6"..., 1048576) = 1048576
[pid  1350] write(6, "v\230\242Q\357\250|\212\256+}\224\270\256\273\\\366k\210\320\\\330z[\26[\6&\351W%D"..., 1048576) = 1048576
...

However, look closely and you will notice that the %user is higher than fio for a lower IO rate AND the disk is not 100% busy. That seems odd.

./vioperf --duration=300s --thread-count=1 /samsung/vertica

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.77    0.13    2.38    5.26    0.00   83.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  627.00     0.00 223232.00   712.06     1.02    1.63    0.00    1.63   0.69  43.20

vioperf with --disable-crc

Finally, we disable the CRC checking (which vioperf does by default) to get a higher throughput, more similar to what we see with fio.

It turns out that the lower performance was not due to the smaller IO sizes (and additional filesystem work) but was caused by the CRC checking that the tool does to simulate the vertica application.

 ./vioperf --duration=300s --thread-count=1 --disable-crc /samsung/vertica

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.77    0.13    2.38    5.26    0.00   83.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  627.00     0.00 223232.00   712.06     1.02    1.63    0.00    1.63   0.69  43.20

Duplicate IP issues with Linux and virtual machine cloning.

TL;DR – Some modern Linux distributions use a newer method of machine identification (the systemd machine-id) which, when combined with DHCP, can result in duplicate IP addresses when cloning VMs, even when the VMs have unique MAC addresses.

To resolve, do the following ( remove file, run the systemd-machine-id-setup command, reboot):

# rm /etc/machine-id
# systemd-machine-id-setup
# reboot

When hypervisor management tools make clones of virtual machines, the tools usually make sure to create a unique MAC address for every clone. Combined with DHCP, this is normally enough to boot the clones and have them receive a unique IP. Recently, when I cloned several Bitnami guest VMs which are based on Debian, I started to get duplicate IP addresses on the clones. The issue can be resolved manually by following the above procedure.
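
A quick way to confirm that this is the problem is to compare the machine-id across two clones. If the values match, the clones present the same identifier to the DHCP server regardless of their unique MAC addresses (the interface name eth0 below is just an example):

# run on each clone and compare the output
cat /etc/machine-id
ip addr show eth0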

To create a VM template to clone from, which will generate a new machine-id for every clone, simply create an empty /etc/machine-id file (do not rm the file, otherwise a new machine-id will not be generated):

# echo "" |  tee /etc/machine-id 

The machine-id man page is a well written explanation of the implementation and motivation.

View from Nutanix storage during Postgres DB benchmark

Following on from the previous [1] [2] experiments with Postgres and pgbench, here is a quick look at how the workload is seen from the Nutanix CVM.

The Linux VM running postgres has two virtual disks:

  • One is taking transaction log writes.
  • The other is doing reads and writes from the main datafiles.

Since the database size is small (50% the size of the Linux RAM) – the data is mostly cached inside the guest, and so most reads do not hit storage. As a result we only see writes going to the DB files.

Additionally, we see that the database datafile writes arrive in a bursty fashion, and that these write bursts are more intense (~10x) than the log file writes.

Charts from Prometheus/Grafana showing IO rates seen from the perspective of the Linux guest VM

Despite the database flushes occurring in bursts with a decent amount of concurrency, the Nutanix CVM provides an average of 1.5ms write response time.

From the Nutanix CVM port 2009 handler, we can access the individual vdisk statistics. In this particular case vDisk 45269 is the data file disk, and 40043 is the database transaction log disk.

Datafile writes completed in 1.5millisecond average – despite deep queues during burst

The vdisk categorizer correctly identifies the database datafile write pattern as highly random.

Writes to the database datafiles are almost entirely random

As a result, the writes are passed into the replicated oplog.

The burst of writes hits the oplog as expected

Meanwhile the log writes are categorized as mostly sequential, which is expected for a database log file workload.

Meanwhile, log file writes are mostly categorized as sequential.

Even though the log writes are sequential, they are low-concurrency and small size (looks like mostly 16K-32K). This write pattern is also a good candidate for oplog.

These low-concurrency log writes also hit oplog

Benchmarking with Postgres PT2

In this example we run pgbench with a scale factor of 1000, which equates to a database size of around 15GB. The Linux VM has 32GB RAM, so we don’t expect to see many reads.

Using prometheus with the Linux node exporter we can see the disk IO pattern from pgbench. As expected the write pattern to the log disk (sda) is quite constant, while the write pattern to the database files (sdb) is bursty.

pgbench with DB size 50% of Linux buffer cache.
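
The charts above are built from node_exporter disk metrics scraped by Prometheus. Queries along these lines (the metric name is from node_exporter 0.18; the device labels sda/sdb match the log and data disks described above) reproduce the two panels in Grafana:

rate(node_disk_written_bytes_total{device="sda"}[1m])   # log disk write throughput
rate(node_disk_written_bytes_total{device="sdb"}[1m])   # datafile write bursts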

I had to tune the parameter checkpoint_completion_target from 0.5 to 0.9, otherwise the SCSI stack became overwhelmed during checkpoints, which caused log writes to stall.
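
The change itself is a single line in postgresql.conf, followed by a reload (for example pg_ctl reload, or SELECT pg_reload_conf(); from psql):

checkpoint_completion_target = 0.9    # default is 0.5; spreads checkpoint writes over more of the checkpoint interval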

default pgbench – notice the sharp drop in log-writes before tuning.


Benchmarking with Postgres PT1

Image By Daniel Lundin

In this example, we use Postgres and the pgbench workload generator to drive some load in a virtual machine.  Assume a Linux virtual machine that has Postgres installed, specifically a Bitnami virtual appliance.

  • Once the VM has been started, connect to the console
  • Allow access to port 5432 (the postgres DB port), or alternatively allow ssh
$ sudo ufw allow 5432
  • Note the postgres user password (cat ./bitnami_credentials)
  •  Login to psql from the console or ssh
psql -U postgres
  • Optionally change the password (the password prompted for is the one from bitnami_credentials, for the postgres database user).
psql -U postgres
postgres=# alter user postgres with password 'NEW_PASSWORD';
postgres=# \q
  • Create a DB to run the pgbench workload.  In this case I name the db pgbench-sf10 for “Scale Factor 10”.  Scale Factors are how the size of the database is determined.
$ sudo -u postgres createdb pgbench-sf10
  • Initialise the DB with data ready to run the benchmark.  The “createdb” step just creates an empty schema.
    • -i means “initialize”
    • -s means “scale factor” e.g. 10
    • pgbench-sf10 is the database schema to use.  We use the one just created pgbench-sf10
$ sudo -u postgres pgbench -i -s 10 pgbench-sf10
  • Now run a workload against the DB schema called pgbench-sf10
$ sudo -u postgres pgbench pgbench-sf10
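
By default pgbench runs a short, single-client test. A longer and more concurrent run can be requested with the standard flags; the client count, thread count and duration below are just example values:

$ sudo -u postgres pgbench -c 10 -j 2 -T 300 pgbench-sf10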

The workload pattern, and load on the system will vary greatly depending on the scale factor.  

Scale-Factor    Working Set Size
1               23MB
10              157MB
100             1.7GB
1000            15GB
2500            37GB
5000            74GB
10000           147GB

 

 

Install a bitnami image to Nutanix AHV cluster.

One of the nice things about using public cloud is the ability to use pre-canned application virtual appliances created by companies like Bitnami.

We can use these same appliance images on Nutanix AHV to easily do a Postgres database benchmark.

Step 1. Get the bitnami image

wget  https://bitnami.com/redirect/to/587231/bitnami-postgresql-11.3-0-r56-linux-debian-9-x86_64.zip

Step 2. Unzip the file and convert the bitnami vmdk images to a single qcow2[1] file.

qemu-img convert -O qcow2 *.vmdk bitnami.qcow2

Put the bitnami.qcow2 image somewhere accessible to a browser that is connected to the Prism service, then upload it using “Image Configuration”.

Once the image is uploaded, it’s time to create a new VM based on that image.

Once booted, you’ll see the bitnami logo and you can configure the bitnami passwords, enable ssh etc. using the console.

Enable/disable ssh in bitnami images
Connecting to Postgres in bitnami images
Note – when you run “sudo -u postgres <some-psql-tool>”, the password it is asking for is the Postgres DB password (stored in ./bitnami_credentials), not any unix user password.

Once connected to the appliance we can use postgres and pgbench to generate simplistic database workload.

[1] Do this on a Linux box somewhere. For some reason the conversion failed using my qemu utilities installed via brew. Importing OVAs direct into AHV should be available in the future.

Paper: A Nine year study of filesystem and storage benchmarking

A 2007 paper that still has a lot to say on the subject of benchmarking storage and filesystems. It is primarily aimed at researchers and developers, but is relevant to anyone about to embark on a benchmarking effort.

  • Use a mix of macro and micro benchmarks
  • Understand what you are testing, cached results are fine – as long as that is what you had intended.

The authors are clear on why benchmarks remain important:

“Ideally, users could test performance in their own settings using real workloads. This transfers the responsibility of benchmarking from author to user. However, this is usually impractical because testing multiple systems is time consuming, especially in that exposing the system to real workloads implies learning how to configure the system properly, possibly migrating data and other settings to the new systems, as well as dealing with their respective bugs.”

We cannot expect end-users to be experts in benchmarking. It is our duty as experts to provide the tools (benchmarks) that enable users to make purchasing decisions without requiring years of benchmarking expertise.

Performance gains for postgres on Linux with hugepages

For this experiment I am using Postgres v11 on Linux 3.10 kernel. The goal was to see what gains can be made from using hugepages. I use the “built in” benchmark pgbench to run a simple set of queries.

Since I am interested in only the gains from hugepages I chose to use the “-S” parameter to pgbench which means perform only the “select” statements. Obviously this masks any costs that might be seen when dirtying hugepages – but it kept the experiment from having to be concerned with writing to the filesystem.

Experiment

The workstation has 32GB of memory. Postgres is given 16GB of memory using the parameter:

shared_buffers = 16384MB


pgbench creates a ~7.4GB database using a scale-factor of 500:

pgbench -i -s 500

Run the experiment like this

$ pgbench -c 10 -S -T 600 -P 1 pgbench

Result

Default : No hugepages :
tps = 62190.452850 (excluding connections establishing)

2MB Hugepages
tps = 66864.410968 (excluding connections establishing)
+7.5% over default

1GB Hugepages
tps = 69702.358303 (excluding connections establishing)
+12% over default

Enabling hugepages

Getting the default hugepages is as easy as entering a value into /etc/sysctl.conf. To allow for 16GB of hugepages I used the value of 8400, followed by “sysctl -p”

[root@arches gary]# grep huge /etc/sysctl.conf 
vm.nr_hugepages = 8400
[root@arches gary]# sysctl -p
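
The value 8400 comes from the shared_buffers setting: 16384MB / 2MB per hugepage = 8192 pages, rounded up to leave some headroom for Postgres' other shared memory. To make Postgres actually use the pages, the huge_pages parameter can also be set in postgresql.conf (the default is 'try'):

huge_pages = on    # 'on' fails at startup if enough hugepages are not available; 'try' falls back to normal pages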

To get 1GB hugepages, the kernel has to have it configured during boot e.g.

[root@arches boot]# grep CMDLINE /etc/default/grub
GRUB_CMDLINE_LINUX="rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=auto vconsole.keymap=us rhgb quiet rdblacklist=nouveau default_hugepagesz=1G hugepagesz=1G"

Then reboot the kernel

I used these excellent resources
How to modify the kernel command line
How to enable hugepages
and this great video on Linux virtual memory



Nutanix AES: Performance By Example PT2

How to improve large DB read performance by 2X

Nutanix AOS 5.10 ships with a feature called Autonomous Extent Store (AES).  AES effectively provides Metadata Locality to complement the existing data locality that has always existed.  For large datasets (e.g. a 10TB database with 20% hot data) we observe a 2X improvement in throughput for random access across the 2TB hot dataset.

In our experiment we deliberately size the active working-set to NOT fit into the metadata cache.   We uniformly access 2TB with a 100% random access pattern and record the time to access all 2TB.  On the same hardware with AES enabled – the time is cut in half.  As can be seen in the chart – the throughput is double, as expected.

It is the localization of metadata from AES that contributes to the 2X improvement.  AES keeps most of the metadata local to the node, so there is no need to fetch metadata across the wire.  Additionally, AES reduces the need to cache metadata in DRAM since local access is so fast. For very large datasets, retrieving metadata can contribute a large proportion of the access time.  This is true for all storage, so speeding up metadata resolution can make a dramatic improvement to overall throughput, as we demonstrate.

Nutanix AES: Performance By Example.

How to reduce database restore time by 50%


During .Next 2018 in London, Nutanix announced performance improvements in the core-datapath said to give up to 2X performance improvements.  Here’s a real-world example of that improvement in practice.

I am using X-Ray to simulate a 1TB data restore into an existing database.  Specifically the IO sizes are large (an even split of 64K, 128K, 256K, 512K and 1MB) and the pattern is 100% random across the entire 1TB dataset.

bssplit=64k/20:128k/20:256k/20:512k/20:1m/20
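
The X-Ray scenario drives this pattern through its own workload file, but an equivalent stand-alone fio job would look something like the sketch below; the queue depth, size and target device are assumptions, only the bssplit and the random write pattern come from the actual test.

[restore-sim]
ioengine=libaio
direct=1
rw=randwrite
bssplit=64k/20:128k/20:256k/20:512k/20:1m/20
iodepth=8
size=1t
filename=/dev/sdb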

Normally storage benchmarks using large IO sizes are performed serially, because it’s easier on the storage back-end.  That may be realistic for an initial load, but in this case we want to simulate a restore where the pattern is 100% random.

In this case the time to ingest 1TB drops by half when using Nutanix AOS 5.10 with the Autonomous Extent Store (AES) enabled vs. the previous traditional extent store.

This improvement is possible because with AES, inserting directly into the extent store is much faster.

For throughput sensitive, random workloads, AES can detect that it will be faster to skip the oplog. Skipping oplog allows AES to eliminate a network round trip to a remote oplog – and instead only make an RF2 copy for the Extent Store.    By contrast, when sustained, large random IO is funneled into oplog, the 10Gbit network  can become the bottleneck.  Even with faster networks, AES will still be a benefit because the CPU and SSD resource usage is also lower.  Unfortunately I only have 10Gbit networking in my lab!

The X-Ray files needed to run this test are on github

How scalable is my Nutanix cluster really?

In a previous post I showed a chart which plots concurrency [X-axis] against throughput (IOPS) on the Y-Axis.  Here is that plot again:

Experienced performance chart ogglers will notice the familiar pattern of Little's Law, whereby throughput (X) rises quickly as concurrency (N) is increased.  As we follow the chart to the right, the slope flattens out and we achieve a smaller increase in throughput, even as we increase concurrency by the same amount at each stage.  The flattening of the curve is best understood via Amdahl's Law.

Anyone who follows Dr. Neil Gunther and his Universal Scalability Law, will also recognize this curve.

The USL states that, taking the values of concurrency and throughput as inputs, we can in fact calculate the scalability of the system.  Specifically, we are able to calculate the key factors of contention and crosstalk, which limit absolute linear scalability and eventually result in less throughput as additional load is submitted once the capacity of the system is saturated.
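
For reference, the USL models throughput at concurrency N as

X(N) = λN / (1 + σ(N − 1) + κN(N − 1))

where λ is the throughput of a single stream, σ is the contention (serialization) term and κ is the crosstalk (coherency) term.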

I was fortunate to find both a very useful tool, and an easy-to-read summary of the USL from the Vivid Cortex site.  Both were written by Baron Schwartz.  I encourage anyone interested in scalability to check out his paper.

Using his Excel spreadsheet, I was able to input the numbers from my test and derive values that determine scalability.

Taking the largest number (0.074%), the “contention value” (i.e. the impact we expect due to Amdahl's Law), as the limit to linear scaling, we can say that for this particular cluster, running this particular (simplistic/synthetic) workload, the Nutanix cluster scales 99.926% linearly.  Although I did not crank up the concurrency beyond 576, the model shows us that this cluster will start to degrade performance if we try to push concurrency beyond 600 or so.  Again, the USL model is for this particular workload on this particular cluster.  Doubling the concurrency of the offered load to 1200 will only net us 500,000 IOPS according to the model.

The high linearity (99.926%) is expected. The workload is 100% read, and with the data-locality feature of Nutanix filesystem – we expect close to 100% scalability.

We will return to these measures of scalability in the future to look at more realistic workloads.

Here is the Excel sheet with my data: VividCortex_USL_Worksheet_v1

 

Working with the fio “random_distribution=pareto” parameter

The fio Pareto parameter allows us to create a workload that references a very large dataset but specifies a hotspot for the access pattern.  Here’s an example using the same setup as the ILM experiment, but using a Pareto value of 0.8.  My fio file looks like this:

[global]
ioengine=libaio
direct=1
time_based
norandommap
random_distribution=pareto:0.8

The experiment shows that with the access pattern as a Pareto distribution of 0.8 (meaning 20% of the overall dataset is “hot”), the ILM process happens much faster, since the hotspot is smaller and is identified faster than with a 100% uniform random access pattern.  We would expect a similar shape for any sort of caching mechanism.
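
For reference, a complete job file built around that distribution might look like the sketch below; only the random_distribution line comes from the experiment, the block size, run time and target device are assumptions.

[global]
ioengine=libaio
direct=1
time_based
runtime=600
norandommap
random_distribution=pareto:0.8

[hotspot-read]
rw=randread
bs=16k
iodepth=8
filename=/dev/sdb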

You are here. The art of HCI performance testing

At some point potential Hyper-converged infrastructure (HCI) users want to know – “How fast does this thing go?”.  The real question is “how do we measure that?”.

The simplest test is to run a single VM, with a single disk, and issue a single IO at a time.  We often see this sort of test in bake-offs, and such a test does answer an important question: “what’s the lowest possible response time I can expect from the storage?”

However, this test only gives a single data point.  Since nobody purchases a HCI cluster to run a single VM,  we also need to know what happens when multiple VMs are run at the same time.  This is a much more difficult test to conduct, and many end-users lack access and experience with tools that can give the full picture.

In the example below, the single VM, single vdisk, single IO result is at the very far left of the chart.  Since it’s impossible to read, I will tell you that the result is about 2,500 IOPS at ~400 microseconds.  (In fact, with one outstanding IO, if the IOPS are 2,500 the response time must be 400 microseconds: 1/2,500 = 0.0004 seconds.)

However with a single VM, the cluster is mostly idle, and has capacity to do much more work.   In this X-Ray test I add another worker VM doing the exact same workload pattern to every node in the cluster every 5 minutes.

By the time we reach the end of the test, the total IOPS have increased to around 600,000 and the response time only increased by an additional 400 microseconds.

In other words the cluster was able to achieve 240X the amount of work measured by the single VM on a single node with only a 2X increase in response time, which is still less than 1ms.
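
As a sanity check using the same Little's Law relationship: with roughly 800 microseconds of response time at the end of the test, the concurrency across the cluster works out to

N = X × R ≈ 600,000 IOPS × 0.0008 s ≈ 480 outstanding IOs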


The overall result is counter-intuitive, because the rate of change in IOPS (240X) is way out of line with the increase in response time (2X).  The single VM test is using only a fraction of the cluster capacity.

When comparing  HCI clusters to traditional storage arrays – you should expect the traditional array to outperform the cluster at the far left of the chart, but as work scales up the latent capacity of the HCI cluster is able to provide huge amounts of IO with very low response times.

You can run this test yourself by adding this custom workload to X-Ray

X-Ray scenario to demonstrate Nutanix ILM behavior.

Specifically, a customer wanted to see how performance changes (and how quickly) as data moves automatically from HDD to SSD as it is accessed.  The access pattern is 100% random across the entire disk.

In a hybrid Flash/HDD system, “cold” data (i.e. data that has not been accessed for a long time) is moved from SSD to HDD when the SSD capacity is exhausted.  At some point in the future that same data may become “hot” again, and so we want to make sure that the “newly hot” data is quickly moved back to the SSD tier.  The duration of the above chart is around 5 minutes, and we see that by around the 3 minute mark the entire dataset is resident on the SSD tier.

This X-ray test uses a couple of neat tricks to demonstrate ILM behavior.

  • Edit container preferences to send sequential data immediately to HDD
  • Overwriting data with NUL/Zero bytes frees the underlying data on Nutanix filesystem

To demonstrate ILM from HDD to SSD (and ultimately into the DRAM cache on the CVM) we first have to ensure that we have data on the HDD in the first place.  By default Nutanix AOS will always try to write new data to SSD.  To circumvent that behavior we can edit the container preferences.  We use the fact that the “prefill” will be a sequential workload, while the measured workload will be a random workload. To make the change, use “ncli” to change the “Sequential I/O Pri Order” to be HDD only.



In my case I happened to call my container “xray” since I didn’t want to change the default container. Now, when X-Ray executes the prefill stage, the data will land on HDD.

As a second requirement, we want to see what happens when IOs with different block sizes are issued, so that we can get a chart similar to this one. To achieve the desired behavior, we need to make sure that at the beginning of each test the data again resides on HDD.  The problem is that the data is up-migrated during the test.

To do this we do an initial overwrite of the entire disk with “NULL” bytes using the fio parameter “zero_buffers”.  This causes the data to be freed on the Nutanix filesystem.  Then we issue a normal profile with random data. Once the data is freed, we know that the new initial writes will go to HDD, because we edited the container to do so. The overall test pattern looks like this:

  • Create and clone VMs
  • Prefill with random data (Data will reside on HDD due to container edit)
  • Read disk with 16KB block size
  • Zero out the disks to remove/free the up-migrated data (see the fio sketch after this list)
  • Prefill the disks with random data
  • Read disk with 32KB block size
  • Zero out the disks
  • Prefill with random data
  • Read disk with 64KB block size
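
The zero-out stages above can be expressed as a stand-alone fio job along the lines of the sketch below; the target device is an assumption, and in the real test X-Ray drives the workload through its own scenario file.

[zero-out]
ioengine=libaio
direct=1
rw=write
bs=1m
zero_buffers
filename=/dev/sdb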

I have uploaded this x-ray test to GitHub : X-Ray Up-Migration test