Using Prometheus and Grafana to monitor a Nutanix Cluster.

Posted on May 17, 2024June 10, 2024 by gary

Using a small python script we can liberate data from the “Analysis” page of prism element and send it to prometheus, where we can combine cluster metrics with other data and view them all on some nice Grafana dashboards.

Continue reading →

Viewing Nutanix cluster metrics in prometheus/grafana

Posted on July 31, 2023August 1, 2023 by gary

Using Nutanix API with prometheus push-gateway.

Many customers would like to view their cluster metrics alongside existing performance data using Prometheus/Grafana

Currently Nutanix does not provide a native exporter for Prometheus to use as a datasource. However we can use the prometheus push-gateway and a simple script which pulls from the native APIs to get data into prometheus. From there we can use Grafana or anything that can connect to Prometheus.

The goal is to be able to view cluster metrics alongside other Grafana dashboards. For example show the current Read/Write IOPS that the cluster is delivering on a per container basis. I’m hard-coding IPs and username/passwords in the script which obviously is not production grade, so don’t do that.

Continue reading →

Effects of CPU topology on sqlserver guests with AHV.

Posted on February 28, 2023February 28, 2023 by gary

VM CPU Topology

The topology (layout) that AHV presents virtual Sockets/CPU to the guest operating system will usually be different than the physical topology. This is expected because we typically present a subset of all cores to the guest VMs.

Usually it is the total number of vCPU given to the VM that matters, not the specific topology, but in the case of SQLserver running an analytical workload (a TPC-H like workload from HammerDB) the topology passed to the VM does make a difference. Between 10% and 20% when measured by the total runtime.

[I think that the reason we see a difference here is that (a) the analytical workloads use hardly any storage bandwidth (I sized the database to fit in memory) and (b) there is probably a lot of cross-talk between the cores/memory as the DB engine issues parallel queries.]

At any rate we see that passing 20 cores as “20 sockets of 1 core” beats the performance of “1 socket with 20 cores” by a wide margin. The physical topology is two sockets of 20 cores on each socket. Thankfully the better performing option is the default.

CPU Topology may make a difference for SQL server running analytical workloads.

Continue reading →

Beware of tiny working-set-sizes when testing storage performance.

Posted on July 1, 2022September 7, 2022 by gary

I was recently asked to investigate why Nutanix storage was not as fast as a competing solution in a PoC environment. When I looked at the output from diskspd, the data didn’t quite make sense.

Continue reading →

Comparing RDS and Nutanix Cluster performance with HammerDB

Posted on June 14, 2022January 3, 2023 by gary

tl;dr

In a recent experiment using Amazon RDS instance and a VM running in an on-prem Nutanix cluster, both using Skylake class processors with similar clock speeds and vCPU count. The SQLServer database on Nutanix delivered almost 2X the transaction rate as the same workload running on Amazon RDS.

It turns out that migrating an existing SQLServer VM to RDS using the same vCPU count as on-prem may yield only half the expected performance for CPU heavy database workloads. The root cause is how Amazon thinks about vCPU compared to on-prem.

Benchmark Results

Continue reading →

AOS 6.1 Improvements for Day-2 database operations.

Posted on March 8, 2022January 3, 2023 by gary

AOS 6.1 greatly improved database performance on Nutanix especially when the guest VM uses just a single disk for all the database files. The underlying change is known as vdisk sharding. Basically it allows the Nutanix CVM to scale up the number of threads used to service a single virtual disk under heavy load.

Continue reading →

How to performance test Nutanix on AWS with X-ray

Posted on August 18, 2020April 2, 2021 by gary

End to End Creation of a Nutanix Cluster on AWS and Running X-Ray

Continue reading →

How to download and Install Nutanix X-ray on an AHV cluster

Posted on August 10, 2020April 2, 2021 by gary

Impact of Data locality on DB workloads.

Posted on July 23, 2019October 8, 2020 by gary

Do database workloads benefit from data locality?

Continue reading →

Install a bitnami image to Nutanix AHV cluster.

Posted on June 28, 2019June 28, 2019 by gary

One of the nice things about using public cloud is the ability to use pre-canned application virtual appliances created by companies like Bitnami.

We can use these same appliance images on Nutanix AHV to easily do a Postgres database benchmark

Step 1. Get the bitnami image

wget  https://bitnami.com/redirect/to/587231/bitnami-postgresql-11.3-0-r56-linux-debian-9-x86_64.zip

Step 2. Unzip the file and convert the bitnami vmdk images to a single qcow2[1] file.

qemu-img convert *vmdk bitnami.qcow2

Put the bitnami.qcow2 image somewhere accessible to a browser, connected to the Prism service, then upload using the “Image Configuration”

Once the image is uploaded, it’s time to create a new VM based on that image

Once booted, you’ll see the bitnami logo and you can configure the bitnami passwords, enable ssh etc. using the console.

Enable/disable ssh in bitnami images
Connecting to Postgres in bitnami images
Note – when you “sudo -c postgres <some-psql-tool> the password it is asking for is the Postgres DB password (stored in ./bitnami-credentials) not any unix user password.

Once connected to the appliance we can use postgres and pgbench to generate simplistic database workload.

[1] Do this on a Linux box somewhere. For some reason the conversion failed using my qemu utilities installed via brew. Importing OVAs direct into AHV should be available in the future.

X-Ray scenario to demonstrate Nutanix ILM behavior.

Posted on October 25, 2018July 13, 2019 by gary

Specifically a customer wanted to see how performance changes (and how quickly) as data moves from HDD to SSD automatically as data is accessed. The access pattern is 100% random across the entire disk.

In a hybrid Flash/HDD system – “cold” data (i.e. data that has not been accessed for a long time) is moved from SSD to HDD when the SSD capacity is exhausted. At some point in the future – that same data may become “hot” again, and so we want to make sure that the “newly hot” data is quickly moved back to the SSD tier. The duration of the above chart is around 5 minutes – and we see that by, around the 3 minute mark the entire dataset is resident on the SSD tier.

This X-ray test uses a couple of neat tricks to demonstrate ILM behavior.

Edit container preferences to send sequential data immediately to HDD
Overwriting data with NUL/Zero bytes frees the underlying data on Nutanix filesystem

To demonstrate ILM from HDD to SSD (ad ultimately into the DRAM cache on the CVM) we first have to ensure that we have data on the HDD in the first place. By default Nutanix OS will always try to write new data to SSD. To circumvent that behavior we can edit the container preferences. We use the fact that the “prefill” will be a sequential workload, while the measured workload will be a random workload. To make the change, use “ncli” to change the ” Sequential I/O Pri Order” to be HDD only.

In my case I happened to call my container “xray” since I didn’t want to change the default container. Now, when X-Ray executes the prefill stage, the data will land on HDD. As a second requirement, we want to see what happens when IO with different size blocks are issued so that we can get a chart similar to this: To achieve the desired behavior, we need to make sure that, at the beginning of each test, the data, again resides on HDD. The problem is that the data is up-migrated during the test. To do this we do an initial overwrite of the entire disk with “NULL” bytes using a parameter in fio “zero_buffers”. This causes the data to be freed on the Nutanix filesystem. Then we issue a normal profile with random data. Once the data is freed, then we know that the new initial writes will go to HDD – because we edited the container to do so. The overall test pattern looks like this

Create and clone VMs
Prefill with random data (Data will reside on HDD due to container edit)
Read disk with 16KB block size
Zero out the disks – to remove/free the up-migrated data
Prefill the disks with Radom data
Read disk with 32KB block size
Zero out the disks
Prefill with random daa
Read disk with 64KB block size

I have uploaded this x-ray test to GitHub : X-Ray Up-Migration test

HCI Performance testing made easy (Part 4)

Posted on September 27, 2018October 4, 2018 by gary

What happens when power is lost to all nodes of a HCI Cluster?

Ever wondered what happens when all power is simultaneously lost on a HCI cluster? One of the core principles of cloud design is that components are expected to fail, but the cluster as a whole should stay “up”. We wanted to see what happens when all components fail at once, so we designed an X-Ray test to do exactly that.

We start an OLTP workload on every node in the cluster, then X-Ray connects to the IPMI port on each node, and powers off all the hosts while the cluster is under load. In particular, the cluster is under read/write load (we need write workload, because we want to force the cluster to recover in-flight writes).

After power-off, we wait 10 seconds for everything to spin down, then immediately re-apply the power by connecting to the IPMI ports.

The nodes power up, and immediately start their POST (Power On Self Test) and boot the hypervisor. The CVM will auto-start, discover the available nodes and form the cluster.

X-Ray polls the cluster manager (either Prism or vCenter) to determine that the cluster is “up” and then restarts the OLTP workload.

Our testing showed that our Nutanix cluster completed POST, and was ready to restart work in around 10 minutes. Moreover, the time to achieve the recovery had very little variability. The chart below shows three separate runs on the same cluster.

This is the YAML file which defines the workload. The full specification is on github. The key part of the YAML is the nodes.PowerOff which connects to the IMPI ports of each node and vm_group.WaitForPowerOn which connects to either Nutanix Prism or vmware vCenter and determines that the cluster is formed, and ready to accept new work.

The return of misaligned IO

Posted on May 25, 2017January 3, 2023 by gary

We have started seeing misaligned partitions on Linux guests runnning certain HDFS distributions. How these partitions became mis-aligned is a bit of a mystery, because the only way I know how to do this on Linux is to create a partition using old DOS format like this (using -c=dos and -u=cylinders) Continue reading →

Creating compressible data with fio.

Posted on October 18, 2016July 13, 2019 by gary

Today I used fio to create some compressible data to test on my Nutanix nodes. I ended up using the following fio params to get what I wanted.

buffer_compress_percentage=50
refill_buffers
buffer_pattern=0xdeadbeef

buffer_compress_percentage does what you’d expect and specifies how compressible the data is
refill_buffers Is required to make the above compress percentage do what you’d expect in the large. IOW, I want the entire file to be compressible by the buffer_compress_percentage amount
buffer_pattern This is a big one. Without setting this pattern, fio will use Null bytes to achieve compressibility, and Nutanix like many other storage vendors will suppress runs of Zero’s and so the data reduction will mostly be from zero suppression rather than from compression.

Much of this is well explained in the README for latest version of fio.

Also NOTE Older versions of fio do not support many of the fancy data creation flags, but will not alert you to the fact that fio is ignoring them. I spent quite a bit of time wondering why my data was not compressed, until I downloaded and compiled the latest fio.

SATA on Nutanix. Some experimental data.

Posted on June 16, 2015January 3, 2023 by gary

The question of why Nutanix uses SATA drive comes up sometimes, especially from customers who have experienced very poor performance using SATA on traditional arrays.

I can understand this anxiety. In my time at NetApp we exclusively used SAS or FC-AL drives in performance test work. At the time there was a huge difference in performance between SCSI and SATA. Even a few short years ago, FC typically spun at 15K RPM whereas SATA was stuck at about a 5K RPM, so experiencing 3X the rotational delay.

These days SAS and SATA are both available in 7200 RPM configurations, and these are the type we use in standard Nutanix nodes. In fact the SATA drives that we use are marketed by Seagate as “Nearline SAS” or NL-SAS. Mainly to differentiate them from the consumer grade SATA drives that are found in cheap laptops. There are hundreds of SAS Vs SATA articles on the web, so I won’t go over the theoretical/historical arguments.

SATA in Hybrid/Tiered Storage

In a Nutanix cluster the “heavy lifting” of IO is mainly done by the SSD’s – leaving the SATA drives to service the few remaining IO’s that miss the SSD tier. Under moderate load, the SATA spindles do pretty well, and since the SATA $/GB is only 60% of SAS. SATA seems like a good choice for mostly-cold data.

Let’s Experiment.

From a performance perspective, I decided to run a few experiments to see just how well SATA performs. In the test, the SATA drives are Nutanix standard drives “ST91000640NS” (Seagate, priced around $150). The comparable SAS drives are the same form-factor (2.5 Inch) “AL13SEB900” (Toshiba, priced at about $250 USD). These drives spin at 10K RPM. Both drives hold around 1TB.

There are three experiments per drive type to reveal the impact of seek-times. This is achieved using the “filesize” parameter of fio – which determines the LBA range to read. One thing to note, is that I use a queue-depth of one. Therefore IOPs can be calculated as simply 1/Response-Time (converted to seconds).

[global]
bs=8k
rw=randread 
iodepth=1 
ioengine=libaio 
time_based 
runtime=10 
direct=1 
filesize=1g 

[randread]
filename=/dev/sdf1

Random Distribution. SATA Vs SAS

Working Set Size	7.2K RPM SATA Response Time (ms)	10K RPM SAS Response Time (ms)
1 GB	5.5	4
100 GB	7.5	4.5
1000 GB	12.5	7

Zipf Distribution. SATA Only.

Working Set Size	Response Time (ms)
1000G	8.5

Somewhat intuitively as the working-set (seek) gets larger, the difference between “Real SAS” and “NL-SAS/SATA” gets wider. This is intuitive because with a 1GB working-set, the seek-time is close to zero, and so only the rotational delay (based on RPM) is a factor. In fact the difference in response time is the same as the difference in rotational speed (1:1.3).

Also (just for fun) I used the “random_distribution=zipf” function in fio to test the response time when reading across the entire range of the disk – but with a “hotspot” (zipf) rather than a uniform random read – which is pretty unrealistic.

In the “realistic” case – reading across the entire disk on the SATA drives shipped with Nutanix nodes is capable of 8.5 ms response time at 125 IOPS per spindle.

Conclusion

The performance difference between SAS and SATA is often over-stated. At moderate loads SATA performs well enough for most use-cases. Even when delivering fully random IO over the entirety of the disk – SATA can deliver 8K in less than 15ms. Using a more realistic (not 100% random) access pattern the response time is < 10ms.

For a properly sized Nutanix implementation, the intent is to service most IO from Flash. It’s OK to generate some work on HDD from time-to-time even on SATA.

Designing a scaleout storage platform.

Posted on June 12, 2014March 5, 2017 by gary

I was speaking to one of our developers the other day, and he pointed me to the following paper: SEDA: An Architecture for Well-Conditioned, Scalable Internet Services as an example of the general philosophy behind the design of the Nutanix Distributed File System (NDFS).

Although the paper uses examples of both a webserver and a gnutella client, the philosophies are relevant to a large scale distributed filesystem. In the case of NDFS we are serving disk blocks to clients who happen to be virtual machines. One trade-off that is true in both cases is that scalability is traded for low latency in the single-stream case. However at load, the response time is generally better than a system that is designed to low-latency, and then attempted to scale-up to achive high throughput.

At Nutanix we often talk about web-scale architectures, and this paper gives a pretty solid idea of what that might mean in concrete terms.

FWIW., according to google scholar, the paper has been cited 937 times, including Cassandra which is how we store filesystem meta-data in a distributed fashion.

n0derunner

nutanix

Using Prometheus and Grafana to monitor a Nutanix Cluster.

Viewing Nutanix cluster metrics in prometheus/grafana

Using Nutanix API with prometheus push-gateway.

Effects of CPU topology on sqlserver guests with AHV.

VM CPU Topology

Beware of tiny working-set-sizes when testing storage performance.

Comparing RDS and Nutanix Cluster performance with HammerDB

tl;dr

Benchmark Results

AOS 6.1 Improvements for Day-2 database operations.

How to performance test Nutanix on AWS with X-ray

How to download and Install Nutanix X-ray on an AHV cluster

Impact of Data locality on DB workloads.

Install a bitnami image to Nutanix AHV cluster.

X-Ray scenario to demonstrate Nutanix ILM behavior.

HCI Performance testing made easy (Part 4)

What happens when power is lost to all nodes of a HCI Cluster?

The return of misaligned IO

Creating compressible data with fio.

SATA on Nutanix. Some experimental data.

SATA in Hybrid/Tiered Storage

Let’s Experiment.

Random Distribution. SATA Vs SAS

Zipf Distribution. SATA Only.

Conclusion

Designing a scaleout storage platform.