Using Prometheus and Grafana to monitor a Nutanix Cluster.

Using a small python script we can liberate data from the “Analysis” page of prism element and send it to prometheus, where we can combine cluster metrics with other data and view them all on some nice Grafana dashboards.

Output

Prism Analysis Page vs Grafana Dashboard

Method

The method we will use here is to create a python script which pulls stats from Prism Element via API – and exposes them in a format that Prometheus can consume. The available metrics expose many interesting details – but are updated only every 20-30 seconds. This is enough to do trending, and fits nicely with the typical Prometheus scrape interval.

The metrics are aggregated under four buckets: per VM, per storage container, per host and cluster-wide.

Below is an example of creating the Per VM CPU panel – we divide the ppm (parts per million) metric by 10,000 to get a percentage, which is what we see on the Analysis page in Prism.
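
With the exporter script later in this post (which publishes a gauge called vms with vm_name and metric_name labels), the Grafana panel query might be something like the expression below – Grafana then draws one series per vm_name:

vms{metric_name="hypervisor_cpu_usage_ppm"} / 10000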

Useful metrics

Within these groupings/aggregations I have found the following metrics to be most useful for monitoring resource usage on my test cluster. For CPU usage the API returns what you would expect: for a VM we get the % of provisioned vCPU used, and for a host we get the % of physical CPU used.

Metric name – Metric description
controller_num_iops – Overall IO rate per VM, container, host or cluster, per second
controller_io_bandwidth_kBps – Overall throughput per VM, container, host or cluster, in kilobytes per second
controller_avg_io_latency_usecs – Average IO response time per VM, container, host or cluster, in microseconds
hypervisor_cpu_usage_ppm – CPU usage expressed as parts per million (divide by 10,000 to get %)
Some useful metrics

Python Script

Run the Python code below and supply the IP of a CVM (or the cluster VIP), a username and a password. e.g. I save the code below as entity_stats.py. Then point a Prometheus scraper at port 8000 on whatever host this code is running. The heavy lifting is done by the prometheus_client Python module.

$ python ./entity_stats.py --vip <CVM_IP_ADDRESS> --username admin --password <password>
import requests
from requests.auth import HTTPBasicAuth
import json
import pprint
import prometheus_client
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway,Info
from prometheus_client import start_http_server, Summary
import time
import random
import argparse
import os

global filter_spurious_response_times,spurious_iops_threshold

#Attempt to filter spurious response time values for very low IO rates
filter_spurious_response_times=True
spurious_iops_threshold=50
            
# Entity Centric groups the stats by entities (e.g. vms, containers, hosts) - counters are labels
def main():
    global username,password
    parser=argparse.ArgumentParser()
    parser.add_argument("-v","--vip",action="store",help="The Virtual IP for the cluster",default=os.environ.get("VIP"))
    parser.add_argument("-u","--username",action="store",help="The prism username",default=os.environ.get("PRISMUSER"))
    parser.add_argument("-p","--password",action="store",help="The prism password",default=os.environ.get("PRISMPASS"))
    args=parser.parse_args()
    vip=args.vip
    username=args.username
    password=args.password
    if not (vip and username and password):
        print("Need a vip, username and password")
        exit(1)
    
    check_prism_accessible(vip)
    #Instantiate the prometheus gauges to store metrics
    setup_prometheus_endpoint_entity_centric()
    #Start prometheus end-point on port 8000 after the Gauges are instantiated        
    start_http_server(8000)

    #Loop forever getting the metrics for all available entities (They may come and go)
    #then expose the metrics for those entities on prometheus exporter ready for scraping
    while(True):
        for family in ["containers","vms","hosts","clusters"]:
            entities=get_entity_names(vip,family)
            push_entity_centric_to_prometheus(family,entities)
        #The counters are meant for trending and are quite coarse
        #10s of  seconds is a reasonable scrape interval    
        time.sleep(10)

def setup_prometheus_endpoint_entity_centric():
    #
    # Setup gauges for VMs Hosts and Containers
    #
    global gVM,gHOST,gCTR,gCLUSTER
    prometheus_client.instance_ip_grouping_key()
    gVM = Gauge('vms', 'Stats grouped by VM',labelnames=['vm_name','metric_name'])
    gHOST = Gauge('hosts', 'Stats grouped by Physical Host',labelnames=['host_name','metric_name'])
    gCTR = Gauge('containers', 'Stats grouped by Storage Container',labelnames=['container_name','metric_name'])
    gCLUSTER = Gauge('cluster','Stats grouped by cluster',labelnames=['cluster_name','metric_name'])


def push_entity_centric_to_prometheus(family,entities):

    if family == "vms":
        gGAUGE=gVM
    if family == "containers":
        gGAUGE=gCTR
    if family == "hosts":
        gGAUGE=gHOST
    if family == "clusters":
        gGAUGE=gCLUSTER
    
     #Get data from the dictionary passed in and set the gauges
    for entity in entities:
            #Each family may use a different identifier for the entity name.
            if family == "containers":
                entity_name=entity["name"]
            if family == "vms":
                entity_name=entity["vmName"]
            if family == "hosts":
                entity_name=entity["name"]
            if family == "clusters":
                entity_name=entity["name"]
            # regardless of the family, the stats are always stored in a
            # structure called stats.  Within the stats structure the data
            # is laid out as Key:Value.  We just walk through and make a prometheus
            # gauge for whatever we find
            for metric_name in entity["stats"]:
                stat_value=entity["stats"][metric_name]
                if any(prefix in metric_name for prefix in ["controller","hypervisor","guest"]):
                    print(entity_name,metric_name,stat_value)
                    gid=gGAUGE.labels(entity_name,metric_name)
                    gid.set(stat_value)
            #Overwrite value with -1 if IO rate is below spurious IO rate threshold.  This is 
            #to avoid misleading response times for entities that are doing very little IO
            if filter_spurious_response_times:
                print("Supressing spurious values - entity centric - family",entity_name,family)
                read_rate_iops=entity["stats"]["controller_num_read_iops"]
                write_rate_iops=entity["stats"]["controller_num_write_iops"]
                rw_rate_iops=entity["stats"]["controller_num_iops"]

                if (int(read_rate_iops)<spurious_iops_threshold):
                    print("read iops too low, suppressing read response times")
                    gGAUGE.labels(entity_name,"controller_avg_read_io_latency_usecs").set(-1)
                if (int(write_rate_iops)<spurious_iops_threshold):
                    print("write iops too low, suppressing write response times")
                    gGAUGE.labels(entity_name,"controller_avg_write_io_latency_usecs").set(-1)
                if (int(rw_rate_iops)<spurious_iops_threshold):
                    print("overall iops too low, suppressing overall response times")
                    gGAUGE.labels(entity_name,"controller_avg_io_latency_usecs").set(-1)

def get_entity_names(vip,family):
    requests.packages.urllib3.disable_warnings()
    v1_stat_VM_URL="https://"+vip+":9440/PrismGateway/services/rest/v1/"+family+"/"
    response=requests.get(v1_stat_VM_URL, auth=HTTPBasicAuth(username,password),verify=False)
    response.raise_for_status()
    result=response.json()
    entities=result["entities"]
    return entities

def check_prism_accessible(vip):
    #Check name resolution
    url="http://"+vip

    status = None
    message = ''
    try:
        resp = requests.head(url)
        status = str(resp.status_code)
    except Exception as e:
        if ("[Errno 11001] getaddrinfo failed" in str(e) or     # Windows
            "[Errno -2] Name or service not known" in str(e) or # Linux
            "[Errno 8] nodename nor servname " in str(e)):      # OS X
            message = 'DNSLookupError'
        else:
            raise
    return url, status, message

if __name__ == '__main__':
    main()
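
Once the exporter is running, the corresponding scrape job in prometheus.yml might look something like the sketch below – the job name and <exporter-host> are placeholders for wherever entity_stats.py is running, and 8000 is the port the script listens on:

scrape_configs:
  - job_name: 'nutanix_entity_stats'
    scrape_interval: 30s
    static_configs:
      - targets: ['<exporter-host>:8000']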

A Nutanix / Prometheus exporter in bash

Overview

For a fun afternoon project, how about a retro prometheus exporter using Apache/nginx, cgi-bin and bash!?

About prometheus format

A Prometheus exporter simply has to return a page with metric names and metric values in a particular format like below.

ntnx_bash{metric="cluster_read_iops"} 0
ntnx_bash{metric="cluster_write_iops"} 1

When you configure Prometheus via prometheus.yml you’re telling Prometheus to visit a particular IP:Port over HTTP and ask for a page called metrics. If the “page” called metrics is a script, the script just has to return (print) data in the expected format, and Prometheus will accept that as a basic “exporter”. The idea here is to write a very simple exporter in bash that connects to a Nutanix cluster, hits the stats API and returns IOPS data for a given container in the correct format.
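
As a rough sketch of the idea (this assumes a CGI-enabled web server plus curl and jq on the host, and uses the same v1 containers endpoint as the Python exporter above – the CVM address, credentials and container name are placeholders):

#!/bin/bash
# cgi-bin "metrics" script sketch (assumes curl and jq are installed)
CVM="CVM_IP_ADDRESS"        # CVM or cluster VIP - placeholder
USER="admin"                # placeholder credentials
PASS="password"
CONTAINER="my-container"    # storage container to report on - placeholder

# CGI scripts must print a header followed by a blank line before the body
echo "Content-type: text/plain"
echo ""

# Pull container stats from the v1 API and print them in Prometheus format
curl -sk -u "$USER:$PASS" "https://$CVM:9440/PrismGateway/services/rest/v1/containers/" |
  jq -r --arg ctr "$CONTAINER" '
    .entities[]
    | select(.name == $ctr)
    | "ntnx_bash{metric=\"cluster_read_iops\"} \(.stats.controller_num_read_iops)",
      "ntnx_bash{metric=\"cluster_write_iops\"} \(.stats.controller_num_write_iops)"'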

Continue reading

Linux memory monitoring (allocations Vs usage)

How to use some of Linux’s standard tools and how different types of memory usage show up.

Examples of using malloc and writing to memory with three use-cases for a simple process

In each case we run the example with a 64MB allocation so that we can see the usage from standard linux tools.

We do something like this

gary@linux:~/git/unixfun$ ./malloc_and_write 65536
Allocating 65536 KB
Allocating 67108864 bytes
The address of your memory is 0x7fa2829ff010
Hit <return> to exit
Continue reading

Using iperf multi-stream may not work as expected

Running iperf with parallel threads

TL;DR – When running iperf with parallel threads/workers the -P option must be specified after the -c <target-IP> option. This is mentioned in the manpage, but some options (-t for instance) work in any order, while others (specifically -P for parallel threads) definitely do not, which is a bit confusing.

For example – these two invocations of iperf give very different results

  • iperf -P 5 -c 10.56.68.97 (the -P before -c) – yields 20.4 Gbits/sec
  • iperf -c 10.56.68.97 -P 5 (the -P after the -c) – yields 78.3 Gbits/sec
Continue reading

mpstat has an option to show utilization per NUMA node

Not sure how long this has been a thing, but I recently discovered that mpstat takes a -N option for “NUMA Node” that works in the same way as -P for “Processor”. e.g. $ mpstat -N 0,1 1 will show stats for NUMA nodes 0 and 1 every 1 second. Just like mpstat -P ALL shows all processors mpstat -N ALL shows all NUMA nodes (and is easier to type).

The output looks like this

05:09:17 PM NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
05:09:18 PM    0    1.13    0.00    9.30    0.00    0.28    0.15    0.00   31.78    0.00   57.21
05:09:18 PM    1    0.40    0.00    8.03    0.00    0.28    1.05    0.00   31.34    0.00   58.78
^C

Average:    NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       0    0.80    0.00    8.56    0.00    0.27    0.11    0.00   36.49    0.00   53.78
Average:       1    0.49    0.00   10.02    0.00    0.32    1.01    0.00   25.13    0.00   63.03

Using mpstat -N it is quite easy to check how the workload is distributed among the NUMA nodes of a multi-socket machine.

Running the ML-Perf Storage benchmark on Nutanix files.

Some technical notes on our submission to the benchmark committee.

Background

For the past few months engineers from Nutanix have been participating in the MLPerf™ Storage benchmark, which is designed to measure the storage performance required for ML training workloads.

We are pleased with how well our general-purpose file-server has delivered against the demands of this high throughput workload.  

Benchmark throughput and dataset
  • 125,000 files
  • 16.7 TB of data, around 30% of the usable capacity (no “short stroking”)
  • Filesize 57-213MB per file
  • NFS v4 over Ethernet
  • 5GB/s delivered per compute node from single NFSv4 mountpoint
  • 25GB/s delivered across 5 compute nodes from single NFSv4 mountpoint

The dataset was 125,000 files consuming 16.7 TB, and the file sizes ranged from 57 MB to 213 MB. There is no temporal or spatial hotspot (meaning that the entire dataset is read) so there is no opportunity to cache the data in DRAM – the data is being accessed from NVMe flash media. For our benchmark submission we used standard NFSv4 and standard Ethernet, with the same Nutanix file-serving software that already powers everything from VDI users’ home directories to medical images and more. No InfiniBand or special-purpose parallel filesystems were harmed in this benchmark submission.

Continue reading

Viewing Nutanix cluster metrics in prometheus/grafana

Using Nutanix API with prometheus push-gateway.

Many customers would like to view their cluster metrics alongside existing performance data using Prometheus/Grafana.

Currently Nutanix does not provide a native exporter for Prometheus to use as a datasource. However, we can use the Prometheus push-gateway and a simple script which pulls from the native APIs to get data into Prometheus. From there we can use Grafana or anything else that can connect to Prometheus.

The goal is to be able to view cluster metrics alongside other Grafana dashboards – for example, showing the current read/write IOPS that the cluster is delivering on a per-container basis. I’m hard-coding IPs and usernames/passwords in the script, which obviously is not production grade, so don’t do that.
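
As a rough sketch of the push-gateway approach (assuming the prometheus_client Python module, a push-gateway reachable at a placeholder address, and the same v1 containers endpoint used elsewhere on this site – the IP, credentials and names are placeholders, not production code):

import requests
from requests.auth import HTTPBasicAuth
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

CVM = "CVM_IP_ADDRESS"            # CVM or cluster VIP - placeholder
USER, PASS = "admin", "password"  # placeholder credentials
PUSHGW = "pushgateway:9091"       # address of the prometheus push-gateway - placeholder

requests.packages.urllib3.disable_warnings()

registry = CollectorRegistry()
g = Gauge('containers', 'Stats grouped by Storage Container',
          labelnames=['container_name', 'metric_name'], registry=registry)

# Pull per-container stats from the v1 API
resp = requests.get("https://" + CVM + ":9440/PrismGateway/services/rest/v1/containers/",
                    auth=HTTPBasicAuth(USER, PASS), verify=False)
resp.raise_for_status()
for entity in resp.json()["entities"]:
    for metric in ("controller_num_read_iops", "controller_num_write_iops"):
        g.labels(entity["name"], metric).set(entity["stats"][metric])

# Push the whole registry to the push-gateway; Prometheus then scrapes the gateway
push_to_gateway(PUSHGW, job='nutanix_containers', registry=registry)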

Continue reading

Effects of CPU topology on sqlserver guests with AHV.

VM CPU Topology

The topology (layout) in which AHV presents virtual sockets/CPUs to the guest operating system will usually be different from the physical topology. This is expected because we typically present a subset of all cores to the guest VMs.

Usually it is the total number of vCPUs given to the VM that matters, not the specific topology, but in the case of SQLserver running an analytical workload (a TPC-H like workload from HammerDB) the topology passed to the VM does make a difference – between 10% and 20% when measured by total runtime.

[I think that the reason we see a difference here is that (a) the analytical workloads use hardly any storage bandwidth (I sized the database to fit in memory) and (b) there is probably a lot of cross-talk between the cores/memory as the DB engine issues parallel queries.]

At any rate, we see that passing 20 cores as “20 sockets of 1 core” beats the performance of “1 socket with 20 cores” by a wide margin. The physical topology is two sockets with 20 cores per socket. Thankfully the better-performing option is the default.

CPU Topology may make a difference for SQL server running analytical workloads.
Continue reading

fio versions < 3.3 may show inflated random write performance

TL;DR

If your storage system implements inline compression, performance results for small-IO-size random writes using time_based and runtime may be inflated with fio versions < 3.3, due to fio generating unexpectedly compressible data when using fio’s default data pattern. Although unintuitive, performance can often be increased by enabling compression, especially if the bottleneck is the storage media, replication or a combination of both.

fio 2.8.1 vs fio 3.33 data patterns

Therefore if you are comparing performance results generated using fio version < 3.3 and fio >= 3.3, the random write performance on the same storage platform may appear reduced with more recent fio versions.

fio-3.3 was released in December 2017, but older fio versions are still in use, particularly on distributions with long-term support (LTS). For instance Ubuntu 16, which is supported until 2026, ships with fio-2.2.10.
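
If you are stuck on an older fio, one way to take the default data pattern out of the equation is to control the buffer contents explicitly in the job file. Below is a sketch using the standard refill_buffers and buffer_compress_percentage options – the job parameters shown are only illustrative, not the workload used in this post:

[global]
bs=8k
rw=randwrite
direct=1
time_based
runtime=30
size=1G
refill_buffers
buffer_compress_percentage=0

[job1]
filename=fiofile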

Continue reading

Specifying Drive letters with fio for Windows.

fio on Windows

Download pre-compiled fio binary for Windows

Example fio windows file, single drive

This will create a 1GB file called fiofile on the F:\ drive in Windows, then read the file. Notice that the specification is “Driveletter” “Backslash” “Colon” “Filename”.

In fio terms we are “escaping” the : character, which fio traditionally uses as a filename separator.

[global]
bs=1024k
size=1G
time_based
runtime=30
rw=read
direct=1
iodepth=8

[job1]
filename=F\:fiofile 
Continue reading

How to monitor SQLServer on Windows with Prometheus

TL;DR

  • Enable SQLServer agent in SSMS
  • Install the Prometheus Windows exporter from GitHub; the installer is in the Assets section near the bottom of the page
  • Install the Prometheus scraper/database on your monitoring server/laptop via the appropriate installer
  • Point a browser at the Prometheus server, e.g. <PROMETHEUS_IP>:9090
    • Add a new target, which will be the Windows exporter installed in the previous step (see the example scrape config after this list)
    • It will be something like <SQLSERVERIP>:9182/metrics
    • Ensure the Target shows “Green”
  • Check that we can scrape SQLserver transactions. In the search/execute box enter something like this
    rate(windows_mssql_sqlstats_batch_requests[30s])*60
  • Put the SQLserver under load with something like HammerDB
  • Hit Execute on the Prometheus server search box and you should see a transaction rate similar to HammerDB
  • Install Grafana and point it at the Prometheus server (there are many examples of how to do this)
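
For the “add a new target” step, the scrape job in prometheus.yml might look something like the sketch below – the job name is arbitrary and <SQLSERVERIP> is the Windows host running the exporter:

scrape_configs:
  - job_name: 'windows_sql'
    static_configs:
      - targets: ['<SQLSERVERIP>:9182']
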
Continue reading

Generate load on Microsoft SQLserver Windows from HammerDB on Linux

HammerDB on Linux driving load to Windows SQL Server

Often it’s nice to be able to drive Windows applications and databases from Linux, especially if you are more comfortable in a Unix environment. This post will show you how to drive a Microsoft SQL Server database running on a Windows server from a remote Linux machine. In this example I am using Ubuntu 22.04, SQLserver 2019, Windows 11 and HammerDB 4.4.

Continue reading

QCOW 3 Ways

How to mount QCOW images as Linux block devices

tl;dr
  • guestmount (requires libguestfs-tools) sudo guestmount -d <vm-name> --ro -i <mountpoint>
  • qemu-nbd (requires the nbd driver)
    • Load the kernel module modprobe nbd max_part=8
    • Bind the device to the image qemu-nbd --connect=/dev/nbd0 <vmdiskimage.qcow>
    • Assuming partition #1 is the target: mount /dev/nbd0p1 /a
  • loopback mount. Requires converting qcow to raw
    • Convert qcow to raw: qemu-img convert -f qcow2 -O raw vmdisk.qcow2 vmdisk.raw
    • Create a loopback device losetup -f -P vmdisk.raw
    • Locate name of loopback device losetup -l | grep vmdisk.raw
    • Mount (assuming partition #1 on loopback device 99): mount /dev/mapper/loop99p1 /a
Continue reading

Using cloud-init with AHV command line

TL;DR

  • Using cloud-init with AHV is conceptually identical to using KVM/QEMU – we just need to use a few different tools with AHV
  • You will need a Linux image that is configured to use cloud-init. A good source is cloud-images.ubuntu.com
  • We will create a cloud-init textual file and create a mountable version using the cloud-localds tool on a Linux host (see the sketch after this list)
  • We will attach the cloud-init enabled Ubuntu image and our cloud-init customization file to the VM at boot time
  • At boot time Ubuntu will access the cloud-init data mounted as a CDROM and do the customization for us
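
As a rough sketch of the customization step (the file names and user-data contents below are only examples – adapt them to your own image and credentials; cloud-localds is part of the cloud-image-utils package on Ubuntu), the user-data file is a minimal cloud-config:

#cloud-config
hostname: cloudinit-demo
ssh_pwauth: true
password: examplepassword
chpasswd:
  expire: false

Then build the mountable seed image from it:

$ cloud-localds seed.img user-data

The resulting seed.img can then be uploaded to the AHV image service and attached to the VM as a CD-ROM alongside the Ubuntu cloud image.
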
Continue reading

Comparing RDS and Nutanix Cluster performance with HammerDB

tl;dr

In a recent experiment we compared an Amazon RDS instance with a VM running in an on-prem Nutanix cluster, both using Skylake-class processors with similar clock speeds and vCPU counts. The SQLServer database on Nutanix delivered almost 2X the transaction rate of the same workload running on Amazon RDS.

It turns out that migrating an existing SQLServer VM to RDS using the same vCPU count as on-prem may yield only half the expected performance for CPU-heavy database workloads. The root cause is how Amazon thinks about vCPUs compared to on-prem.

Benchmark Results

HammerDB results from RDS and Nutanix
Continue reading

Single threaded DB performance on Nutanix HCI

tl;dr

A Nutanix cluster can persist a replicated write across two nodes in around 250 microseconds, which is critical for single-threaded DB write workloads. The performance compares very well with hosted cloud database instances using the same class of processor (db.r5.4xlarge in the figure below). The metrics below are for SQL insert transactions, not the underlying IO.

Single threaded commit heavy insert rates. Latency as seen from SQL insert statement.
Continue reading