Using a small Python script, we can liberate data from the “Analysis” page of Prism Element and send it to Prometheus, where we can combine cluster metrics with other data and view them all on some nice Grafana dashboards.
Output
[Screenshots: Prism Analysis page | Grafana dashboard]
Method
The method used here is a Python script that pulls stats from Prism Element via its REST API and exposes them in a format that Prometheus can consume. The available metrics expose many interesting details, but they are updated only every 20-30 seconds. That is enough for trending, and it fits nicely with a typical Prometheus scrape interval.
The metrics are aggregated under four buckets: per VM, per storage container, per host, and cluster-wide.
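To make "a format that Prometheus can consume" concrete, here is a minimal sketch of what one scraped sample looks like in the Prometheus text exposition format. The script in this post relies on the prometheus_client module to produce this output; the hand-rolled `format_sample` helper below is hypothetical and exists only for illustration, as are the VM name and value. The gauge name `vms` and the labels `vm_name` and `metric_name` match those used by the script.

```python
def format_sample(gauge_name, labels, value):
    """Render one sample as a line of Prometheus text exposition format.

    Illustration only: handles backslash and double-quote escaping in
    label values, nothing more.
    """
    label_text = ",".join(
        '{}="{}"'.format(key, str(val).replace("\\", "\\\\").replace('"', '\\"'))
        for key, val in labels.items()
    )
    return "{}{{{}}} {}".format(gauge_name, label_text, float(value))


# Hypothetical VM name and value, for illustration only.
line = format_sample(
    "vms",
    {"vm_name": "test-vm-01", "metric_name": "controller_num_iops"},
    "125",
)
print(line)
# vms{vm_name="test-vm-01",metric_name="controller_num_iops"} 125.0
```

Each bucket becomes one gauge, and every counter returned by the API becomes a `metric_name` label value on that gauge.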
As an example, to create the per-VM CPU panel we divide the PPM metric by 10,000 to get a percentage, which is what the Analysis page in Prism shows.
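The arithmetic is simple enough to check by hand: one million PPM is 100%, so dividing by 10,000 yields a percentage. In a Grafana panel the same division would typically be done in the query itself, for example `vms{metric_name="hypervisor_cpu_usage_ppm"} / 10000`, assuming the gauge and label names used by the script in this post. The short sketch below shows the same conversion in Python; the function name `ppm_to_percent` is my own and not part of any API.

```python
def ppm_to_percent(ppm):
    """Convert a parts-per-million reading to a percentage.

    Accepts strings as well as numbers, since stats may arrive as text.
    """
    return float(ppm) / 10_000


print(ppm_to_percent("250000"))   # 25.0
print(ppm_to_percent(1_000_000))  # 100.0
```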
Useful metrics
Within these groupings, I have found the following metrics to be the most useful for monitoring resource usage on my test cluster. For CPU usage, the API seems to return what you would expect: for a VM we get the percentage of provisioned vCPU used, and for a host we get the percentage of physical CPU used.
Metric name | Metric description |
controller_num_iops | Overall IO operations per second per VM, container, host or cluster |
controller_io_bandwidth_kBps | Overall throughput per VM, container, host or cluster in kilobytes per second |
controller_avg_io_latency_usecs | Average IO response time per VM, container, host or cluster in microseconds |
hypervisor_cpu_usage_ppm | CPU usage expressed as parts per million (divide by 10,000 to get %) |
Python Script
Save the code below (for example as entity_stats.py) and run it, supplying the IP of a CVM (or the cluster VIP), a username and a password. The same values can also be supplied through the VIP, PRISMUSER and PRISMPASS environment variables. Then point a Prometheus scrape job at port 8000 on the host where the script is running. The heavy lifting is done by the prometheus_client Python module.
$ python ./entity_stats.py --vip <CVM_IP_ADDRESS> --username admin --password <password>
import argparse
import os
import sys
import time

import requests
from requests.auth import HTTPBasicAuth
from prometheus_client import Gauge, start_http_server

# Attempt to filter spurious response time values for very low IO rates
filter_spurious_response_times = True
spurious_iops_threshold = 50

# Entity Centric groups the stats by entities (e.g. vms, containers, hosts) - counters are labels


def main():
    global username, password
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--vip", action="store", help="The Virtual IP for the cluster", default=os.environ.get("VIP"))
    parser.add_argument("-u", "--username", action="store", help="The prism username", default=os.environ.get("PRISMUSER"))
    parser.add_argument("-p", "--password", action="store", help="The prism password", default=os.environ.get("PRISMPASS"))
    args = parser.parse_args()

    vip = args.vip
    username = args.username
    password = args.password
    if not (vip and username and password):
        print("Need a vip, username and password")
        sys.exit(1)

    # Stop early if the name does not resolve
    _, _, message = check_prism_accessible(vip)
    if message:
        print("Cannot reach", vip, "-", message)
        sys.exit(1)

    # Instantiate the prometheus gauges to store metrics
    setup_prometheus_endpoint_entity_centric()
    # Start prometheus end-point on port 8000 after the Gauges are instantiated
    start_http_server(8000)

    # Loop forever getting the metrics for all available entities (they may come and go)
    # then expose the metrics for those entities on prometheus exporter ready for scraping
    while True:
        for family in ["containers", "vms", "hosts", "clusters"]:
            entities = get_entity_names(vip, family)
            push_entity_centric_to_prometheus(family, entities)
        # The counters are meant for trending and are quite coarse
        # 10s of seconds is a reasonable scrape interval
        time.sleep(10)


def setup_prometheus_endpoint_entity_centric():
    #
    # Setup gauges for VMs, Hosts, Containers and the Cluster
    #
    global gVM, gHOST, gCTR, gCLUSTER
    gVM = Gauge('vms', 'Stats grouped by VM', labelnames=['vm_name', 'metric_name'])
    gHOST = Gauge('hosts', 'Stats grouped by Physical Host', labelnames=['host_name', 'metric_name'])
    gCTR = Gauge('containers', 'Stats grouped by Storage Container', labelnames=['container_name', 'metric_name'])
    gCLUSTER = Gauge('cluster', 'Stats grouped by cluster', labelnames=['cluster_name', 'metric_name'])


def push_entity_centric_to_prometheus(family, entities):
    if family == "vms":
        gGAUGE = gVM
    elif family == "containers":
        gGAUGE = gCTR
    elif family == "hosts":
        gGAUGE = gHOST
    elif family == "clusters":
        gGAUGE = gCLUSTER

    # Get data from the dictionary passed in and set the gauges
    for entity in entities:
        # Each family may use a different identifier for the entity name.
        if family == "vms":
            entity_name = entity["vmName"]
        else:
            # containers, hosts and clusters all use "name"
            entity_name = entity["name"]

        # Regardless of the family, the stats are always stored in a
        # structure called stats. Within the stats structure the data
        # is laid out as Key:Value. We just walk through and set a prometheus
        # gauge for whatever we find
        for metric_name in entity["stats"]:
            stat_value = entity["stats"][metric_name]
            if any(prefix in metric_name for prefix in ["controller", "hypervisor", "guest"]):
                print(entity_name, metric_name, stat_value)
                gid = gGAUGE.labels(entity_name, metric_name)
                gid.set(stat_value)

        # Overwrite value with -1 if IO rate is below spurious IO rate threshold. This is
        # to avoid misleading response times for entities that are doing very little IO
        if filter_spurious_response_times:
            print("Suppressing spurious values - entity centric - family", entity_name, family)
            read_rate_iops = entity["stats"]["controller_num_read_iops"]
            write_rate_iops = entity["stats"]["controller_num_write_iops"]
            rw_rate_iops = entity["stats"]["controller_num_iops"]

            if int(read_rate_iops) < spurious_iops_threshold:
                print("read iops too low, suppressing read response times")
                gGAUGE.labels(entity_name, "controller_avg_read_io_latency_usecs").set(-1)
            if int(write_rate_iops) < spurious_iops_threshold:
                print("write iops too low, suppressing write response times")
                gGAUGE.labels(entity_name, "controller_avg_write_io_latency_usecs").set(-1)
            if int(rw_rate_iops) < spurious_iops_threshold:
                print("RW iops too low, suppressing overall response times")
                gGAUGE.labels(entity_name, "controller_avg_io_latency_usecs").set(-1)


def get_entity_names(vip, family):
    # The cluster certificate is not verified, so silence the resulting warnings
    requests.packages.urllib3.disable_warnings()
    v1_stat_URL = "https://" + vip + ":9440/PrismGateway/services/rest/v1/" + family + "/"
    response = requests.get(v1_stat_URL, auth=HTTPBasicAuth(username, password), verify=False)
    response.raise_for_status()
    result = response.json()
    entities = result["entities"]
    return entities


def check_prism_accessible(vip):
    # Check name resolution
    url = "http://" + vip
    status = None
    message = ''
    try:
        resp = requests.head(url)
        status = str(resp.status_code)
    except requests.exceptions.ConnectionError as err:
        # Inspect the error text (not the vip) for name-resolution failures
        error_text = str(err)
        if ("[Errno 11001] getaddrinfo failed" in error_text or  # Windows
                "[Errno -2] Name or service not known" in error_text or  # Linux
                "[Errno 8] nodename nor servname " in error_text):  # OS X
            message = 'DNSLookupError'
        else:
            raise
    return url, status, message


if __name__ == '__main__':
    main()
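The spurious-latency filter in the script is easier to reason about, and to test, when pulled out as a function with no side effects. The following is a sketch only, not part of the original script: the names `RATE_TO_LATENCY` and `mask_spurious_latencies` are hypothetical, and the sample values are made up. Unlike the script, it treats a missing IO-rate counter as zero rather than raising a KeyError, which is an assumption on my part.

```python
SPURIOUS_IOPS_THRESHOLD = 50  # same threshold as the script

# Maps each IO-rate counter to the latency counter it qualifies.
RATE_TO_LATENCY = {
    "controller_num_read_iops": "controller_avg_read_io_latency_usecs",
    "controller_num_write_iops": "controller_avg_write_io_latency_usecs",
    "controller_num_iops": "controller_avg_io_latency_usecs",
}


def mask_spurious_latencies(stats, threshold=SPURIOUS_IOPS_THRESHOLD):
    """Return a copy of stats with latency set to -1 where the IO rate is low."""
    masked = dict(stats)
    for rate_key, latency_key in RATE_TO_LATENCY.items():
        # Assumption: a missing rate counter is treated as 0 IOPS.
        if int(stats.get(rate_key, 0)) < threshold:
            masked[latency_key] = -1
    return masked


# Made-up sample: very few reads, plenty of writes.
sample = {
    "controller_num_read_iops": "3",
    "controller_num_write_iops": "400",
    "controller_num_iops": "403",
    "controller_avg_read_io_latency_usecs": "9000",
    "controller_avg_write_io_latency_usecs": "700",
    "controller_avg_io_latency_usecs": "760",
}
result = mask_spurious_latencies(sample)
print(result["controller_avg_read_io_latency_usecs"])   # -1
print(result["controller_avg_write_io_latency_usecs"])  # 700
```

In Grafana, the -1 values can then be filtered out of latency panels so that idle entities do not show misleading response times.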
One comment on “Using Prometheus and Grafana to monitor a Nutanix Cluster.”
This looks really handy! Is there any extra system load when scraping every 10 seconds like you have here? How far away do you think we are from having a native metrics endpoint in the cvm?