2M IOPS on a single VM with Nutanix HCI
How to generate a lot of IOPS to a single VM.

The Recipe for a four-node cluster
This recipe works for a four-node cluster with a 100 Gbit network. The disks use a small working-set size so that the data fits in the CVM cache. It’s a micro-benchmark, not meant to mimic the real world. Since I only have a four-node cluster in my lab, I need to minimize the work done by the CVMs, and keeping the data in cache helps. Thanks to the X-Ray team for inspiring this experiment and making it easy.
- Use Load Balanced Volume Groups (VGLB), which allow the disks of a single VM to be hosted across the cluster. In traditional HCI, the disks for a given user virtual machine (UVM) are owned by the CVM on the same host. While that is great for saving network bandwidth, it means we can’t bring the power of all the CVMs in the cluster to bear on a single VM.
- Make sure the user VM has multiple vdisks and enough vCPUs to drive the IO. It turns out that at high IO rates we need a decent amount of CPU cycles just to move the data around.
- Increase the number of FRODO threads that connect the user VM to its virtual disks. The default is 2 FRODO threads per VM, which is good for about 850,000 IOPS – already quite a lot. To get to 2M IOPS we need to raise that value; in my experiments I used one FRODO thread per user VM CPU.
- 12 Disks on the User VM
- 12 CPUs on the User VM
- 12 FRODO threads in AHV
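A quick in-guest check confirms the VM actually presents the expected resources (the `/dev/sd[h-s]` device range is an assumption matching the fio script later in this post; adjust to your own layout):

```shell
# Sanity check from inside the user VM: count vCPUs and data vdisks.
echo "vCPUs:  $(nproc)"
echo "vdisks: $(ls /dev/sd[h-s] 2>/dev/null | wc -l)"
```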
We can see that the disks are spread around the cluster.
This is the result of using VGLB and enabling load balancing with acli:
vg.update TESTVG6-100 load_balance_vm_attachments=true
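For context, here is a dry-run sketch of the acli steps that build such a load-balanced volume group end to end. The commands are printed rather than executed (run them on a CVM); the VM name and the 1G disk size are assumptions chosen to match the fio working set:

```shell
# Print the acli commands to create a VG with 12 disks, attach it to the
# user VM, and enable per-disk load balancing across the cluster.
VG=TESTVG6-100
VM=uvm1
echo "acli vg.create ${VG}"
for i in $(seq 1 12); do
  echo "acli vg.disk_create ${VG} create_size=1G"
done
echo "acli vg.attach_to_vm ${VG} ${VM}"
echo "acli vg.update ${VG} load_balance_vm_attachments=true"
```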

Then I use fio to create 12 jobs and pin those threads to the 12 CPUs:
[global]
numjobs=12
cpus_allowed=0-11
cpus_allowed_policy=split
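With `cpus_allowed_policy=split`, fio divides the allowed CPU set evenly across the jobs instead of letting every job float over all 12 CPUs. A sketch of the resulting mapping for this config:

```shell
# cpus_allowed=0-11 with cpus_allowed_policy=split: the 12 allowed CPUs
# are divided across the 12 jobs, one dedicated CPU per job.
for job in $(seq 0 11); do
  echo "job disk${job} -> cpu ${job}"
done
```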
When I run the job with 8 outstanding IOs (OIO) per disk, I get the following output:
2 million IOPS with an average response time of roughly 570 microseconds (a little over 0.5 milliseconds)
read: IOPS=2018k
lat (usec): avg=570.3
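A quick sanity check with Little’s law (IOPS = outstanding IOs ÷ mean latency) shows these numbers are self-consistent. Note that fio applies a global `numjobs` to every job section, so the full fio file below runs 12 × 12 = 144 workers, each with `iodepth=8`:

```shell
# Little's law check: throughput = outstanding IOs / mean latency.
# 12 [diskN] sections x numjobs=12 x iodepth=8 = 1152 outstanding IOs.
outstanding=$((12 * 12 * 8))
lat_us=570.3
awk -v q="$outstanding" -v l="$lat_us" \
  'BEGIN { printf "predicted IOPS: %.0f\n", q / (l / 1e6) }'
# ~2.02M predicted, which lines up with the measured IOPS=2018k
```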

The Full fio script
[global]
numjobs=12
cpus_allowed=0-11
cpus_allowed_policy=split
time_based
runtime=3600
ioengine=libaio
direct=1
group_reporting
iodepth=8
size=1g
[disk0]
bs=4k
filename=/dev/sdh
rw=randread
[disk1]
bs=4k
filename=/dev/sdi
rw=randread
[disk2]
bs=4k
filename=/dev/sdj
rw=randread
[disk3]
bs=4k
filename=/dev/sdk
rw=randread
[disk4]
bs=4k
filename=/dev/sdl
rw=randread
[disk5]
bs=4k
filename=/dev/sdm
rw=randread
[disk6]
bs=4k
filename=/dev/sdn
rw=randread
[disk7]
bs=4k
filename=/dev/sdo
rw=randread
[disk8]
bs=4k
filename=/dev/sdp
rw=randread
[disk9]
bs=4k
filename=/dev/sdq
rw=randread
[disk10]
bs=4k
filename=/dev/sdr
rw=randread
[disk11]
bs=4k
filename=/dev/sds
rw=randread
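Since the 12 `[diskN]` sections differ only in the device name, they can be generated rather than hand-written. This sketch prints the same stanzas for sdh through sds:

```shell
# Emit the 12 identical fio job sections (only the filename differs).
i=0
for dev in sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq sdr sds; do
  printf '[disk%d]\nbs=4k\nfilename=/dev/%s\nrw=randread\n' "$i" "$dev"
  i=$((i+1))
done
```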