Measuring CPU performance with X-Ray and pgbench.

Nutanix X-Ray is well known for being able to model IO/Storage workloads, but what about workloads that are CPU bound?

X-Ray can run Ansible scripts on the X-Ray worker VMs, and by doing so we are able to provision almost any application. For our purposes we are going to use Postgres DB and the built-in benchmarking tool PGbench. I have deliberately created a very small DB which fits into the VM memory and does almost no IO.

The X-ray workload files can be found here.

X-Ray interface to pgbench

Using standard X-Ray YAML we are able to pass custom parameters such as how many Postgres VMs to deploy, how many clients and how many threads for pgbench to run.

When X-Ray runs the workload the results are displayed in the X-Ray UI. This time though the metric is Database transactions per second not IOPS or Storage throughput.

PGbench Transactions per second are displayed in the X-ray UI.

By running a variety of experiments, altering the number of VMs running the workload, I was able to plot the aggregate transactions/s and the per-VM value, which as expected decreases once the host platform CPU is saturated.

I was able to use the CPU bound nature of this particular workload to take a look at scheduling and CPU usage characteristics of different hypervisors and CPU types. I found that one combination gave better performance at low loads, but the other combination generated better performance under higher loads.

Per VM transaction count drops sharply once the number of VMs exceeds the CPU capacity of the host.

These sorts of experiments are quite straight-forward using X-Ray and Ansible. A special shout-out goes to GV who created the custom exporter to send the postgres transaction/s results back to X-ray in realtime.

How to run vertica vioperf tool

The vertica vioperf tool is used to determine whether the storage you are planning on using is fast enough to feed the vertica database. When I initially ran the tool, the IO performance reported by the tool and confirmed by iostat was much lower than I expected for the storage device (a 6Gbit SATA device capable of around 500MB/s read and write).

The vioperf tool runs on a linux host or VM and can be pointed at any filesystem just like fio or vdbench

Simple execution of vioperf writing to the location /vertica

vioperf --thread-count=8 --duration=120s  /vertica

Working Set Size

Unlike traditional IO generators vioperf does not allow you to specify the working-set size. The amount of data written is simply 1MB* Achieved IO rate * runtime. So, fast storage with long run-times will need a lot of capacity otherwise the tool simply fills the partition and crashes!

Measurement and goodness

The primary metric is MB/s Per-Core. The idea is that you give 1 Thread per core in the system, though there is nothing stopping you from using whatever –thread-count value you like.

Although the measure is throughput, the primary metric of (Throughput/Core) does not improve just by giving lots of concurrency. Concurrency is generated purely by the number of threads and since the measure of goodness is Throughput/Core (or per thread) it’s not possible to simply create throughput from concurrency alone.

Throughput compared to FIo

Compared to fio the reported throughput is lower for the same device and same degree of concurrency. Vertica continually writes, and extends the files so there is some filesystem work going on whereas fio is typically overwriting an existing file. If you observe iostat during the vioperf run you will see that the IO size to disk is different than what an fio run will generate. Again this is due to the fact that vioperf is continually extending the file(s) being written and so it needs to update filesystem metadata quite frequently. These small metadata updates skew the average IO size lower.

fio with 1MB IO and 1 thread

Notice the avgrq size is 1024 blocks (512KB) which is the maximum transfer size that this drive supports.

 fio --filename=/samsung/vertica/file --size=5g --bs=1m --ioengine=libaio --iodepth=1 --rw=write --direct=1 --name=samsung --create_on_open=0


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.16    0.00    3.40    0.00    0.00   92.43

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  920.00     0.00 471040.00  1024.00     1.40    1.53    0.00    1.53   1.02  93.80

Vertica IOstat 1 thread

Firstly we see that iostat reports much lower disk throughput than what we achieved with fio for the same offered workload (1MB IO size with 1 outstanding IO (1 thread).

Also notice that that although vioperf issues 1MB IO sizes (which we can see from strace) iostat does not report the same 1024 block transfers as we see when we run iostat during an fio run (as above).

In the vioperf case the small metadata writes that are needed to continually extend the file cause a average IO size than than overwriting an existing file. Perhaps that is the cause of the lower performance?

./vioperf --duration=300s --thread-count=1 /samsung/vertica

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.77    0.13    2.38    5.26    0.00   83.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  627.00     0.00 223232.00   712.06     1.02    1.63    0.00    1.63   0.69  43.20
strace -f ./vioperf --duration=300s --thread-count=1 --disable-crc /samsung/vertica
...
[pid  1350] write(6, "v\230\242Q\357\250|\212\256+}\224\270\256\273\\\366k\210\320\\\330z[\26[\6&\351W%D"..., 1048576) = 1048576
[pid  1350] write(6, "B\2\224\36\250\"\346\241\0\241\361\220\242,\207\231.\244\330\3453\206'\320$Y7\327|5\204b"..., 1048576) = 1048576
[pid  1350] write(6, "\346r\341{u\37N\254.\325M'\255?\302Q?T_X\230Q\301\311\5\236\242\33\1)4'"..., 1048576) = 1048576
[pid  1350] write(6, "\5\314\335\264\364L\254x\27\346\3251\236\312\2075d\16\300\245>\256mU\343\346\373\17'\232\250n"..., 1048576) = 1048576
[pid  1350] write(6, "\272NKs\360\243\332@/\333\276\2648\255\v\243\332\235\275&\261\37\371\302<\275\266\331\357\203|\6"..., 1048576) = 1048576
[pid  1350] write(6, "v\230\242Q\357\250|\212\256+}\224\270\256\273\\\366k\210\320\\\330z[\26[\6&\351W%D"..., 1048576) = 1048576
...

However, look closely and you will notice that the %user is higher than fio for a lower IO rate AND the disk is not 100% busy. That seems odd.

./vioperf --duration=300s --thread-count=1 /samsung/vertica

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.77    0.13    2.38    5.26    0.00   83.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  627.00     0.00 223232.00   712.06     1.02    1.63    0.00    1.63   0.69  43.20

vioperf with –disable-crc

Finally we disable the crc checking (which vioperf does by default) to get a higher throughput more similar to what we see with fio.

It turns out that the lower performance was not due to the smaller IO sizes (and additonal filesystem work) but was caused the CRC checking that the tool does to simulate the vertica application.

 ./vioperf --duration=300s --thread-count=1 --disable-crc /samsung/vertica

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.77    0.13    2.38    5.26    0.00   83.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  627.00     0.00 223232.00   712.06     1.02    1.63    0.00    1.63   0.69  43.20

Impact of Data locality on DB workloads.

Effect of removing CPU constraints and maintaining data locality on a running DB instance.

In this video I migrate a Postgres DB running PGbench benchmark. The DB is running on a Host which is CPU constrained. Once the VM is migrated to a less busy host the transaction rate immediately increases from ~15,000 to ~20,000. As the DB continues to run on the new host – the Nutanix storage detects the access patterns and “localizes” the data that the DB is accessing. Over the subsequent minutes the transaction rate increases to ~30,000 TPS.

The variation in the transaction rate is due to the benchmark itself, the transaction rate is not expected to be uniform. Many different queries are executing in parallel, some hitting RAM cache, some hitting storage.

N.B The Postgres DB is totally un-tuned and contains purely default settings. The aim of the experiment was to see how data-locality might affect a running database workload, not to generate the maximum TPS.