SQL Server uses only one NUMA Node with HammerDB

Some versions of HammerDB (e.g. 3.2) may induce imbalanced NUMA utilization with SQL Server.

This is easy to observe with Windows Resource Monitor: when NUMA imbalance occurs, one of the NUMA nodes shows much higher utilization than the other. For example:

Imbalanced NUMA usage by SQL Server.

The cause and fix are well documented on this blog. In short, HammerDB issues a short-lived connection for every persistent connection. Because SQL Server allocates incoming connections to NUMA nodes round-robin, the alternating short-lived connections cause all of the persistent worker threads to land on a single NUMA node! To resolve this issue, simply comment out line #212 in the driver script.

Comment out this line to work around the HammerDB NUMA imbalance problem.

If successful, you will immediately see that the NUMA nodes are more evenly balanced. Whether this translates into better performance depends on exactly where the bottleneck is.
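You can also verify the distribution from inside SQL Server itself: the node_affinity column of sys.dm_exec_connections records which NUMA node each connection landed on. A minimal check via sqlcmd (the server name and authentication flag are placeholders for your environment):

```sh
# Count SQL Server connections per NUMA node. Before the fix, almost all of
# the persistent HammerDB worker connections pile up on a single node.
# -S (server) and -E (Windows auth) are placeholders for your environment.
sqlcmd -S localhost -E -Q "
SELECT node_affinity, COUNT(*) AS connection_count
FROM sys.dm_exec_connections
GROUP BY node_affinity
ORDER BY node_affinity;"
```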

Balanced NUMA usage by SQL Server.

HammerDB: Avoiding bottlenecks in the client.

HammerDB is a great tool for running database benchmarks. However, it is very easy to create an artificial bottleneck in the client that yields a very poor benchmark result.

When setting up HammerDB to run against even a moderately powerful modern server, it is important to avoid displaying the client transaction output in the HammerDB UI.

In my case, just making this simple change increased my HammerDB results by over 6X. The reason is that HammerDB spends more time updating its own UI than it does sending transactions to the DB. When I run HammerDB, I select “Log Output to Temp” and “Use Unique Log Name”.

  • Either:
    • Un-check the “Show Output” box, or
    • Ensure that the results are logged to a file, not displayed on screen (the equivalent CLI settings are sketched below).
  • Otherwise the workload generator itself will become the bottleneck.
Show Output “Checked”
Show Output “Unchecked”
HammerDB with “Show Output” unchecked: SQL Server uses all of the CPU (99%).
HammerDB with “Show Output” checked: SQL Server uses only ~20% while the HammerDB driver (identified as wish86t) uses 99% of one CPU.
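If you drive HammerDB from its CLI rather than the GUI, the same settings are exposed as virtual-user options. A minimal sketch, assuming the HammerDB 3.x CLI option names (showoutput, logtotemp, unique) and an illustrative script file name:

```sh
# Sketch: suppress per-transaction output and log to a uniquely named temp file.
# The vuset option names assume the HammerDB 3.x CLI; the path is illustrative.
cat > vu_settings.tcl <<'EOF'
vuset showoutput 0   ;# same effect as un-checking "Show Output" in the GUI
vuset logtotemp 1    ;# "Log Output to Temp"
vuset unique 1       ;# "Use Unique Log Name"
EOF
./hammerdbcli auto vu_settings.tcl
```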

Measuring CPU performance with X-Ray and pgbench.

Nutanix X-Ray is well known for being able to model IO/Storage workloads, but what about workloads that are CPU bound?

X-Ray can run Ansible scripts on the X-Ray worker VMs, and by doing so we are able to provision almost any application. For our purposes we are going to use a Postgres database and its built-in benchmarking tool, pgbench. I deliberately created a very small database which fits into VM memory and does almost no IO.
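As a rough illustration of what the Ansible tasks end up running, initializing and driving such a memory-resident database with pgbench looks something like this (the scale factor, client count, and thread count are example values, not the ones from the workload files):

```sh
# Initialize a deliberately small pgbench database; scale factor 10 is roughly
# 150 MB, small enough to stay resident in VM memory so almost no IO is issued.
createdb pgbench
pgbench -i -s 10 pgbench
# Drive it CPU-bound: 16 client connections, 4 worker threads, 10 minutes.
pgbench -c 16 -j 4 -T 600 pgbench
```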

The X-Ray workload files can be found here.

X-Ray interface to pgbench

Using standard X-Ray YAML, we are able to pass custom parameters such as how many Postgres VMs to deploy, and how many clients and threads pgbench should run.
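I won't reproduce the full scenario file here, but the idea is that the tunables are declared as X-Ray vars with defaults that can be overridden per run. A minimal sketch, with illustrative variable names rather than the exact ones from the workload files:

```sh
# Sketch: per-run tunables declared in the X-Ray test YAML.
# Variable names are illustrative; see the actual workload files for the real ones.
cat > pgbench_vars.yml <<'EOF'
vars:
  postgres_vm_count:
    default: 4
  pgbench_clients:
    default: 16
  pgbench_threads:
    default: 4
EOF
```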

When X-Ray runs the workload, the results are displayed in the X-Ray UI. This time, though, the metric is database transactions per second rather than IOPS or storage throughput.

pgbench transactions per second displayed in the X-Ray UI.

By running a variety of experiments, altering the number of VMs running the workload, I was able to plot both the aggregate transactions/s and the per-VM value, which, as expected, decreases once the host CPU is saturated.

I was able to use the CPU-bound nature of this particular workload to look at the scheduling and CPU usage characteristics of different hypervisors and CPU types. I found that one combination gave better performance at low load, while the other generated better performance at higher loads.

Per-VM transaction count drops sharply once the number of VMs exceeds the CPU capacity of the host.

These sorts of experiments are quite straightforward using X-Ray and Ansible. A special shout-out goes to GV, who created the custom exporter that sends the Postgres transactions/s results back to X-Ray in real time.