Microsoft diskspd Part 3. Oddities and FAQ

Many storage performance testers are familiar with vdbench, and wish to use it to test Hyper-Converged (HCI) performance. To accurately performance test HCI you need to deploy workloads on all HCI nodes. However, deploying multiple VMs and coordinating vdbench can be tricky, so with X-ray we provide an easy way to run vdbench at scale. Here’s how to do it.

Continue reading →

How to identify NVME drive types and test throughput

Posted on March 17, 2020September 7, 2022 by gary

Dmitry Nosachev / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)

Continue reading →

Why does my SSD not issue 1MB IO’s?

Posted on March 13, 2020April 3, 2024 by gary

First things First

https://commons.wikimedia.org/wiki/File:CDC9762-smd-drive.jpg — CDC 9762 SMD disk drive from 1974

Why do we tend to use 1MB IO sizes for throughput benchmarking?

To achieve the maximum throughput on a storage device, we will usually use a large IO size to maximize the amount of data is transferred per IO request. The idea is to make the ratio of data-transfers to IO requests as large as possible to reduce the CPU overhead of the actual IO request so we can get as close to the device bandwidth as possible. To take advantage of and pre-fetching, and to reduce the need for head movement in rotational devices, a sequential pattern is used.

For historical reasons, many storage testers will use a 1MB IO size for sequential testing. A typical fio command line might look like something this.

fio --name=read --bs=1m --direct=1 --filename=/dev/sda

Continue reading →

How to identify SSD types and measure performance.

Posted on March 9, 2020January 3, 2023 by gary

Thomas Springer / CC0 — Generic SSD Internal Layout

The real-world achievable SSD performance will vary depending on factors like IO size, queue depth and even CPU clock speed. It’s useful to know what the SSD is capable of delivering in the actual environment in which it’s used. I always start by looking at the performance claimed by the manufacturer. I use these figures to bound what is achievable. In other words, treat the manufacturer specs as “this device will go no faster than…”.

Identify SSD

Start by identifying the exact SSD type by using lsscsi. Note that the disks we are going to test are connected by ATA transport type, therefore the maximum queue depth that each device will support is 32.

# lsscsi 
 [1:0:0:0]    cd/dvd  QEMU     QEMU DVD-ROM     2.5+  /dev/sr0 
 [2:0:0:0]    disk    ATA      SAMSUNG MZ7LM1T9 404Q  /dev/sda 
 [2:0:1:0]    disk    ATA      SAMSUNG MZ7LM1T9 404Q  /dev/sdb 
 [2:0:2:0]    disk    ATA      SAMSUNG MZ7LM1T9 404Q  /dev/sdc 
 [2:0:3:0]    disk    ATA      SAMSUNG MZ7LM1T9 404Q  /dev/

The marketing name for these Samsung SSD’s is “SSD 850 EVO 2.5″ SATA III 1TB“

Identify device specs

The spec sheet for this ssd claims the following performance characteristics.

Workload (Max)	Spec	Measured
Sequential Read (QD=8)	540 MB/s	534
Sequential Write (QD=8)	520 MB/s	515
Read IOPS 4KB (QD=32)	98,000	80,00
Write IOPS 4KB (QD=32)	90,000	67,000

Continue reading →

Quick & Dirty Prometheus on OS-X

Posted on February 7, 2020January 3, 2023 by gary

How to install Prometheus on OS-X

Install prometheus

Download the compiled prometheus binaries from prometheus.io
Unzip the binary and cd into the directory.
Run the prometheus binary, from the command line, it will listen on port 9090

$ cd /Users/gary.little/Downloads/prometheus-2.16.0-rc.0.darwin-amd64
$ ./prometheus

From a local browser, point to localhost:9090

Add a collector/scraper to monitor the OS

Prometheus itself does not do much apart from monitor itself, to do anything useful we have to add a scraper/exporter module. The easiest thing to do is add the scraper to monitor OS-X itself. As in Linux the OS exporter is simply called “node exporter”.

Start by downloading the pre-compiled darwin node exporter from prometheus.io

Unzip the tar.gz
cd into the directory
run the node exporter

$ cd /Users/gary.little/Downloads/node_exporter-0.18.1.darwin-amd64
$ ./node_exporter
 INFO[0000] Starting node_exporter (version=0.18.1, branch=HEAD, revision=3db77732e925c08f675d7404a8c46466b2ece83e)  source="node_exporter.go:156"
 INFO[0000] Build context (go=go1.11.10, user=root@4a30727bb68c, date=20190604-16:47:36)  source="node_exporter.go:157"
 INFO[0000] Enabled collectors:                           source="node_exporter.go:97"
 INFO[0000]  - boottime                                   source="node_exporter.go:104"
 INFO[0000]  - cpu                                        source="node_exporter.go:104"
 INFO[0000]  - diskstats                                  source="node_exporter.go:104"
 INFO[0000]  - filesystem                                 source="node_exporter.go:104"
 INFO[0000]  - loadavg                                    source="node_exporter.go:104"
 INFO[0000]  - meminfo                                    source="node_exporter.go:104"
 INFO[0000]  - netdev                                     source="node_exporter.go:104"
 INFO[0000]  - textfile                                   source="node_exporter.go:104"
 INFO[0000]  - time                                       source="node_exporter.go:104"
 INFO[0000] Listening on :9100                            source="node_exporter.go:170""

Continue reading →

SQL Server uses only one NUMA Node with HammerDB

Posted on October 7, 2019October 8, 2020 by gary

Some versions of HammerDB (e.g. 3.2) may induce imbalanced NUMA utilization with SQL Server.

This can easily be observed with Resource monitor. When NUMA imbalance occurs one of the NUMA nodes will show much larger utilization than the other. E.g.

The cause and fix is well documented on this blog. In short HammerDB issues a short lived connection, for every persistent connection. This causes the SQL Server Round-robin allocation to send all the persistent worker threads to a single NUMA Node! To resolve this issue, simply comment out line #212 in the driver script.

Comment out this line to work-around the HammerDB NUMA imbalance problem.

If successful you will immediately see that the NUMA nodes are more balanced. Whether this results in more/better performance will depend on exactly where the bottleneck is.

HammerDB: Avoiding bottlenecks in client.

Posted on October 4, 2019January 3, 2023 by gary

How to avoid bottlenecks in the client generator when measuring database performance with HammerDB

Continue reading →

Measuring CPU performance with X-Ray and pgbench.

Posted on October 1, 2019January 3, 2023 by gary

An X-ray workload for measuring application density

Continue reading →

How to run vertica vioperf tool

Posted on September 6, 2019January 3, 2023 by gary

The vertica vioperf tool is used to determine whether the storage you are planning on using is fast enough to feed the vertica database. When I initially ran the tool, the IO performance reported by the tool and confirmed by iostat was much lower than I expected for the storage device (a 6Gbit SATA device capable of around 500MB/s read and write).

The vioperf tool runs on a linux host or VM and can be pointed at any filesystem just like fio or vdbench

Simple execution of vioperf writing to the location /vertica

vioperf --thread-count=8 --duration=120s  /vertica

Working Set Size

Unlike traditional IO generators vioperf does not allow you to specify the working-set size. The amount of data written is simply 1MB* Achieved IO rate * runtime. So, fast storage with long run-times will need a lot of capacity otherwise the tool simply fills the partition and crashes!

Measurement and goodness

The primary metric is MB/s Per-Core. The idea is that you give 1 Thread per core in the system, though there is nothing stopping you from using whatever –thread-count value you like.

Although the measure is throughput, the primary metric of (Throughput/Core) does not improve just by giving lots of concurrency. Concurrency is generated purely by the number of threads and since the measure of goodness is Throughput/Core (or per thread) it’s not possible to simply create throughput from concurrency alone.

Throughput compared to FIo

Compared to fio the reported throughput is lower for the same device and same degree of concurrency. Vertica continually writes, and extends the files so there is some filesystem work going on whereas fio is typically overwriting an existing file. If you observe iostat during the vioperf run you will see that the IO size to disk is different than what an fio run will generate. Again this is due to the fact that vioperf is continually extending the file(s) being written and so it needs to update filesystem metadata quite frequently. These small metadata updates skew the average IO size lower.

fio with 1MB IO and 1 thread

Notice the avgrq size is 1024 blocks (512KB) which is the maximum transfer size that this drive supports.

 fio --filename=/samsung/vertica/file --size=5g --bs=1m --ioengine=libaio --iodepth=1 --rw=write --direct=1 --name=samsung --create_on_open=0


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.16    0.00    3.40    0.00    0.00   92.43

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  920.00     0.00 471040.00  1024.00     1.40    1.53    0.00    1.53   1.02  93.80

Vertica IOstat 1 thread

Firstly we see that iostat reports much lower disk throughput than what we achieved with fio for the same offered workload (1MB IO size with 1 outstanding IO (1 thread).

Also notice that that although vioperf issues 1MB IO sizes (which we can see from strace) iostat does not report the same 1024 block transfers as we see when we run iostat during an fio run (as above).

In the vioperf case the small metadata writes that are needed to continually extend the file cause a average IO size than than overwriting an existing file. Perhaps that is the cause of the lower performance?

./vioperf --duration=300s --thread-count=1 /samsung/vertica

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.77    0.13    2.38    5.26    0.00   83.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  627.00     0.00 223232.00   712.06     1.02    1.63    0.00    1.63   0.69  43.20

strace -f ./vioperf --duration=300s --thread-count=1 --disable-crc /samsung/vertica
...
[pid  1350] write(6, "v\230\242Q\357\250|\212\256+}\224\270\256\273\\\366k\210\320\\\330z[\26[\6&\351W%D"..., 1048576) = 1048576
[pid  1350] write(6, "B\2\224\36\250\"\346\241\0\241\361\220\242,\207\231.\244\330\3453\206'\320$Y7\327|5\204b"..., 1048576) = 1048576
[pid  1350] write(6, "\346r\341{u\37N\254.\325M'\255?\302Q?T_X\230Q\301\311\5\236\242\33\1)4'"..., 1048576) = 1048576
[pid  1350] write(6, "\5\314\335\264\364L\254x\27\346\3251\236\312\2075d\16\300\245>\256mU\343\346\373\17'\232\250n"..., 1048576) = 1048576
[pid  1350] write(6, "\272NKs\360\243\332@/\333\276\2648\255\v\243\332\235\275&\261\37\371\302<\275\266\331\357\203|\6"..., 1048576) = 1048576
[pid  1350] write(6, "v\230\242Q\357\250|\212\256+}\224\270\256\273\\\366k\210\320\\\330z[\26[\6&\351W%D"..., 1048576) = 1048576
...

However, look closely and you will notice that the %user is higher than fio for a lower IO rate AND the disk is not 100% busy. That seems odd.

./vioperf --duration=300s --thread-count=1 /samsung/vertica

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.77    0.13    2.38    5.26    0.00   83.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  627.00     0.00 223232.00   712.06     1.02    1.63    0.00    1.63   0.69  43.20

vioperf with –disable-crc

Finally we disable the crc checking (which vioperf does by default) to get a higher throughput more similar to what we see with fio.

It turns out that the lower performance was not due to the smaller IO sizes (and additonal filesystem work) but was caused the CRC checking that the tool does to simulate the vertica application.

 ./vioperf --duration=300s --thread-count=1 --disable-crc /samsung/vertica

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.77    0.13    2.38    5.26    0.00   83.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  627.00     0.00 223232.00   712.06     1.02    1.63    0.00    1.63   0.69  43.20

Impact of Data locality on DB workloads.

Posted on July 23, 2019October 8, 2020 by gary

Do database workloads benefit from data locality?

Continue reading →

Duplicate IP issues with Linux and virtual machine cloning.

Posted on July 10, 2019January 3, 2023 by gary

TL;DR – Some modern Linux distributions use a newer method of identification which, when combined with DHCP can result in duplicate IP addresses when cloning VMs, even when the VMs have unique MAC addresses.

To resolve, do the following ( remove file, run the systemd-machine-id-setup command, reboot):

# rm /etc/machine-id

# systemd-machine-id-setup

# reboot

When hypervisor management tools make clones of virtual machines, the tools usually make sure to create a unique MAC address for every clone. Combined with DHCP, this is normally enough to boot the clones and have them receive a unique IP. Recently, when I cloned several Bitnami guest VMs which are based on Debian, I started to get duplicate IP addresses on the clones. The issue can be resolved manually by following the above procedure.

To create a VM template to clone from which will generate a new machine-id for very clone, simply create an empty /etc/machine-id file (do not rm the file, otherwise the machine-id will not be generated)

# echo "" |  tee /etc/machine-id

The machine-id man page is a well written explanation of the implementation and motivation.

View from Nutanix storage during Postgres DB benchmark

Posted on June 28, 2019March 15, 2020 by gary

Following on from the previous [1] [2] experiments with Postgres & pgbench. A quick look at how the workload is seen from the Nutanix CVM.

The Linux VM running postgres has two virtual disks:

One is taking transaction log writes.
The other is doing reads and writes from the main datafiles.

Since the database size is small (50% the size of the Linux RAM) – the data is mostly cached inside the guest, and so most reads do not hit storage. As a result we only see writes going to the DB files.

Additionally, we see that database datafile writes the arrive in a bursty fashion, and that these write bursts are more intense (~10x) than the log file writes.

Charts from Prometheus/Grafana showing IO rates seen from the perspective of the Linux guest VM

Despite the database flushes ocurring in bursts with a decent amount of concurrency the Nutanix CVM provides an average of 1.5ms write response time.

From the Nutanix CVM port 2009 handler, we can access the individual vdisk statistics. In this particular case vDisk 45269 is the data file disk, and 40043 is the database transaction log disk.

Datafile writes completed in 1.5millisecond average – despite deep queues during burst

The vdisk categorizer correctly identifies the database datafile write pattern as highly random.

Writes to the datbase datafiles are almost entirely random

As a result, the writes are passed into the replicated oplog

The burst of writes hits the oplog as expected

Meanwhile the log writes are categorized as mostly sequential, which is expected for a database log file workload.

Meanwhile, log file writes are mostly categorized as sequential.

Even though the log writes are sequential, they are low-concurrency and small size (looks like mostly 16K-32K). This write pattern is also a good candidate for oplog.

These low-concurrency log writes also hit oplog

Benchmarking with Postgres PT2

Posted on June 28, 2019March 15, 2020 by gary

In this example we run pgbench with a scale factor of 1000 which equates to a database size of around 15GB. The linux VM has 32G RAM, so we don’t expect to see many reads.

Using prometheus with the Linux node exporter we can see the disk IO pattern from pgbench. As expected the write pattern to the log disk (sda) is quite constant, while the write pattern to the database files (sdb) is bursty.

pgbench with DB size 50% of Linux buffer cache.

I had to tune the parameter checkpoint_completion_target from 0.5 to 0.9 otherwise the SCSI stack became overwhelmed during checkpoints, and caused log-writes to stall.

default pgbench – notice the sharp drop in log-writes before tuning.

Benchmarking with Postgres PT1

Posted on June 28, 2019March 15, 2020 by gary

In this example, we use Postgres and the pgbench workload generator to drive some load in a virtual machine. Assume a Linux virtual machine that has Postgres installed. Specifically using a Bitnami virtual appliance.

Once the VM has been started, connect to the console
Allow access to postgres port 5432 – which is the postgres DB port or allow ssh

$ sudo ufw allow 5432

Note the postgres user password (cat ./bitnami_credentials)
Login to psql from the console or ssh

psql -U postgres

Optionally change password (the password prompted is the one from bitnami_credentials for the postgres database user).

psql -U postgres
postgres=# alter user postgres with password 'NEW_PASSWORD';
postgresl=# \q

Create a DB to run the pgbench workload. In this case I name the db pgbench-sf10 for “Scale Factor 10”. Scale Factors are how the size of the database is determined.

$ sudo -u postgres createdb pgbench-sf10

Initialise the DB with data ready to run the benchmark. The “createdb” step just creates an empty schema.
- -i means “initialize”
- -s means “scale factor” e.g. 10
- pgbench-sf10 is the database schema to use. We use the one just created pgbench-sf10

$ sudo -u postgres pgbench -i -s 10 pgbench-sf10

Noe run a workload against the DB schema called pgbench-sf10

$ sudo -u postgres pgbench pgbench-sf10

The workload pattern, and load on the system will vary greatly depending on the scale factor.

Scale-Factor Working Set Size

1 23M
10 157M
100 1.7GB
1000 15GB
2500 37GB
5000 74GB
10000 147GB

Install a bitnami image to Nutanix AHV cluster.

Posted on June 28, 2019June 28, 2019 by gary

One of the nice things about using public cloud is the ability to use pre-canned application virtual appliances created by companies like Bitnami.

We can use these same appliance images on Nutanix AHV to easily do a Postgres database benchmark

Step 1. Get the bitnami image

wget  https://bitnami.com/redirect/to/587231/bitnami-postgresql-11.3-0-r56-linux-debian-9-x86_64.zip

Step 2. Unzip the file and convert the bitnami vmdk images to a single qcow2[1] file.

qemu-img convert *vmdk bitnami.qcow2

Put the bitnami.qcow2 image somewhere accessible to a browser, connected to the Prism service, then upload using the “Image Configuration”

Once the image is uploaded, it’s time to create a new VM based on that image

Once booted, you’ll see the bitnami logo and you can configure the bitnami passwords, enable ssh etc. using the console.

Enable/disable ssh in bitnami images
Connecting to Postgres in bitnami images
Note – when you “sudo -c postgres <some-psql-tool> the password it is asking for is the Postgres DB password (stored in ./bitnami-credentials) not any unix user password.

Once connected to the appliance we can use postgres and pgbench to generate simplistic database workload.

[1] Do this on a Linux box somewhere. For some reason the conversion failed using my qemu utilities installed via brew. Importing OVAs direct into AHV should be available in the future.

Paper: A Nine year study of filesystem and storage benchmarking

Posted on May 6, 2019January 3, 2023 by gary

A 2007 paper, that still has lots to say on the subject of benchmarking storage and filesystems. Primarily aimed at researchers and developers, but is relevant to anyone about to embark on a benchmarking effort.

A Nine year study of filesystem and storage benchmarking Download

Use a mix of macro and micro benchmarks
Understand what you are testing, cached results are fine – as long as that is what you had intended.

The authors are clear on why benchmarks remain important:

“Ideally, users could test performance in their own settings using real work- loads. This transfers the responsibility of benchmarking from author to user. However, this is usually impractical because testing multiple systems is time consuming, especially in that exposing the system to real workloads implies learning how to configure the system properly, possibly migrating data and other settings to the new systems, as well as dealing with their respective bugs.”

We cannot expect end-users to be experts in benchmarking. It is out duty as experts to provide the tools (benchmarks) that enable users to make purchasing decisions without requiring years of benchmarking expertise.