fio versions < 3.3 may show inflated random write performance

Posted on February 15, 2023February 15, 2023 by gary

TL;DR

If your storage system implements inline compression, performance results with small IO size random writes with time_based and runtime may be inflated with fio versions < 3.3 due to fio generating unexpectedly compressible data when using fio’s default data pattern. Although unintuitive, performance can often be increased by enabling compression especially if the bottleneck is on the storage media, replication or a combination of both.

Therefore if you are comparing performance results generated using fio version < 3.3 and fio >=3.3 the random write performance on the same storage platform my appear reduced with more recent fio versions.

fio-3.3 was released in December 2017 but older fio versions are still in use particularly on distributions with long term (LTS) support. For instance Ubuntu 16, which is supported until 2026 ships with fio-2.2.10

Continue reading →

Specifying Drive letters with fio for Windows.

Posted on December 29, 2022December 29, 2022 by gary

fio on Windows

Download pre-compiled fio binary for Windows

Example fio windows file, single drive

This will create a 1GB file called fiofile on the F:\ Drive in Windows then read the file. Notice that the specification is “Driveletter” “Backslash” “Colon” “Filename”

In fio terms we are “escaping” the : which fio traditionally uses as a file separator.

[global]
bs=1024k
size=1G
time_based
runtime=30
rw=read
direct=1
iodepth=8

[job1]
filename=F\:fiofile

Continue reading →

Hunting for bandwidth on a consumer NVMe drive

Posted on December 22, 2022February 15, 2023 by gary

The Samsung SSD 970 EVO 500GB claims a sequential read bandwidth of 3400 MB/s this is a story of trying to achieve that number.

Continue reading →

Beware of tiny working-set-sizes when testing storage performance.

Posted on July 1, 2022September 7, 2022 by gary

I was recently asked to investigate why Nutanix storage was not as fast as a competing solution in a PoC environment. When I looked at the output from diskspd, the data didn’t quite make sense.

Continue reading →

Single threaded DB performance on Nutanix HCI

Posted on May 27, 2022January 3, 2023 by gary

tl;dr

A Nutanix cluster can persist a replicated write across two nodes in around 250 uSec which is critical for single-threaded DB write workloads. The performance compares very well with hosted cloud database instances using the same class of processor (db.r5.4xlarge in the figure below). The metrics below are for SQL insert transactions not the underlying IO.

Single threaded commit heavy insert rates. Latency as seen from SQL insert statement.

Continue reading →

Using rwmixread and rate_iops in fio

Posted on July 14, 2021December 29, 2022 by gary

Creating a mixed read/write workload with fio can be a bit confusing. Assume we want to create a fixed rate workload of 100 IOPS split 70:30 between reads and writes.

TL;DR

Specify the rate directly with rate_iops=<read-rate>,<write-rate> do not try to use rwmixread with rate_iops. For the example above use.

rate_iops=70,30

Additionally older versions of fio exhibit problems when using rate_poisson with rate_iops . fio version 3.7 that I was using did not exhibit the problem.

Continue reading →

Understanding fio norandommap and randrepeat parameters

Posted on May 6, 2021December 29, 2022 by gary

The parameters norandommap and randrepeat significantly change the way that repeated random IO workloads will be executed, and also can meaningfully change the results of an experiment due to the way that caching works on most storage system.

Continue reading →

How to identify NVME drive types and test throughput

Posted on March 17, 2020September 7, 2022 by gary

Dmitry Nosachev / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)

Continue reading →

How to identify SSD types and measure performance.

Posted on March 9, 2020January 3, 2023 by gary

Thomas Springer / CC0 — Generic SSD Internal Layout

The real-world achievable SSD performance will vary depending on factors like IO size, queue depth and even CPU clock speed. It’s useful to know what the SSD is capable of delivering in the actual environment in which it’s used. I always start by looking at the performance claimed by the manufacturer. I use these figures to bound what is achievable. In other words, treat the manufacturer specs as “this device will go no faster than…”.

Identify SSD

Start by identifying the exact SSD type by using lsscsi. Note that the disks we are going to test are connected by ATA transport type, therefore the maximum queue depth that each device will support is 32.

# lsscsi 
 [1:0:0:0]    cd/dvd  QEMU     QEMU DVD-ROM     2.5+  /dev/sr0 
 [2:0:0:0]    disk    ATA      SAMSUNG MZ7LM1T9 404Q  /dev/sda 
 [2:0:1:0]    disk    ATA      SAMSUNG MZ7LM1T9 404Q  /dev/sdb 
 [2:0:2:0]    disk    ATA      SAMSUNG MZ7LM1T9 404Q  /dev/sdc 
 [2:0:3:0]    disk    ATA      SAMSUNG MZ7LM1T9 404Q  /dev/

The marketing name for these Samsung SSD’s is “SSD 850 EVO 2.5″ SATA III 1TB“

Identify device specs

The spec sheet for this ssd claims the following performance characteristics.

Workload (Max)	Spec	Measured
Sequential Read (QD=8)	540 MB/s	534
Sequential Write (QD=8)	520 MB/s	515
Read IOPS 4KB (QD=32)	98,000	80,00
Write IOPS 4KB (QD=32)	90,000	67,000

Continue reading →

How to run vertica vioperf tool

Posted on September 6, 2019January 3, 2023 by gary

The vertica vioperf tool is used to determine whether the storage you are planning on using is fast enough to feed the vertica database. When I initially ran the tool, the IO performance reported by the tool and confirmed by iostat was much lower than I expected for the storage device (a 6Gbit SATA device capable of around 500MB/s read and write).

The vioperf tool runs on a linux host or VM and can be pointed at any filesystem just like fio or vdbench

Simple execution of vioperf writing to the location /vertica

vioperf --thread-count=8 --duration=120s  /vertica

Working Set Size

Unlike traditional IO generators vioperf does not allow you to specify the working-set size. The amount of data written is simply 1MB* Achieved IO rate * runtime. So, fast storage with long run-times will need a lot of capacity otherwise the tool simply fills the partition and crashes!

Measurement and goodness

The primary metric is MB/s Per-Core. The idea is that you give 1 Thread per core in the system, though there is nothing stopping you from using whatever –thread-count value you like.

Although the measure is throughput, the primary metric of (Throughput/Core) does not improve just by giving lots of concurrency. Concurrency is generated purely by the number of threads and since the measure of goodness is Throughput/Core (or per thread) it’s not possible to simply create throughput from concurrency alone.

Throughput compared to FIo

Compared to fio the reported throughput is lower for the same device and same degree of concurrency. Vertica continually writes, and extends the files so there is some filesystem work going on whereas fio is typically overwriting an existing file. If you observe iostat during the vioperf run you will see that the IO size to disk is different than what an fio run will generate. Again this is due to the fact that vioperf is continually extending the file(s) being written and so it needs to update filesystem metadata quite frequently. These small metadata updates skew the average IO size lower.

fio with 1MB IO and 1 thread

Notice the avgrq size is 1024 blocks (512KB) which is the maximum transfer size that this drive supports.

 fio --filename=/samsung/vertica/file --size=5g --bs=1m --ioengine=libaio --iodepth=1 --rw=write --direct=1 --name=samsung --create_on_open=0


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.16    0.00    3.40    0.00    0.00   92.43

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  920.00     0.00 471040.00  1024.00     1.40    1.53    0.00    1.53   1.02  93.80

Vertica IOstat 1 thread

Firstly we see that iostat reports much lower disk throughput than what we achieved with fio for the same offered workload (1MB IO size with 1 outstanding IO (1 thread).

Also notice that that although vioperf issues 1MB IO sizes (which we can see from strace) iostat does not report the same 1024 block transfers as we see when we run iostat during an fio run (as above).

In the vioperf case the small metadata writes that are needed to continually extend the file cause a average IO size than than overwriting an existing file. Perhaps that is the cause of the lower performance?

./vioperf --duration=300s --thread-count=1 /samsung/vertica

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.77    0.13    2.38    5.26    0.00   83.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  627.00     0.00 223232.00   712.06     1.02    1.63    0.00    1.63   0.69  43.20

strace -f ./vioperf --duration=300s --thread-count=1 --disable-crc /samsung/vertica
...
[pid  1350] write(6, "v\230\242Q\357\250|\212\256+}\224\270\256\273\\\366k\210\320\\\330z[\26[\6&\351W%D"..., 1048576) = 1048576
[pid  1350] write(6, "B\2\224\36\250\"\346\241\0\241\361\220\242,\207\231.\244\330\3453\206'\320$Y7\327|5\204b"..., 1048576) = 1048576
[pid  1350] write(6, "\346r\341{u\37N\254.\325M'\255?\302Q?T_X\230Q\301\311\5\236\242\33\1)4'"..., 1048576) = 1048576
[pid  1350] write(6, "\5\314\335\264\364L\254x\27\346\3251\236\312\2075d\16\300\245>\256mU\343\346\373\17'\232\250n"..., 1048576) = 1048576
[pid  1350] write(6, "\272NKs\360\243\332@/\333\276\2648\255\v\243\332\235\275&\261\37\371\302<\275\266\331\357\203|\6"..., 1048576) = 1048576
[pid  1350] write(6, "v\230\242Q\357\250|\212\256+}\224\270\256\273\\\366k\210\320\\\330z[\26[\6&\351W%D"..., 1048576) = 1048576
...

However, look closely and you will notice that the %user is higher than fio for a lower IO rate AND the disk is not 100% busy. That seems odd.

./vioperf --duration=300s --thread-count=1 /samsung/vertica

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.77    0.13    2.38    5.26    0.00   83.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  627.00     0.00 223232.00   712.06     1.02    1.63    0.00    1.63   0.69  43.20

vioperf with –disable-crc

Finally we disable the crc checking (which vioperf does by default) to get a higher throughput more similar to what we see with fio.

It turns out that the lower performance was not due to the smaller IO sizes (and additonal filesystem work) but was caused the CRC checking that the tool does to simulate the vertica application.

 ./vioperf --duration=300s --thread-count=1 --disable-crc /samsung/vertica

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.77    0.13    2.38    5.26    0.00   83.46

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  627.00     0.00 223232.00   712.06     1.02    1.63    0.00    1.63   0.69  43.20

Working with fio “distribution /pereto” parameter

Posted on November 12, 2018January 4, 2019 by gary

The fio Pareto parameter allows us to create a workload, which references a very large dataset, but specify a hotspot for the access pattern. Here’s an example using the same setup as the ILM experiment, but using a Pareto value of 0:8. My fio file looks like this..

[global]
ioengine=libaio
direct=1
time_based
norandommap
random_distribution=pareto:0.8
The experiment shows that with the access pattern as a Pareto ratio 0:8, meaning 20% of the overall dataset is “hot” the ILM process happens much faster as the hotspot is smaller, and is identified faster than a 100% uniform random access pattern. We would expect a similar shape for any sort of caching mechanism.

The return of misaligned IO

Posted on May 25, 2017January 3, 2023 by gary

We have started seeing misaligned partitions on Linux guests runnning certain HDFS distributions. How these partitions became mis-aligned is a bit of a mystery, because the only way I know how to do this on Linux is to create a partition using old DOS format like this (using -c=dos and -u=cylinders) Continue reading →

High Response time and low throughput in vCenter performance charts.

Posted on May 16, 2017September 7, 2022 by gary

Often we are presented with a vCenter screenshot, and an observation that there are “high latency spikes”. In the example, the response time is indeed quite high – around 80ms. Continue reading →

Creating compressible data with fio.

Posted on October 18, 2016July 13, 2019 by gary

Today I used fio to create some compressible data to test on my Nutanix nodes. I ended up using the following fio params to get what I wanted.

buffer_compress_percentage=50
refill_buffers
buffer_pattern=0xdeadbeef

buffer_compress_percentage does what you’d expect and specifies how compressible the data is
refill_buffers Is required to make the above compress percentage do what you’d expect in the large. IOW, I want the entire file to be compressible by the buffer_compress_percentage amount
buffer_pattern This is a big one. Without setting this pattern, fio will use Null bytes to achieve compressibility, and Nutanix like many other storage vendors will suppress runs of Zero’s and so the data reduction will mostly be from zero suppression rather than from compression.

Much of this is well explained in the README for latest version of fio.

Also NOTE Older versions of fio do not support many of the fancy data creation flags, but will not alert you to the fact that fio is ignoring them. I spent quite a bit of time wondering why my data was not compressed, until I downloaded and compiled the latest fio.

Things to know when using vdbench.

Posted on September 21, 2015July 13, 2019 by gary

Recently I found that vdbench was not giving me the amount of outstanding IO that I had intended to configure by using the “threads=N” parameter. It turned out that with Linux, most of the filesystems (ext2, ext3 and ext4) do not support concurrent directIO, although they do support directIO. This was a bit of a shock coming from Solaris which had concurrent directIO since 2001.

All the Linux filesystems I tested allow multiple outstanding IO’s if the IO is submitted using asynchronous IO (AKA asyncIO or AIO) but not when using multiple writer threads (except XFS). Unfortunately vdbench does not allow AIO since it tries to be platform agnostic.

fio however does allow either threads or AIO to be used and so that’s what I used in the experiments below.

The column fio QD is the amount of outstanding IO, or Queue Depth that is intended to be passed to the storage device. The column iostat QD is the actual Queue Depth seen by the device. The iostat QD is not “8” because the response time is so low that fio cannot issue the IO’s quickly enough to maintain the intended queue depth.

Device	fio QD	fio QD Type	direct	iostat QD	ps -efT \| grep fio \| wc -l
/dev/sd	8	libaio	Yes	7	5
/dev/sd	8	Threads	Yes	7	12
ext2 fs (mke2fs)	8	Threads	Yes	1	12
ext2 fs (mke2fs)	8	libaio	Yes	7	5
ext3 (mkfs -t ext3)	8	Threads	Yes	1	12
ext3 (mkfs -t ext3)	8	libaio	Yes	7	5
ext4 (mkfs -t ext4)	8	Threads	Yes	1	12
ext4 (mkfs -t ext4)	8	libaio	Yes	7	5
xfs (mkfs -t xfs)	8	Threads	Yes	7	12
xfs (mkfs -t xfs)	8	libaio	Yes	7	5

At any rate, all is not lost – using raw devices (/dev/sdX) will give concurrent directIO, as will XFS. These issues are well known by Linux DB guys, and I found interesting articles from Percona and Kevin Closson after I finally figured out what was going on with vdbench.

fio “scripts”

For the “threads” case.

[global]
bs=8k
ioengine=sync
iodepth=8
direct=1
time_based
runtime=60
numjobs=8
size=1800m

[randwrite-threads]
rw=randwrite
filename=/a/file1

For the “aio” case

[global]
bs=8k
ioengine=libaio
iodepth=8
direct=1
time_based
runtime=60
size=1800m


[randwrite-aio]
rw=randwrite
filename=/a/file1

Multiple devices/jobs in fio

Posted on June 17, 2014February 28, 2020 by gary

If your underlying filesystem/devices have different response times (e.g. some devices are cached – or are on SSD) and others are on spinning disk, then the behavior of fio can be quite different depending on how the fio config file is specified. Typically there are two approaches

1) Have a single “job” that has multiple devices

2) Make each device a “job”

With a single job, the iodepth parameter will be the total iodepth for the job (not per device) . If multiple jobs are used (with one device per job) then the iodepth value is per device.

Option 1 (a single job) results in [roughly] equal IO across disks regardless of response time. This is like having a volume manager or RAID device, in that the overall oprate is limited by the slowest device.

For example, notice that even though the wait/response times are quite uneven (ranging from 0.8 ms to 1.5ms) the r/s rate is quite even. You will notice though the that queue size is very variable so as to achieve similar throughput in the face of uneven response times.

To get this sort of behavior use the following fio syntax. We allow fio to use up to 128 outstanding IO’s to distribute amongst the 8 “files” or in this case “devices”. In order to maintain the maximum throughput for the overall job, the devices with slower response times have more outstanding IO’s than the devices with faster response times.

[global]
bs=8k
iodepth=128
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
filesize=6G

[job1]
rw=randread
filename=/dev/sdb:/dev/sda:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg:/dev/sdh:/dev/sdi
name=random-read

The second option, give an uneven throughput because each device is linked to a separate job, and so is completely independent. The iodepth parameter is specific to each device, so every device has 16 outstanding IO’s. The throughput (r/s) is directly tied to the response time of the specific device that it is working on. So response times that are 10x faster generate throughput that is 10x faster. For simulating real workloads this is probably not what you want.

For instance when sizing workingset and cache, the disks that have better throughput may dominate the cache.

[global]
bs=8k
iodepth=16
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
filesize=2G

[job1]
rw=randread
filename=/dev/sdb
name=raw=random-read
[job2]
rw=randread
filename=/dev/sda
name=raw=random-read
[job3]
rw=randread
filename=/dev/sdd
name=raw=random-read
[job4]
rw=randread
filename=/dev/sde
name=raw=random-read
[job5]
rw=randread
filename=/dev/sdf
name=raw=random-read
[job6]
rw=randread
filename=/dev/sdg
name=raw=random-read
[job7]
rw=randread
filename=/dev/sdh
name=raw=random-read
[job8]
rw=randread
filename=/dev/sdi
name=raw=random-read

Here be Zeroes

Posted on June 12, 2014January 3, 2023 by gary

zero-cup

Many storage devices/filesystems treat blocks containing nothing but zeros in a special way, often short-circuiting reads from the back-end. This is normally a good thing but this behavior can cause odd results when benchmarking. This typically comes up when testing against storage using raw devices that have been thin provisioned.

In this example, I have several disks attached to my linux virtual machine. Some of these disks contain data, but some of them have never been written to.

When I run an fio test against the disks, we can clearly see that the response time is better for some than for others. Here’s the fio output…

fio believes that it is issuing the same workload to all disks.

and here is the output of iostat -x

The devices sdf, sdg and sdh are thin provisioned disks that have never been written to. The read response times are be much lower. Even though the actual storage is identical.

There are a few ways to detect that the data being read is all zero’s.

Firstly use a simple tool like unix “od” or “hd” to dump out a small section of the disk device and see what it contains. In the example below I just take the first 1000 bytes and check to see if there is any data.

Secondly, see if your storage/filesystem has a way to show that it read zeros from the device. NDFS has a couple of ways of doing that. The easiest is to look at the 2009:/latency page and look for the stage “FoundZeroes”.

If your storage is returning zeros and so making your benchmarking problematic, you will need to get some data onto the disks! Normally I just do a large sequential write with whatever benchmarking tool that I am using. Both IOmeter and fio will write “junk” to disks when writing.