Using rwmixread and rate_iops in fio

Creating a mixed read/write workload with fio can be a bit confusing. Assume we want to create a fixed rate workload of 100 IOPS split 70:30 between reads and writes.

Don’t mix rwmixread and rate_iops
TL;DR

Specify the rate directly with rate_iops=<read-rate>,<write-rate>; do not try to combine rwmixread with rate_iops. For the example above use:

rate_iops=70,30 

Additionally, older versions of fio exhibit problems when using rate_poisson with rate_iops. The version I was using, fio 3.7, did not exhibit the problem.
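For reference, here is a minimal job file along those lines. It is only a sketch: the device path, block size and runtime are placeholders for whatever you are actually testing.

[global]
ioengine=libaio
direct=1
rw=randrw
bs=8k
iodepth=8
time_based
runtime=60

[fixed-rate-70-30]
# hypothetical target device
filename=/dev/sdb
# 70 read IOPS and 30 write IOPS; fio derives the 70:30 mix
# from the two rates, so rwmixread is not needed
rate_iops=70,30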


Creating compressible data with fio.


Today I used fio to create some compressible data to test on my Nutanix nodes. I ended up using the following fio parameters to get what I wanted; a complete example job file follows the list below.

buffer_compress_percentage=50
refill_buffers
buffer_pattern=0xdeadbeef
  • buffer_compress_percentage does what you’d expect: it specifies how compressible the written data should be.
  • refill_buffers is required to make the compress percentage apply across the whole run. In other words, I want the entire file to be compressible by the buffer_compress_percentage amount.
  • buffer_pattern is a big one. Without setting a pattern, fio uses NULL bytes to achieve compressibility, and Nutanix (like many other storage vendors) suppresses runs of zeros, so the data reduction would mostly come from zero suppression rather than from compression.
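Putting those together, something like the following job file is a reasonable starting point. This is a sketch rather than a drop-in recipe: the filename, size and block size are placeholders.

[global]
ioengine=libaio
direct=1
rw=write
bs=1m
size=10g
# aim for roughly 50% compressible data
buffer_compress_percentage=50
# refill the IO buffers on every submit so the whole file is
# compressible by the requested amount
refill_buffers
# non-zero pattern so the savings come from compression,
# not zero suppression
buffer_pattern=0xdeadbeef

[write-compressible]
# hypothetical target file
filename=/path/to/testfile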

Much of this is well explained in the README for the latest version of fio.

Also NOTE: older versions of fio do not support many of the fancy data-creation flags, but will not alert you to the fact that they are being ignored. I spent quite a bit of time wondering why my data was not compressed until I downloaded and compiled the latest fio.

 

Specifying Drive letters with fio for Windows.


A simple fio file for using drive letters on Windows.

This will create a file called “fiofile” on the F:\ drive in Windows. Notice that the specification is “drive letter” “backslash” “colon” “filename”.

In fio terms we are “escaping” the “:” which fio traditionally uses as a file separator.

[global]
bs=1024k
size=1G
time_based
runtime=30
rw=read
direct=1
iodepth=8

[job1]
filename=F\:fiofile

To run IO to multiple drives, add the group_reporting flag to make the output more sane.

[global]
bs=1024k
size=1G
time_based
runtime=30
rw=read
direct=1
iodepth=8
group_reporting

[job1]
filename=F\:fiofile

[job2]
filename=G\:fiofile

Precompiled fio binaries for Windows are available for download.

To use physical drives (not formatted)

Use the format \\.\PhysicalDrive<N>.

First, find the correct drive number using diskpart -> list disk:

DISKPART> list disk

  Disk ###  Status         Size     Free     Dyn  Gpt
  --------  -------------  -------  -------  ---  ---
  Disk 0    Online           60 GB      0 B
  Disk 1    Online         1024 GB      0 B        *
  Disk 2    Online         1024 GB      0 B        *
  Disk 3    Online           50 GB      0 B        *
  Disk 4    Online           50 GB      0 B        *
  Disk 5    Online           50 GB      0 B        *
  Disk 6    Online           50 GB      0 B        *
  Disk 7    Online          600 GB  1024 KB
  Disk 8    Online          500 GB  1024 KB
  Disk 9    Online         1024 GB  1024 KB
  Disk 10   Online           77 GB  1024 KB

Then use the disk number you want in your fio file, e.g.:

[global]
ioengine=windowsaio
direct=1
time_based
runtime=60s
size=10g
[log]
rw=write
bs=4k
iodepth=1
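# raw access to Disk 10 from the diskpart listing above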
filename=\\.\PhysicalDrive10

Things to know when using vdbench.

Recently I found that vdbench was not giving me the amount of outstanding IO that I had intended to configure with the “threads=N” parameter. It turned out that on Linux, most of the filesystems (ext2, ext3 and ext4) do not support concurrent directIO, although they do support directIO. This was a bit of a shock coming from Solaris, which has had concurrent directIO since 2001.

All the Linux filesystems I tested allow multiple outstanding IOs if the IO is submitted using asynchronous IO (AKA asyncIO or AIO), but not when using multiple writer threads (except XFS). Unfortunately vdbench does not allow AIO since it tries to be platform agnostic.

fio, however, allows either threads or AIO to be used, so that’s what I used in the experiments below.

The column “fio QD” in the table below is the amount of outstanding IO, or queue depth, that is intended to be passed to the storage device. The column “iostat QD” is the actual queue depth seen by the device. The iostat QD is not 8 because the response time is so low that fio cannot issue the IOs quickly enough to maintain the intended queue depth.

Device               fio QD  fio QD Type  direct  iostat QD  ps -efT | grep fio | wc -l
/dev/sd              8       libaio       Yes     7          5
/dev/sd              8       Threads      Yes     7          12
ext2 fs (mke2fs)     8       Threads      Yes     1          12
ext2 fs (mke2fs)     8       libaio       Yes     7          5
ext3 (mkfs -t ext3)  8       Threads      Yes     1          12
ext3 (mkfs -t ext3)  8       libaio       Yes     7          5
ext4 (mkfs -t ext4)  8       Threads      Yes     1          12
ext4 (mkfs -t ext4)  8       libaio       Yes     7          5
xfs (mkfs -t xfs)    8       Threads      Yes     7          12
xfs (mkfs -t xfs)    8       libaio       Yes     7          5

At any rate, all is not lost – using raw devices (/dev/sdX) will give concurrent directIO, as will XFS. These issues are well known by Linux DB guys, and I found interesting articles from Percona and Kevin Closson after I finally figured out what was going on with vdbench.

fio “scripts”

For the “threads” case.

[global]
bs=8k
ioengine=sync
iodepth=8
direct=1
time_based
runtime=60
numjobs=8
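# with ioengine=sync, concurrency comes from the 8 jobs; the iodepth
# setting above has no real effect for synchronous engines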
size=1800m

[randwrite-threads]
rw=randwrite
filename=/a/file1

For the “aio” case

[global]
bs=8k
ioengine=libaio
iodepth=8
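# libaio submits IO asynchronously, so this single job can keep
# up to 8 IOs outstanding on its own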
direct=1
time_based
runtime=60
size=1800m


[randwrite-aio]
rw=randwrite
filename=/a/file1

Impact of Paravirtual SCSI driver vs. LSI emulation, with data.

TL;DR: A comparison of Paravirtual SCSI vs. emulated SCSI, with measurements. PVSCSI gives measurably better response times at high load.

During a performance debugging session, I noticed that the response time on two of the SCSI devices was much higher than the others (a Linux host under VMware ESX). The difference was unexpected since all the devices were part of the same stripe doing a uniform synthetic workload.

iostat output from the system under investigation.

 

The immediate observation is that queue length is higher, as is wait time. All these devices reside on the same back-end storage, so I am looking for something else. When I traced back the devices it turned out that the “slow” devices were attached to LSI emulated controllers in ESX, whereas the “fast” devices were attached to paravirtual controllers.

I was surprised to see how much difference using paravirtual (PV) SCSI drivers made to the guest response time once IOPS started to ramp up. In these plots the y-axis is iostat “await” time and the x-axis is time (each point is a 3-second average).

PVSCSI = grey dots
LSI emulated SCSI = red dots
Lower is better.

 

Each plot is from a workload which uses a different offered IO rate. The offered rates are 8,000, 9,000 and 10,000 IOPS; the storage is able to meet these rates even though latency increases because there is a lot of outstanding IO. The workload is mixed read/write with bursts.

Plots: 8K IOPS, 9K IOPS, 10K IOPS.

 

After converting sdh and sdi to PV SCSI the response time is again uniform across all devices.

Plot: 10K IOPS with PVSCSI.

Multiple devices/jobs in fio

If your underlying filesystems/devices have different response times (e.g. some devices are cached or on SSD while others are on spinning disk), then the behavior of fio can be quite different depending on how the config file is specified. Typically there are two approaches:

1) Have a single “job” that has multiple devices

2) Make each device a “job”

With a single job, the iodepth parameter will be the total iodepth for the job (not per device). If multiple jobs are used (with one device per job) then the iodepth value is per device.

Option 1 (a single job) results in roughly equal IO across disks regardless of response time. This is like having a volume manager or RAID device, in that the overall IO rate is limited by the slowest device.

For example, notice that even though the wait/response times are quite uneven (ranging from 0.8 ms to 1.5 ms) the r/s rate is quite even. You will notice, though, that the queue size varies a lot in order to achieve similar throughput in the face of uneven response times.

(iostat output screenshot)

To get this sort of behavior use the following fio syntax. We allow fio to use up to 128 outstanding IOs to distribute amongst the 8 “files”, or in this case “devices”. In order to maintain the maximum throughput for the overall job, the devices with slower response times end up with more outstanding IOs than the devices with faster response times.

[global]
bs=8k
iodepth=128
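# total outstanding IO for the whole job, shared across all of the
# devices listed in filename below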
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
filesize=6G

[job1]
rw=randread
filename=/dev/sdb:/dev/sda:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg:/dev/sdh:/dev/sdi
name=random-read

The second option gives uneven throughput because each device is linked to a separate job and so is completely independent. The iodepth parameter is specific to each job, so every device has 16 outstanding IOs. The throughput (r/s) is directly tied to the response time of the specific device the job is working on, so response times that are 10x faster generate throughput that is 10x higher. For simulating real workloads this is probably not what you want.

For instance, when sizing working set and cache, the disks that have better throughput may dominate the cache.

(iostat output screenshot)

[global]
bs=8k
iodepth=16
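# applies per job, so each device below gets its own 16 outstanding IOs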
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
filesize=2G

[job1]
rw=randread
filename=/dev/sdb
name=raw=random-read
[job2]
rw=randread
filename=/dev/sda
name=raw=random-read
[job3]
rw=randread
filename=/dev/sdd
name=raw=random-read
[job4]
rw=randread
filename=/dev/sde
name=raw=random-read
[job5]
rw=randread
filename=/dev/sdf
name=raw=random-read
[job6]
rw=randread
filename=/dev/sdg
name=raw=random-read
[job7]
rw=randread
filename=/dev/sdh
name=raw=random-read
[job8]
rw=randread
filename=/dev/sdi
name=raw=random-read