Impact of Paravirtual SCSI driver VS LSI Emulation with Data.

TL;DR  Comparison of Paravirtual SCSI Vs Emulated SCSI in with measurements.  PVSCSI gives measurably better response times at high load. 

During a performance debugging session, I noticed that the response time on two of the SCSI devices was much higher than the others (Linux host under vmware ESX).  The difference was unexpected since all the devices were part of the same stripe doing a uniform synthetic workload.

iostat output from the system under investigation.

 

The immediate observation is that queue length is higher, as is wait time.  All these devices reside on the same back-end storage so I am looking for something else.  When I traced back the devices it turned out that the “slow devices” were attached to LSI emulated controllers in ESX.  Whereas the “fast devices” are attached to para-virtual controllers.

I was surprised to see how much difference using para virtual (PV) SCSI drivers made to the guest response time once IOPS started to ramp up.  In these plots the y-axis is iostat “await” time.  The x-axis is time (each point is a 3 second average).

PVSCSI = Gey Dots
LSI Emulated SCSI = Red Dots
Lower is better.

 

Each plot is from a workload which  uses a different offered IO rate.  The offered rates are   8000,9000 and 10,000 the storage is able to meet the rates even though latency increases because there is a lot of outstanding IO.  The workload is mixed read/write with bursts.

8K IOPS9K IOPS10K IOPS

 

After converting sdh and sdi to PV SCSI the response time is again uniform across all devices.

10K IOPS PV

Here be Zeroes

zero-cup

Many storage devices/filesystems treat blocks containing nothing but zeros in a special way, often short-circuiting reads from the back-end.  This is normally a good thing but this behavior can cause odd results when benchmarking.  This typically comes up when testing against storage using raw devices that have been thin provisioned.

In this example, I have several disks attached to my linux virtual machine.  Some of these disks contain data, but some of them have never been written to.

When I run an fio test against the disks, we can clearly see that the response time is better for some than for others.  Here’s the fio output…

fio believes that it is issuing the same workload to all disks.
fio believes that it is issuing the same workload to all disks.

and here is the output of iostat -x

Screen Shot 2014-06-12 at 4.29.23 PM

The devices sdf, sdg and sdh are thin provisioned disks that have never been written to. The read response times are be much lower.  Even though the actual storage is identical.

There are a few ways to detect that the data being read is all zero’s.

Firstly use a simple tool like unix “od” or “hd”  to dump out a small section of the disk device and see what it contains.  In the example below I just take the first 1000 bytes and check to see if there is any data.

Screen Shot 2014-06-12 at 4.58.15 PM

Secondly, see if your storage/filesystem has a way to show that it read zeros from the device.  NDFS has a couple of ways of doing that.  The easiest is to look at the 2009:/latency page and look for the stage “FoundZeroes”.

 

If your storage is returning zeros and so making your benchmarking problematic, you will need to get some data onto the disks!  Normally I just do a large sequential write with whatever benchmarking tool that I am using.  Both IOmeter and fio will write “junk” to disks when writing.