We have started seeing misaligned partitions on Linux guests running certain HDFS distributions. How these partitions became misaligned is a bit of a mystery, because the only way I know of to do this on Linux is to create a partition using the old DOS format, like this (using -c=dos and -u=cylinders):
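A minimal sketch of that sequence, assuming a hypothetical device /dev/sdb (modern fdisk aligns partitions to 1 MiB by default, so these compatibility flags are what recreate the misalignment):

```
# Start fdisk in DOS-compatibility mode with cylinder units.
fdisk -c=dos -u=cylinders /dev/sdb
# Inside fdisk: n (new), p (primary), accept the defaults.
# In this mode the first partition starts at sector 63, which is
# not aligned to the 4 KiB / 1 MiB boundaries modern storage expects.

# Verify the start sector afterwards:
fdisk -l -u=sectors /dev/sdb
```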
Often we are presented with a vCenter screenshot and an observation that there are “high latency spikes”. In the example, the response time is indeed quite high, around 80 ms.
TL;DR A comparison of Paravirtual SCSI vs. emulated SCSI, with measurements. PVSCSI gives measurably better response times at high load.
During a performance debugging session, I noticed that the response time on two of the SCSI devices was much higher than on the others (a Linux host under VMware ESX). The difference was unexpected, since all the devices were part of the same stripe and running a uniform synthetic workload.
The immediate observation is that the queue length is higher on those devices, as is the wait time. All the devices reside on the same back-end storage, so the cause must be somewhere else. When I traced the devices back, it turned out that the “slow” devices were attached to LSI emulated controllers in ESX, whereas the “fast” devices were attached to paravirtual (PVSCSI) controllers.
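One way to trace a device back to its controller type (a sketch; the host number and device name here are assumptions) is to check the SCSI host's driver:

```
# List SCSI hosts and their drivers: VMware's LSI emulated
# controllers show up as mptspi/mptsas, PVSCSI as vmw_pvscsi.
lsscsi -H

# Or walk sysfs for a specific device to find its host adapter,
# then read that host's driver name:
readlink -f /sys/block/sdh/device
# ...ends in something like host2/target2:0:1/2:0:1:0 -- note the hostN part
cat /sys/class/scsi_host/host2/proc_name   # e.g. mptspi or vmw_pvscsi
```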
I was surprised to see how much difference using paravirtual (PV) SCSI drivers made to the guest response time once IOPS started to ramp up. In these plots the y-axis is iostat “await” time; the x-axis is time (each point is a 3-second average).
PVSCSI = Grey Dots
LSI Emulated SCSI = Red Dots
Lower is better.
Each plot is from a workload with a different offered IO rate: 8,000, 9,000 and 10,000 IOPS. The storage is able to meet these rates, although latency increases because there is a lot of outstanding IO. The workload is mixed read/write with bursts.
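The original load generator isn't shown; a rough fio approximation of one of these fixed offered rates (every parameter below is an assumption, not the original job) might look like:

```
# Mixed read/write at a capped offered rate; rate_iops holds the
# IOPS roughly steady so that response time, not throughput, is
# what varies as the controller queues build up.
fio --name=offered-rate --filename=/dev/sdh --direct=1 \
    --ioengine=libaio --rw=randrw --rwmixread=70 --bs=8k \
    --iodepth=32 --rate_iops=8000 --time_based --runtime=300 \
    --group_reporting
# Repeat with --rate_iops=9000 and --rate_iops=10000 for the
# other two plots.
```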
After converting sdh and sdi to PVSCSI, the response time is again uniform across all devices.
Many storage devices and filesystems treat blocks containing nothing but zeros in a special way, often short-circuiting reads from the back-end. This is normally a good thing, but it can cause odd results when benchmarking. It typically comes up when testing against raw devices that have been thin provisioned.
In this example, I have several disks attached to my Linux virtual machine. Some of these disks contain data, but some of them have never been written to.
When I run an fio test against the disks, we can clearly see that the response time is better for some than for others. Here’s the fio output…
and here is the output of iostat -x
The devices sdf, sdg and sdh are thin-provisioned disks that have never been written to. Their read response times are much lower, even though the actual storage is identical.
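A random-read job of the shape that exposes this difference (the device name, block size and queue depth are all assumptions, not the original test):

```
# Random reads against one device; on a never-written thin disk
# the backend can answer with zeros without touching media, so
# the reported latency looks unrealistically good.
fio --name=readtest --filename=/dev/sdf --direct=1 \
    --ioengine=libaio --rw=randread --bs=8k --iodepth=16 \
    --time_based --runtime=60 --group_reporting
```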
There are a few ways to detect that the data being read is all zeros.
Firstly, use a simple tool like Unix od or hd to dump out a small section of the disk device and see what it contains. In the example below I just take the first 1000 bytes and check whether there is any data.
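A check along those lines, assuming a device named /dev/sdf:

```
# Dump the first 1000 bytes; od collapses runs of identical lines
# into a single "*", so an all-zero region prints almost nothing.
od -A d -N 1000 -x /dev/sdf
# For a never-written thin device the output looks roughly like:
# 0000000 0000 0000 0000 0000 0000 0000 0000 0000
# *
# 0000992 0000 0000 0000 0000
# 0001000
```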
Secondly, see if your storage or filesystem has a way to show that it read zeros from the device. NDFS has a couple of ways of doing that; the easiest is to look at the 2009:/latency page for the stage “FoundZeroes”.
If your storage is returning zeros and making your benchmark results unrealistic, you will need to get some data onto the disks! Normally I just do a large sequential write with whatever benchmarking tool I am using. Both Iometer and fio will write “junk” (non-zero data) to disks when writing.
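A prefill pass of that sort, again assuming /dev/sdf:

```
# Sequentially overwrite the whole device so later reads hit real,
# non-zero data. fio fills write buffers with pseudo-random bytes
# by default (see the zero_buffers/refill_buffers options), so
# this writes "junk" rather than zeros.
fio --name=prefill --filename=/dev/sdf --direct=1 \
    --ioengine=libaio --rw=write --bs=1M --iodepth=8
```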