High response time and low throughput in vCenter performance charts

Often we are presented with a vCenter screenshot and an observation that there are “high latency spikes”. In the example, the response time is indeed quite high: around 80ms.

Why is that? In this case there are bursts of large IOs with high queue depths: 32 outstanding IOs, each 1MB in size. For such a workload, 80ms is OK. By comparison, an 8K write with 1 outstanding IO would have a response time closer to 0.8ms. Anyhow, the workload here has a response time of 80ms, but the throughput from the application (in this case fio) is a reasonable 400MB/s.

The problem is that 400MB/s is not what vCenter reports. Depending on the burst duration, vCenter can vastly under-report the actual throughput. In the worst case, vCenter reports the 80ms latency against only ~17.5MB/s. Little’s Law tells us the expected rate is (1/0.08) × 32 = 400 IOPS, and since each IO is 1MB in size, we should see roughly 400MB/s.
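As a sanity check, the Little’s Law arithmetic can be written out in a few lines of Python. This is just a sketch using the numbers quoted above; nothing here comes from the screenshot itself.

# Little's Law: outstanding IOs = IOPS * response time,
# rearranged: IOPS = outstanding IOs / response time.

outstanding_ios = 32      # queue depth during the burst
response_time_s = 0.080   # 80ms per IO
io_size_mb = 1            # each IO is 1MB

iops = outstanding_ios / response_time_s   # 32 / 0.08 = 400 IOPS
mb_per_s = iops * io_size_mb               # 400 IOPS * 1MB = 400 MB/s
print(f"Expected: {iops:.0f} IOPS, {mb_per_s:.0f} MB/s")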

It turns out that vCenter averages the throughput over its 20-second sample interval, but the response time (since it is not a rate) is averaged over the number of IOs completed in that interval, not over the interval itself. So, where the burst is shorter than 20 seconds, the throughput is inaccurate (but the response time is accurate). This is not a criticism of vCenter; pretty much all monitoring software does this, including iostat at 1-second granularity when IO bursts last less than one second.
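To see why the two averaging schemes diverge, here is a minimal sketch, assuming an idealized burst that runs at full rate and then sits idle for the remainder of a 20-second vCenter sample:

# Throughput is averaged over the whole 20s sample window,
# but response time is averaged per-IO, so idle time dilutes
# only the throughput number.

sample_window_s = 20.0
response_time_s = 0.080   # 80ms per IO
outstanding_ios = 32
io_size_mb = 1.0

for burst_s in (1.0, 10.0, 20.0):
    ios = (outstanding_ios / response_time_s) * burst_s   # IOs completed in the burst
    actual_mb_s = ios * io_size_mb / burst_s              # rate while the burst runs
    reported_mb_s = ios * io_size_mb / sample_window_s    # what a 20s average shows
    print(f"{burst_s:>4.0f}s burst: actual {actual_mb_s:.0f} MB/s, "
          f"reported {reported_mb_s:.0f} MB/s, latency still 80ms")

A one-second burst comes out at ~20MB/s reported against 400MB/s actual, which is consistent with the ~17.5MB/s worst case above (a burst a bit shorter than a full second).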

Output from fio, our “application”.

Our “application”, which is fio, accurately records the achieved throughput and response times.

1 Second Burst

10 Second Burst


Author: gary

Performance hacker @ nutanix.com