The return of misaligned IO

We have started seeing misaligned partitions on Linux guests running certain HDFS distributions. How these partitions became misaligned is a bit of a mystery, because the only way I know how to do this on Linux is to create a partition using the old DOS-compatible format, like this (using -c=dos and -u=cylinders):

$ sudo fdisk -c=dos -u=cylinders /dev/sdc

When taking the offered defaults, the result is that the partition begins 1536 bytes into the virtual disk, meaning that all writes appear offset by 1536 bytes. For example, a 1MB write that is intended to be at offset 104857600 is actually sent to the back-end storage at offset 104859136. Typically the underlying storage is aligned to 4K at a minimum, and often larger. In terms of performance this means that every write that should simply overwrite the existing data instead needs to read the first 1536 bytes of the block, then merge in the new data. On SSD systems this results in a CPU increase to do the extra work; on HDD systems the HDDs themselves become busier and throughput drops, sometimes drastically. In the experiment below, I can achieve only about 60% of the aligned performance when misaligned.
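
The offset arithmetic above can be sketched in a few lines of Python. This is only an illustration: it assumes a 4K (4096-byte) back-end block size, as the text suggests is typical, and the function name and defaults are made up for the example.

```python
PARTITION_START = 1536   # bytes; start of the misaligned partition in this post
BLOCK = 4096             # assumed back-end block size (4K)

def backend_view(guest_offset, length, part_start=PARTITION_START, block=BLOCK):
    """Return (backend_offset, head_bytes, tail_bytes) for one guest write.

    head_bytes is how far into a back-end block the write starts, and
    tail_bytes is how much of the last block is left uncovered. Either one
    being non-zero means existing data has to be read and merged.
    """
    start = guest_offset + part_start
    end = start + length
    head = start % block
    tail = (-end) % block
    return start, head, tail

# The 1MB write from the text, with and without the 1536-byte partition offset.
print(backend_view(104857600, 1024 * 1024))
print(backend_view(104857600, 1024 * 1024, part_start=0))
```

With the offset applied, the write begins 1536 bytes into a 4K block (the read-and-merge described above) and also ends partway through its last block. With a partition start of 0, both values are zero.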

Am I misaligned?

The key to telling is to look at the partition start offset. If the start offset is 1536B, then yes.

$ sudo parted -l
Model: VMware Virtual disk (scsi)
Disk /dev/sda: 2199GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number Start End Size Type File system Flags
 1 1536B 2199GB 2199GB primary
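
The same check can be scripted. The sketch below is a hypothetical helper, not part of parted: it takes one partition row as printed above and tests the Start column against an assumed 4K block size. It only accepts values printed in plain bytes (such as 1536B), because parted's larger units are rounded for display; asking parted to print in bytes (`unit B`) avoids that.

```python
def start_bytes(partition_row):
    """Return the Start column (second field) of a parted partition row, in bytes.

    Only exact byte values such as '1536B' are accepted; rounded units
    such as '1049kB' are rejected rather than guessed at.
    """
    field = partition_row.split()[1]
    if not (field.endswith("B") and field[:-1].isdigit()):
        raise ValueError(f"start offset {field!r} is not an exact byte count")
    return int(field[:-1])

def is_aligned(offset_bytes, block=4096):
    """True if the offset falls on a block boundary (4K assumed by default)."""
    return offset_bytes % block == 0

row = " 1 1536B 2199GB 2199GB primary"
offset = start_bytes(row)
print(offset, "aligned" if is_aligned(offset) else "misaligned")
```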

Performance Impact

fio shows the throughput to a misaligned partition as roughly 60% of that of the correctly aligned partition (129MiB/s versus 211MiB/s in the two runs below).

Starting 4 processes

wr1: (groupid=0, jobs=4): err= 0: pid=1515: Wed May 24 17:32:54 2017
 write: IOPS=210, BW=211MiB/s (221MB/s)(8192MiB/38913msec)
 slat (usec): min=42, max=337, avg=114.38, stdev=26.72
Starting 4 processes

wr1: (groupid=0, jobs=4): err= 0: pid=1414: Wed May 24 17:40:51 2017
 write: IOPS=129, BW=129MiB/s (135MB/s)(8192MiB/63472msec)
 slat (usec): min=38, max=243, avg=111.12, stdev=26.75

iostat shows interesting/weird behavior. What’s curious here is that while we do see the drop to roughly 60% in wsec/s, the await, r_await and w_await values would completely throw you off if you did not know what was happening. In the misaligned case, the w_await is lower than in the aligned case, which cannot be right. It’s as if the overall await average went down (which is expected, since with misalignment we do a small read for every large write) and the r_await and w_await are simply calculated pro rata from r/s and w/s.
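
To see why the combined await can fall even though the writes are no faster, here is a small sketch of a request-weighted average. The rates and latencies are invented for illustration and are not taken from the iostat output in this post; the sketch only shows the effect of mixing in many short reads, and does not explain the odd w_await values.

```python
def combined_await(r_per_s, r_await_ms, w_per_s, w_await_ms):
    """Average wait across all requests, weighted by request rate."""
    total = r_per_s + w_per_s
    if total == 0:
        return 0.0
    return (r_per_s * r_await_ms + w_per_s * w_await_ms) / total

# Hypothetical aligned case: writes only.
print(combined_await(0, 0, 200, 20))
# Hypothetical misaligned case: one short read for every (slower) write.
print(combined_await(130, 1, 130, 30))
```

In the second case the writes are slower, yet the combined figure is lower because half of the requests are short reads.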

Aligned: (IO destined for /dev/sd[cdef])

Misaligned  (IO destined for /dev/sd[cdef])

Observing in Nutanix: Correctly Aligned  (http://CVMIP:2009/vdisk_stats)

Observing in Nutanix: Misaligned  (http://CVMIP:2009/vdisk_stats)



High Response time and low throughput in vCenter performance charts.

Often we are presented with a vCenter screenshot, and an observation that there are “high latency spikes”.  In the example, the response time is indeed quite high – around 80ms.

Why is that? In this case there are bursts of large IOs with high queue depths: 32 outstanding IOs, each 1MB in size. For such a workload 80ms is OK. By comparison, for an 8K write with 1 outstanding IO, the response time would be closer to 0.8ms. Anyhow, the workload here has a response time of 80ms, but the throughput from the application (in this case fio) is a reasonable 400MB/s.

The problem is that 400MB/s is not what vCenter reports. Depending on the burst duration, vCenter can vastly under-report the actual throughput. In the worst case, vCenter reports the 80ms latency for only ~17.5MB/s. Little’s Law tells us that the expected rate is 32 × (1/0.08) = 400 IOPS, and since each IO is 1MB in size, we should see roughly 400MB/s.
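
The Little’s Law estimate can be written out as a short calculation. This is just the arithmetic from the paragraph above; the function name is made up for the example.

```python
def expected_throughput_mb_s(outstanding_ios, response_time_s, io_size_mb):
    """Little's Law: IOPS = outstanding IOs / response time; MB/s = IOPS * IO size."""
    iops = outstanding_ios / response_time_s
    return iops * io_size_mb

# 32 outstanding 1MB writes at 80ms each, as in the example workload.
print(expected_throughput_mb_s(32, 0.080, 1))
```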

It turns out that vCenter is averaging the throughput over the sample time of 20 seconds, but the response time (since it is not a rate) is averaged over the number of IOs in the time period, not over the time period itself. So, where the burst is shorter than 20 seconds, the throughput is inaccurate (but the response time is accurate). This is not a criticism of vCenter; pretty much all monitoring software does this (including iostat at 1-second granularity, where IO bursts are shorter than one second).
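
A rough model of this averaging, assuming a single burst that falls entirely inside one 20-second sample and an otherwise idle disk, looks like this. It is an idealized sketch and is not expected to reproduce the ~17.5MB/s figure exactly.

```python
def window_average(burst_rate_mb_s, burst_s, window_s=20):
    """Throughput a monitor would report when it averages over the whole window.

    Assumes one burst inside the window and no other IO. A burst longer
    than the window is capped at the window length.
    """
    active_s = min(burst_s, window_s)
    return burst_rate_mb_s * active_s / window_s

for burst in (1, 10, 20):
    print(burst, "second burst ->", window_average(400, burst), "MB/s reported")
```

The response time does not suffer from this because it is divided by the number of IOs completed, so the idle part of the window contributes nothing to the average.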

Output from fio, our “Application”.

Our “application”, fio, accurately records the achieved throughput and response times.

1 Second Burst
wr3: (g=0): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
Starting 1 process

wr3: (groupid=0, jobs=1): err= 0: pid=5046: Tue May 16 15:54:23 2017
 write: io=436224KB, bw=404285KB/s, iops=394, runt= 1079msec
 slat (usec): min=58, max=180, avg=129.55, stdev=18.61
 clat (msec): min=10, max=151, avg=80.58, stdev=16.41
 lat (msec): min=10, max=151, avg=80.71, stdev=16.41
10 Second Burst
wr3: (g=0): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
Starting 1 process

wr3: (groupid=0, jobs=1): err= 0: pid=5029: Tue May 16 15:54:00 2017
 write: io=4005.0MB, bw=407019KB/s, iops=397, runt= 10076msec
 slat (usec): min=57, max=242, avg=132.72, stdev=19.73
 clat (msec): min=17, max=161, avg=80.34, stdev= 7.05
 lat (msec): min=17, max=161, avg=80.47, stdev= 7.05
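
As a sanity check, the fio numbers above can be run through Little’s Law in the other direction: IOPS multiplied by average latency should come out close to the configured iodepth of 32. The helper name is made up for this sketch; the inputs are the iops and lat avg values from the two runs.

```python
def implied_queue_depth(iops, avg_latency_ms):
    """Little's Law: outstanding IOs = IOPS * average latency (in seconds)."""
    return iops * avg_latency_ms / 1000.0

# (iops, lat avg in msec) from the 1 second and 10 second bursts above.
for label, iops, lat_ms in (("1s burst", 394, 80.71), ("10s burst", 397, 80.47)):
    print(label, round(implied_queue_depth(iops, lat_ms), 1))
```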



Hyperconverged File Systems PT1 Taxonomy

One way of categorizing Hyperconverged filesystems (or any filesystem really) is by how data is distributed across the nodes, and the method used to track/retrieve that data. The following is based on knowledge of the internals of Nutanix and publicly available information for the other systems.



Implemented by

Distributed data & metadata



Random data distribution, hash-lookup (object store)



Data stored in HA-Pairs, Lookup by fingerprint



Random data distribution, Lookup by fingerprint


Pseudo Distributed

Data stored in HA pairs, Unified namespace via redirection

NetApp C-Mode
Nutanix uses a fully distributed metadata layer that allows the cluster to decide where to place data depending on the location of the VM accessing it. The data can move around to follow the VM. The Nutanix FS uses a lot of ideas from distributed systems research and implementation, rather than taking a classic filesystems approach and applying it to HCI.

Creating compressible data with fio.


Today I used fio to create some compressible data to test on my Nutanix nodes.  I ended up using the following fio params to get what I wanted.


  • buffer_compress_percentage does what you’d expect and specifies how compressible the data is.
  • refill_buffers is required to make the above compress percentage do what you’d expect across the whole file. In other words, I want the entire file to be compressible by the buffer_compress_percentage amount.
  • buffer_pattern is a big one. Without setting this pattern, fio will use null bytes to achieve compressibility. Nutanix, like many other storage vendors, will suppress runs of zeros, so the data reduction would mostly come from zero suppression rather than from compression.
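
As a rough sketch of how the three options fit together, the snippet below builds an fio command line. The option names are the ones described above; the job name, target file, block size, file size, compress percentage and pattern value are all placeholder assumptions for illustration, not the settings used in this post.

```python
import shlex

def fio_compressible_cmd(filename, compress_pct=50, pattern="0xdeadbeef", size="1g"):
    """Build an fio argument list that writes partly compressible, non-zero data.

    All values here are illustrative placeholders.
    """
    return [
        "fio",
        "--name=compressible",
        f"--filename={filename}",
        "--rw=write",
        "--bs=1m",
        f"--size={size}",
        f"--buffer_compress_percentage={compress_pct}",
        "--refill_buffers",
        f"--buffer_pattern={pattern}",
    ]

# Print the command instead of running it; the path is a hypothetical test file.
print(shlex.join(fio_compressible_cmd("/tmp/fio-compress-test")))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; paste the output into a shell on a test system to try it.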

Much of this is well explained in the README for the latest version of fio.

Also note: older versions of fio do not support many of the fancy data creation flags, but will not alert you to the fact that fio is ignoring them. I spent quite a bit of time wondering why my data was not compressed, until I downloaded and compiled the latest fio.