scsi Archives | n0derunner

The question of why Nutanix uses SATA drive comes up sometimes, especially from customers who have experienced very poor performance using SATA on traditional arrays.

I can understand this anxiety. In my time at NetApp we exclusively used SAS or FC-AL drives in performance test work. At the time there was a huge difference in performance between SCSI and SATA. Even a few short years ago, FC typically spun at 15K RPM whereas SATA was stuck at about a 5K RPM, so experiencing 3X the rotational delay.

These days SAS and SATA are both available in 7200 RPM configurations, and these are the type we use in standard Nutanix nodes. In fact the SATA drives that we use are marketed by Seagate as “Nearline SAS” or NL-SAS. Mainly to differentiate them from the consumer grade SATA drives that are found in cheap laptops. There are hundreds of SAS Vs SATA articles on the web, so I won’t go over the theoretical/historical arguments.

SATA in Hybrid/Tiered Storage

In a Nutanix cluster the “heavy lifting” of IO is mainly done by the SSD’s – leaving the SATA drives to service the few remaining IO’s that miss the SSD tier. Under moderate load, the SATA spindles do pretty well, and since the SATA $/GB is only 60% of SAS. SATA seems like a good choice for mostly-cold data.

Let’s Experiment.

From a performance perspective, I decided to run a few experiments to see just how well SATA performs. In the test, the SATA drives are Nutanix standard drives “ST91000640NS” (Seagate, priced around $150). The comparable SAS drives are the same form-factor (2.5 Inch) “AL13SEB900” (Toshiba, priced at about $250 USD). These drives spin at 10K RPM. Both drives hold around 1TB.

There are three experiments per drive type to reveal the impact of seek-times. This is achieved using the “filesize” parameter of fio – which determines the LBA range to read. One thing to note, is that I use a queue-depth of one. Therefore IOPs can be calculated as simply 1/Response-Time (converted to seconds).

[global]
bs=8k
rw=randread 
iodepth=1 
ioengine=libaio 
time_based 
runtime=10 
direct=1 
filesize=1g 

[randread]
filename=/dev/sdf1

Random Distribution. SATA Vs SAS

Working Set Size	7.2K RPM SATA Response Time (ms)	10K RPM SAS Response Time (ms)
1 GB	5.5	4
100 GB	7.5	4.5
1000 GB	12.5	7

Zipf Distribution. SATA Only.

Working Set Size	Response Time (ms)
1000G	8.5

Somewhat intuitively as the working-set (seek) gets larger, the difference between “Real SAS” and “NL-SAS/SATA” gets wider. This is intuitive because with a 1GB working-set, the seek-time is close to zero, and so only the rotational delay (based on RPM) is a factor. In fact the difference in response time is the same as the difference in rotational speed (1:1.3).

Also (just for fun) I used the “random_distribution=zipf” function in fio to test the response time when reading across the entire range of the disk – but with a “hotspot” (zipf) rather than a uniform random read – which is pretty unrealistic.

In the “realistic” case – reading across the entire disk on the SATA drives shipped with Nutanix nodes is capable of 8.5 ms response time at 125 IOPS per spindle.

Conclusion

The performance difference between SAS and SATA is often over-stated. At moderate loads SATA performs well enough for most use-cases. Even when delivering fully random IO over the entirety of the disk – SATA can deliver 8K in less than 15ms. Using a more realistic (not 100% random) access pattern the response time is < 10ms.

For a properly sized Nutanix implementation, the intent is to service most IO from Flash. It’s OK to generate some work on HDD from time-to-time even on SATA.

TL;DR Comparison of Paravirtual SCSI Vs Emulated SCSI in with measurements. PVSCSI gives measurably better response times at high load.

During a performance debugging session, I noticed that the response time on two of the SCSI devices was much higher than the others (Linux host under vmware ESX). The difference was unexpected since all the devices were part of the same stripe doing a uniform synthetic workload.

iostat output from the system under investigation.

The immediate observation is that queue length is higher, as is wait time. All these devices reside on the same back-end storage so I am looking for something else. When I traced back the devices it turned out that the “slow devices” were attached to LSI emulated controllers in ESX. Whereas the “fast devices” are attached to para-virtual controllers.

I was surprised to see how much difference using para virtual (PV) SCSI drivers made to the guest response time once IOPS started to ramp up. In these plots the y-axis is iostat “await” time. The x-axis is time (each point is a 3 second average).

PVSCSI = Gey Dots
LSI Emulated SCSI = Red Dots
Lower is better.

Each plot is from a workload which uses a different offered IO rate. The offered rates are 8000,9000 and 10,000 the storage is able to meet the rates even though latency increases because there is a lot of outstanding IO. The workload is mixed read/write with bursts.

After converting sdh and sdi to PV SCSI the response time is again uniform across all devices.

10K IOPS PV

n0derunner

scsi

SATA on Nutanix. Some experimental data.

SATA in Hybrid/Tiered Storage

Let’s Experiment.

Random Distribution. SATA Vs SAS

Zipf Distribution. SATA Only.

Conclusion

Impact of Paravirtual SCSI driver VS LSI Emulation with Data.