Why does my SSD not issue 1MB IO’s?

Published: March 13, 2020 (Updated: April 3, 2024) in Storage Performance, linux, benchmarking, ssd, storage, kernel by gary.

First things First

https://commons.wikimedia.org/wiki/File:CDC9762-smd-drive.jpg — CDC 9762 SMD disk drive from 1974

Why do we tend to use 1MB IO sizes for throughput benchmarking?

To achieve the maximum throughput on a storage device, we usually use a large IO size to maximize the amount of data transferred per IO request. The idea is to make the ratio of data transferred to IO requests as large as possible, which reduces the per-request CPU overhead so we can get as close to the device bandwidth as possible. To take advantage of pre-fetching, and to reduce the need for head movement in rotational devices, a sequential pattern is used.

For historical reasons, many storage testers use a 1MB IO size for sequential testing. A typical fio command line might look something like this:

fio --name=read --bs=1m --direct=1 --filename=/dev/sda

Identifying the actual IO size issued to the device

Even though we ask fio (or dd, or anything else for that matter) to issue a 1MB IO size, that does not mean that the actual IO size sent to disk will be 1MB. We can see the actual IO size with iostat. Take a look at the column “avgrq-sz” (average request size), which tells you the average size of the IOs sent to the device. Normally this is reported in “sectors”, which are 512 bytes in size.

8 Sectors = 4KB
128 Sectors = 64KB
1024 Sectors = 512KB
2048 Sectors = 1024KB (1MB)
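The conversion table above can be reproduced with a few lines of shell. This is just a minimal sketch; the function name sectors_to_kib is my own and is not part of iostat or any other tool.

```shell
#!/bin/sh
# Convert a sector count (512-byte sectors, as reported in avgrq-sz) to KiB.
# sectors_to_kib is a hypothetical helper name used only for this example.
sectors_to_kib() {
    # $1: number of 512-byte sectors
    echo $(( $1 * 512 / 1024 ))
}

# Re-create the table above.
for s in 8 128 1024 2048; do
    printf '%s sectors = %s KiB\n' "$s" "$(sectors_to_kib "$s")"
done
```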

Using iostat

In the output below, the requested IO size in fio is “1MB”, but the field avgrq-sz shows 1024, meaning that the average IO size going to the device is only 512K.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s  avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda              0.00     0.00   956.00    0.00  489472.00    0.00  1024.00     1.49    1.58    1.58    0.00   1.02  97.5
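As a cross-check on avgrq-sz, the same number can be derived from two other columns: dividing rkB/s by r/s gives the average read size in KB. A minimal sketch, assuming a read-only workload like the one above (avg_read_kib is a made-up helper name):

```shell
#!/bin/sh
# Average read size in KB = rkB/s divided by r/s.
# avg_read_kib is a hypothetical helper name used only for this example.
avg_read_kib() {
    # $1: r/s, $2: rkB/s (decimals are fine, iostat prints them that way)
    awk -v rps="$1" -v rkbs="$2" \
        'BEGIN { if (rps > 0) printf "%.0f\n", rkbs / rps; else print 0 }'
}

# Values taken from the iostat line above.
avg_read_kib 956.00 489472.00
```

With the values from the iostat line above this works out to 512, which agrees with the 1024 sectors shown in avgrq-sz.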

Why is there a discrepancy?

A number of parameters determine how large an IO will actually be when it is passed down to the device. The main parameter is max_sectors_kb. This parameter is set in the Linux kernel and determines the maximum size, in KB, of an IO that will be passed down. In my case that value is set to 512. Maybe that’s why I see 512KB (1024 sectors) in iostat.

cat /sys/block/sda/queue/max_sectors_kb
512

Root can increase the value of max_sectors_kb. Let’s try increasing the value to 1024 and see if that fixes things.

# echo 1024 > /sys/block/sda/queue/max_sectors_kb
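One thing worth knowing before writing to max_sectors_kb: the kernel will not accept a value larger than the read-only max_hw_sectors_kb in the same directory. Here is a minimal sketch of a wrapper that checks this first. The function name set_max_sectors_kb is my own invention, and it takes the queue directory as an argument so it can be tried against a scratch directory before pointing it at /sys.

```shell
#!/bin/sh
# Set max_sectors_kb, refusing values above the hardware limit.
# set_max_sectors_kb is a hypothetical helper name used only for this example.
# $1: queue directory (e.g. /sys/block/sda/queue), $2: requested value in KB
set_max_sectors_kb() {
    queue_dir=$1
    want=$2
    # max_hw_sectors_kb is read-only and is the upper bound for max_sectors_kb.
    hw_limit=$(cat "$queue_dir/max_hw_sectors_kb") || return 1
    if [ "$want" -gt "$hw_limit" ]; then
        echo "requested ${want}KB exceeds max_hw_sectors_kb (${hw_limit}KB)" >&2
        return 1
    fi
    echo "$want" > "$queue_dir/max_sectors_kb"
}

# Usage (as root):
#   set_max_sectors_kb /sys/block/sda/queue 1024
```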

When I re-run the same fio command, the IO size at the device is still 512K, even though max_sectors_kb is set to 1024.

fio --name=read --bs=1m --direct=1 --filename=/dev/sda

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     3.00  922.00    8.00 472064.00    48.00  1024.00     1.48    1.61    1.61    1.50   1.03  95.40

So, even though we changed max_sectors_kb to 1024 (KB), we did not change the IO size going to disk. Clearly something else is happening.

max_segments

It turns out that in this case, the reason for the 512K transfer size is that max_segments is set to 128.

A segment is yet another piece of the puzzle. It is not the same as a block/sector, which is usually 512 bytes. The value of max_segments is determined by probing the end device at boot time. Specifically, it relates to the number of scatter-gather memory buffers the device can handle in a single request. Since this is a device limitation, root cannot simply override it with a larger value.

cat /sys/block/sda/queue/max_segments

128

How does a max_segments of 128 relate to a 512K IO size? We need to know how large a segment is.

max_segment_size

To understand the segment size, we can start by looking at max_segment_size.

cat /sys/block/sda/queue/max_segment_size
65536

The max segment size is set to 64K, BUT the default (not the max) is the page size, which is 4K on Intel CPUs. So the limit in this case is max_segments (128) x segment size (4K) = 512K. That’s why we see 512K per IO in the block trace and in iostat.
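Putting the numbers together: in the worst case, where every segment is a single page, the largest IO is the smaller of max_sectors_kb and max_segments multiplied by the page size. A minimal sketch of that arithmetic follows. The helper name worst_case_io_kb is hypothetical, and the values are passed in by hand rather than read from /sys. On a live system the page size can be obtained with getconf PAGESIZE.

```shell
#!/bin/sh
# Worst-case largest IO in KB when every segment is a single page.
# worst_case_io_kb is a hypothetical helper name used only for this example.
# $1: max_segments, $2: page size in bytes, $3: max_sectors_kb
worst_case_io_kb() {
    by_segments=$(( $1 * $2 / 1024 ))
    # The effective limit is whichever of the two limits is smaller.
    if [ "$by_segments" -lt "$3" ]; then
        echo "$by_segments"
    else
        echo "$3"
    fi
}

# Values from this post: 128 segments, 4K pages, max_sectors_kb raised to 1024.
worst_case_io_kb 128 4096 1024
```

With the values from this post the result is 512, so raising max_sectors_kb on its own could never have helped.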

Extra credit: Validating iostat “avgrq-sz” with Block trace

iostat has one limitation in that it shows the average IO size. To see the individual IO sizes, we need to use the block trace feature.

In the example below, we see that each IO is using 1024 sectors (512K) so in our case the average is 512K and each IO is in fact 512K in size.

   8,0    2      138     1.205313710 15037  Q   R 1783347200 + 1024 [fio]
   8,0    2      139     1.205314576 15037  G   R 1783347200 + 1024 [fio]
   8,0    2      140     1.205316160 15037  P   N [fio]
   8,0    2      141     1.205328957 15037  A   R 1783348224 + 1024 <- (8,1) 1783346176
   8,0    2      142     1.205329128 15037  Q   R 1783348224 + 1024 [fio]
   8,0    2      143     1.205330029 15037  G   R 1783348224 + 1024 [fio]
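Reading a trace by eye gets tedious, so here is a minimal sketch that tallies IO sizes from blkparse-style output like the excerpt above. It assumes the default blkparse column layout (action in the 6th column, sector count in the 10th), and io_size_histogram is a name I made up. It counts Q (queued) events by default to match the excerpt; passing D instead counts requests actually issued to the driver, which is arguably the better measure of what reaches the device.

```shell
#!/bin/sh
# Count how many requests of each size appear in blkparse output on stdin.
# io_size_histogram is a hypothetical helper name used only for this example.
# $1: blkparse action to count (default Q, as in the excerpt above)
io_size_histogram() {
    awk -v action="${1:-Q}" '
        # Only lines of the form "... <action> <rwbs> <sector> + <count> ..."
        $6 == action && $9 == "+" { count[$10]++ }
        END {
            for (sectors in count)
                printf "%d sectors (%d KiB): %d requests\n",
                       sectors, sectors / 2, count[sectors]
        }'
}

# Feed it the trace lines from the excerpt above.
io_size_histogram <<'EOF'
   8,0    2      138     1.205313710 15037  Q   R 1783347200 + 1024 [fio]
   8,0    2      139     1.205314576 15037  G   R 1783347200 + 1024 [fio]
   8,0    2      140     1.205316160 15037  P   N [fio]
   8,0    2      141     1.205328957 15037  A   R 1783348224 + 1024 <- (8,1) 1783346176
   8,0    2      142     1.205329128 15037  Q   R 1783348224 + 1024 [fio]
   8,0    2      143     1.205330029 15037  G   R 1783348224 + 1024 [fio]
EOF

# On a live system the input would typically come from something like:
#   blktrace -d /dev/sda -o - | blkparse -i - | io_size_histogram
```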

Summary

The value of max_segments is not changeable, so (with this kernel) we are limited to 512K per IO on this SSD. However, with 8 outstanding IOs we get very close to the 540MB/s advertised on the spec sheet (532MB/s in the fio result below). So we can conclude that we don’t need 1024K per transfer to maximize throughput on this particular device.

READ: bw=507MiB/s (532MB/s), 507MiB/s-507MiB/s (532MB/s-532MB/s), io=29.7GiB (31.9GB), run=60063-60063msec

Update

It turns out that memory fragmentation plays a part here. That’s why on the same system, with the same disk and the same fio file, we sometimes see 1MB IOs and at other times we do not.

The fact that /sys/block/sda/queue/max_segments * 4KB == 512KB means that the “minimum” IO size will be 512KB if we ask for 1MB. That’s because a segment is either a single 4KB page or a “string” of physically contiguous 4KB pages. This in turn means that if the IO buffer contains many regions of memory that can be strung together, the IO submitted will be greater than 512KB (up to the 1MB that we ask for).
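To make the fragmentation effect concrete, here is a minimal sketch that counts how many segments a 1MB IO needs for a few segment sizes. It assumes, as a simplification, that every segment in the buffer is the same size. The name segments_needed is hypothetical.

```shell
#!/bin/sh
# Number of segments needed to describe one IO, rounding up.
# segments_needed is a hypothetical helper name used only for this example.
# $1: IO size in KB, $2: size of each segment in KB
segments_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 4 KB = single pages, 8 KB = pairs of contiguous pages, 64 KB = max_segment_size.
for seg_kb in 4 8 64; do
    printf '%s KB segments: %s needed for a 1024 KB IO\n' \
        "$seg_kb" "$(segments_needed 1024 "$seg_kb")"
done
```

With single 4KB pages a 1MB IO needs 256 segments, twice the max_segments limit of 128, so it has to be split. If the pages happen to be contiguous in pairs it needs exactly 128 and fits, and with 64KB segments it needs only 16.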

We can work around this by forcing fio to use hugepages. In other words, a segment is no longer limited to a single 4K page: it can cover contiguous memory inside a hugepage, up to max_segment_size.

In fio, we simply add the parameter iomem=shmhuge.

Of course we need to have some hugepages – so we will also need to create them. A simple method is to add this line to /etc/sysctl.conf

vm.nr_hugepages = 128

Then ask Linux to apply this with sysctl -p, and check that it worked using cat /proc/sys/vm/nr_hugepages.

With these settings in place, Linux will always honor the 1MB request from user-space, as long as there are free hugepages available. In fact you can do

cat /proc/meminfo |grep -i huge

and see the hugepage usage count increase when fio is spun up.
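If you would rather have a single number than eyeball the grep output, here is a minimal sketch that reports hugepages in use as HugePages_Total minus HugePages_Free. The function name hugepages_in_use is made up, and it accepts an alternative file path so it can be tried on a saved copy of /proc/meminfo. Note that pages which are reserved but not yet touched show up under HugePages_Rsvd rather than as a drop in HugePages_Free, so treat the result as an approximation.

```shell
#!/bin/sh
# Print the number of hugepages currently in use (total minus free).
# hugepages_in_use is a hypothetical helper name used only for this example.
# $1: meminfo file to read (defaults to /proc/meminfo)
hugepages_in_use() {
    awk '
        $1 == "HugePages_Total:" { total = $2 }
        $1 == "HugePages_Free:"  { free = $2 }
        END { print total - free }' "${1:-/proc/meminfo}"
}

# Usage: run once before starting fio and again while it is running.
#   hugepages_in_use
```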