## Cache behavior – How long will it take to fill my cache?

When benchmarking filesystems or storage, we need to understand the caching effects. Most often this involves filling the cache and reaching steady state. But how long will it take to fill a cache of a given size? The answer depends of course on the size of the cache, the IO size and the IO rate. So, to simpify let’s just say that a cache consists of some number of entries. For instance a 4GB cache would have 1 million 4KB entries. In my example this is simply a 1M entry cache.

In terms of time to fill the cache, it’s simpler to think about how many entries will need to be read before the cache is filled.

For a random workload, it will be more than 1M “reads”. Let’s see why.

The first read will be inserted into the cache, the second read will probably be inserted into the cache, but there is a small (1/1000000) chance that the second read will actually be already in the cache since it’s random. As the cache gets fuller – the chances of a given read already being present in cache increases. As a result it will take a lot more than 1 million reads to populate the entire cache with a random read workload.

The question is this. Is is possible to predict, how many “reads” it will take to fill the cahe?

## The experiment.

In this experiment, we create an array to represent the cache. It has 1M entries. Then using a random number generator, simulate the workload and measure how long it takes to populate the cahche.

### Results

After 1,000,000 “reads” there are 633,000 positive entries (entries that have data in them). So what happened to the other 367,000? The 367,000 represent cache “hits” on an existing entry. Since the read “workload” is 100% random, there is some chance that a subsequent read will be for an entry that is already cached. Over the life of 1,000,000 reads around 37% are for an entry that is already cached.

After 2,0000,000 reads the cache contains 864,000 entries. Another 1,000,000 reads yields 950,000.

The fuller that the cache becomes, the fewer new entries are added. Intuitively this makes sense because as the cache becomes more full, more of the “random reads” are satisfied by an existing cache entry.

In my experiments it takes about 17,000,000 “reads” to ensure that every cache entry is filled in a 1M entry cache. Here are the data for 19 runs.

 Iteration Positive Entries Empty Slots 1 631998 368002 2 864334 135666 3 950184 49816 4 981630 18370 5 993266 6734 6 997577 2423 7 999080 920 8 999660 340 9 999879 121 10 999951 49 11 999985 15 12 999996 4 13 999998 2 14 999998 2 15 999999 1 16 999999 1 17 1000000 0 18 1000000 0
• For 500,000 Entries it takes 15 iterations to fill all the entries.
• For 2,000,000 Entries it takes 19 iterations to fill all the entries.

Interestingly, the ratio of positive to empty entries after one iteration is always about 0.632:0.368

• 0.368 is roughly 1/e
• .632 is roughly 1-(1/e).

## SQL*Server on Nutanix. Force backups to HDD.

As an experiment, I wanted to (a) Create a HDD only container, and (b) measure the bandwidth I could achieve when backing up the SQL DB.  This was performed on a standard hybrid platform with only 4 HDD’s in the node.

First create a container, but add the special options “sequential-io-priority-order=DAS-SATA random-io-priority-order=DAS-SATA” which means that all IO will be directed to the HDD only. This also means that data on this container will never be migrated up. This is just fine for a backup that will hopefully never be read, and if it is – only once, sequentially.

Next create a vDisk in that container – this disk will contain the SQL Server backup data

Format and initialize the drive.

Add backup targets to the drive. Adding multiple targets increases throughput because SQL Server will generate 1-2 outstanding IO’s per target. I created 16 total targets (these are just files).

The first backup is a little slow (~64MB/s), because we’re creating the files. A second (and subsequent) backups go faster, around  120 MB/s writing directly to the HDD spindles on a single node with 4 HDDs.

This backup stream drives around 25MB/s per HDD spindle on the Nutanix node.  On a larger platform with more spindles – we could easily drive 500MB/s, and still skip SSD by writing directly to HDD.

Completed backup:

## Things to know when using vdbench.

Recently I found that vdbench was not giving me the amount of outstanding IO that I had intended to configure by using the “threads=N” parameter. It turned out that with Linux, most of the filesystems (ext2, ext3 and ext4) do not support concurrent directIO, although they do support directIO. This was a bit of a shock coming from Solaris which had concurrent directIO since 2001.

All the Linux filesystems I tested allow multiple outstanding IO’s if the IO is submitted using asynchronous IO (AKA asyncIO or AIO) but not when using multiple writer threads (except XFS). Unfortunately vdbench does not allow AIO since it tries to be platform agnostic.

fio however does allow either threads or AIO to be used and so that’s what I used in the experiments below.

The column fio QD is the amount of outstanding IO, or Queue Depth that is intended to be passed to the storage device. The column iostat QD is the actual Queue Depth seen by the device. The iostat QD is not “8” because the response time is so low that fio cannot issue the IO’s quickly enough to maintain the intended queue depth.

 Device fio QD fio QD Type direct iostat QD ps -efT | grep fio | wc -l /dev/sd 8 libaio Yes 7 5 /dev/sd 8 Threads Yes 7 12 ext2 fs (mke2fs) 8 Threads Yes 1 12 ext2 fs (mke2fs) 8 libaio Yes 7 5 ext3 (mkfs -t ext3) 8 Threads Yes 1 12 ext3 (mkfs -t ext3) 8 libaio Yes 7 5 ext4 (mkfs -t ext4) 8 Threads Yes 1 12 ext4 (mkfs -t ext4) 8 libaio Yes 7 5 xfs (mkfs -t xfs) 8 Threads Yes 7 12 xfs (mkfs -t xfs) 8 libaio Yes 7 5

At any rate, all is not lost – using raw devices (/dev/sdX) will give concurrent directIO, as will XFS. These issues are well known by Linux DB guys, and I found interesting articles from Percona and Kevin Closson after I finally figured out what was going on with vdbench.