Install a Bitnami image on a Nutanix AHV cluster.

One of the nice things about using the public cloud is the ability to use pre-canned application virtual appliances created by companies like Bitnami.

We can use these same appliance images on Nutanix AHV to easily run a Postgres database benchmark.

Step 1. Get the Bitnami image

wget https://bitnami.com/redirect/to/587231/bitnami-postgresql-11.3-0-r56-linux-debian-9-x86_64.zip

Step 2. Unzip the file and convert the Bitnami vmdk images to a single qcow2[1] file.

unzip bitnami-postgresql-11.3-0-r56-linux-debian-9-x86_64.zip
qemu-img convert -O qcow2 *.vmdk bitnami.qcow2
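Before uploading, a quick sanity check that the conversion produced a valid qcow2 image (qemu-img reports the format and virtual size):

qemu-img info bitnami.qcow2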

Put the bitnami.qcow2 image somewhere accessible to a browser connected to the Prism service, then upload it using “Image Configuration”.

Once the image is uploaded, it’s time to create a new VM based on that image.

Once booted, you’ll see the Bitnami logo, and you can configure the Bitnami passwords, enable ssh, etc. using the console.

  • Enable/disable ssh in Bitnami images (see the sketch below)
  • Connecting to Postgres in Bitnami images
Note – when you run “sudo -u postgres <some-psql-tool>”, the password it asks for is the Postgres DB password (stored in ./bitnami-credentials), not a Unix user password.
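To enable ssh (the first item above), a minimal sketch assuming a Debian-based Bitnami image, where sshd ships disabled via a marker file:

sudo rm -f /etc/ssh/sshd_not_to_be_run
sudo systemctl enable ssh
sudo systemctl start ssh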

Once connected to the appliance, we can use Postgres and pgbench to generate a simplistic database workload.
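A minimal sketch of generating load with pgbench (the database name, scale factor, and client counts below are arbitrary choices; the Postgres client binaries are assumed to be on the PATH, which on a Bitnami image typically means /opt/bitnami/postgresql/bin):

createdb -U postgres pgbench_test
pgbench -U postgres -i -s 100 pgbench_test          # initialize pgbench tables at scale factor 100 (roughly 1.5GB)
pgbench -U postgres -c 16 -j 4 -T 300 pgbench_test  # 16 clients, 4 worker threads, 5 minute run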

[1] Do this on a Linux box somewhere. For some reason the conversion failed using my qemu utilities installed via brew. Importing OVAs directly into AHV should be available in the future.

Creating compressible data with fio.


Today I used fio to create some compressible data to test on my Nutanix nodes.  I ended up using the following fio params to get what I wanted.


buffer_compress_percentage=50
refill_buffers
buffer_pattern=0xdeadbeef
  • buffer_compress_percentage does what you’d expect and specifies how compressible the data is.
  • refill_buffers is required to make the above compress percentage apply in the large. IOW, I want the entire file to be compressible by the buffer_compress_percentage amount.
  • buffer_pattern is a big one. Without setting this pattern, fio will use null bytes to achieve compressibility, and Nutanix, like many other storage vendors, suppresses runs of zeros, so the data reduction will mostly come from zero suppression rather than from compression. (See the command-line example below.)
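A minimal sketch putting those three options together on the fio command line (the target path, size, and block size are placeholder values, not the ones I used):

fio --name=compressible-fill --filename=/path/to/testfile --size=10g \
    --rw=write --bs=1m --direct=1 --ioengine=libaio \
    --buffer_compress_percentage=50 --refill_buffers --buffer_pattern=0xdeadbeef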

Much of this is well explained in the README for the latest version of fio.

Also NOTE: older versions of fio do not support many of the fancy data-creation flags, but will not alert you to the fact that they are being ignored. I spent quite a bit of time wondering why my data was not compressed until I downloaded and compiled the latest fio.
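Two quick checks before trusting the data-reduction numbers, assuming fio is on the PATH:

fio --version
fio --cmdhelp=buffer_compress_percentage    # prints the option's help text if this build supports it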


SuperScalin’: How I learned to stop worrying and love SQL Server on Nutanix.

TL;DR It’s pretty easy to get 1M SQL TPM running a TPC-C-like workload on a single Nutanix node. Use 1 vDisk for log files and 6 vDisks for data files. SQL Server needs enough CPU and RAM to drive it. I used 16 vCPUs and 64GB of RAM.

Running database servers on Nutanix is an increasing trend, and DBAs are naturally skeptical about moving their DBs to new platforms. I recently had the chance to run some DB benchmarks on a couple of nodes in our lab. My goal was to achieve 1M SQL transactions per node, and have that scale linearly across multiple nodes.


It turned out to be ridiculously easy to generate decent numbers using SQL Server. As a Unix and Oracle old-timer, it was a shock to me just how simple it is to throw up a SQL Server instance. In this experiment, I am using Windows Server 2012 and SQL Server 2012.

For the test DB I provision 1 disk for the SQL log files and 6 disks for the data files. Temp and the other system DB files are left unchanged. Nothing is tuned or tweaked on the Nutanix side; everything is set up as per standard best practices – no “benchmark specials”.

SQL Server TPCC Scaling

Load is being generated by HammerDB configured to run the OLTP database workload. I get a little over 1 million SQL transactions per minute (TPM) on a single Nutanix node. The scaling is more-or-less linear, yielding 4.2 million TPM with 4 Nutanix nodes, which fit in a single 2U chassis. Each node is running both the DB itself and the shared storage using NDFS. I stopped at 6 nodes, because that’s all I had access to at the time.

The thing that blew me away in this was just how simple it had been. Prior to using SQL Server, I had been trying to set up Oracle to do the same workload. It was a huge effort that took me back to the 1990s, configuring kernel parameters by hand – just to stand up the DB. I’ll come back to Oracle at a later date.

My SQL Server VM is configured with 16 vCPUs and 64GB of RAM, so that the VM itself has as many resources as possible and is not the bottleneck.

I use the following flags on SQL Server. In SQL terminology these are known as trace flags, which are set in the SQL console (I used “DBCC TRACESTATUS” to display the following). These are fairly standard and are mentioned in our best practice guide.

[DBCC TRACESTATUS output listing the enabled trace flags]
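For reference, the globally enabled trace flags can also be dumped from a command prompt (assuming a default local instance and Windows authentication):

sqlcmd -S localhost -E -Q "DBCC TRACESTATUS(-1)"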

One thing I did change from the norm was to set the target recovery time to 240 seconds, rather than let SQL Server determine the recovery time dynamically. I found that in the benchmarking scenario, SQL Server would not do any background flushing at all, and then would suddenly checkpoint a huge amount of data, which caused the TPM to fluctuate wildly. With the recovery time hard-coded to 240 seconds, the background page flusher keeps up with the incoming workload and does not need to issue huge checkpoints. My guess is that in real (non-benchmark) conditions SQL Server waits for the incoming work to drop off and issues the checkpoint at that time. Since my benchmark never backs off, SQL Server eventually has to issue the checkpoint.
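Setting the recovery target is a one-liner from sqlcmd or the SQL console; the database name here is a placeholder:

sqlcmd -S localhost -E -Q "ALTER DATABASE tpcc SET TARGET_RECOVERY_TIME = 240 SECONDS"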



Multiple devices/jobs in fio

If your underlying filesystems/devices have different response times (e.g. some devices are cached or are on SSD, while others are on spinning disk), then the behavior of fio can be quite different depending on how the fio config file is specified. Typically there are two approaches:

1) Have a single “job” that has multiple devices

2) Make each device a “job”

With a single job, the iodepth parameter is the total iodepth for the job (not per device). If multiple jobs are used (with one device per job), then the iodepth value is per device.

Option 1 (a single job) results in [roughly] equal IO across disks regardless of response time. This is like having a volume manager or RAID device, in that the overall op rate is limited by the slowest device.

For example, in the output below, notice that even though the wait/response times are quite uneven (ranging from 0.8ms to 1.5ms), the r/s rate is quite even. You will notice, though, that the queue size is quite variable, so as to achieve similar throughput in the face of uneven response times.

[Per-device statistics showing r/s, response time, and queue size for each disk]

To get this sort of behavior, use the following fio syntax. We allow fio to use up to 128 outstanding IOs to distribute amongst the 8 “files”, or in this case “devices”. In order to maintain the maximum throughput for the overall job, the devices with slower response times have more outstanding IOs than the devices with faster response times.

[global]
bs=8k
# 128 outstanding IOs in total, shared across all devices listed in the single job below
iodepth=128
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
filesize=6G

[job1]
rw=randread
# all eight devices belong to this one job, so fio distributes the iodepth across them
filename=/dev/sdb:/dev/sda:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg:/dev/sdh:/dev/sdi
name=random-read

The second option gives uneven throughput, because each device is linked to a separate job and so is completely independent. The iodepth parameter is specific to each job, so every device has 16 outstanding IOs. The throughput (r/s) is directly tied to the response time of the specific device that the job is working on, so response times that are 10x faster generate throughput that is 10x higher. For simulating real workloads this is probably not what you want.

For instance, when sizing working set and cache, the disks that have better throughput may dominate the cache.

[Per-device statistics showing r/s tracking each device’s response time]

[global]
bs=8k
# 16 outstanding IOs per job, i.e. per device, since each job below contains a single device
iodepth=16
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
filesize=2G

[job1]
rw=randread
filename=/dev/sdb
name=random-read
[job2]
rw=randread
filename=/dev/sda
name=random-read
[job3]
rw=randread
filename=/dev/sdd
name=random-read
[job4]
rw=randread
filename=/dev/sde
name=random-read
[job5]
rw=randread
filename=/dev/sdf
name=random-read
[job6]
rw=randread
filename=/dev/sdg
name=random-read
[job7]
rw=randread
filename=/dev/sdh
name=random-read
[job8]
rw=randread
filename=/dev/sdi
name=random-read