Notes on tuning postgres for cpu and memory benchmarking

Posted on October 18, 2024 by gary

Recently I wanted to measure the impact of NUMA placement and Hugepages on the performance of postgres running in a VM on a Nutanix node. To do this I needed to drive postgres to do real transactions but have very little jitter/noise from the filesystem and storage. After reading a lot of blogs I came up with a process and a set of postgresql.conf tuneables that allowed me to run the HammerDB TPROC-C workload (TPC-C like) with very low run-to-run variation: around 0.3% (standard deviation/mean).
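As a rough illustration of the variation figure (standard deviation divided by mean), the sketch below computes it for a handful of per-run throughput numbers. The values are invented for illustration and are not results from these runs; using the sample standard deviation is also an assumption.

```python
import statistics

def coefficient_of_variation(samples):
    """Return the sample standard deviation divided by the mean."""
    mean = statistics.mean(samples)
    return statistics.stdev(samples) / mean

# Hypothetical commits-per-minute values from five back-to-back runs.
runs = [100_000, 100_300, 99_800, 100_100, 99_900]
cv = coefficient_of_variation(runs)
print(f"variation: {cv:.2%}")
```

A value well under 1% means differences between configurations (for example NUMA placement) are much easier to separate from run-to-run noise.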

The tunings are not meant to represent best practices, and running repeatedly (without manually vacuuming or doing a restore) will create problems because I am disabling autovacuum (see this discussion with HammerDB author Steve Shaw here and here).

Results

I have put the benchmark results below – but the main point of this post is to discuss the method which allows me to generate very repeatable postgres benchmark results where I can drive the CPU/Memory to be the limiting bottleneck. The screenshot below shows 5 runs back-to-back. From top to bottom the output shows

SQL commits per minute
Database VM CPU usage per core
Memory bandwidth (from Intel PCM running on the AHV hypervisor host)
Database VM IO rates

Multiple benchmark runs with consistent low jitter results

Continue reading →

Effect of POSIX_FADV_SEQUENTIAL and POSIX_FADV_RANDOM on IO performance.

Posted on August 22, 2024 (updated August 23, 2024) by gary

Previously we looked at how the POSIX_FADV_DONTNEED hint influences the Linux page cache when doing IO via a filesystem. Here we take a look at two more filesystem hints: POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL.
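For readers who have not used these hints, here is a minimal sketch of applying them from Python via os.posix_fadvise (available on Linux). The helper name read_with_hint and the temporary 4MB file are assumptions for illustration; the post itself may use a different tool to issue the hints.

```python
import os
import tempfile

def read_with_hint(path, advice, block_size=1024 * 1024):
    """Open path, apply a posix_fadvise hint to the whole file, then read it."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and len=0 apply the advice to the entire file.
        os.posix_fadvise(fd, 0, 0, advice)
        total = 0
        while True:
            chunk = os.read(fd, block_size)
            if not chunk:
                break
            total += len(chunk)
        return total
    finally:
        os.close(fd)

# Hypothetical 4MB scratch file, just to have something to read.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (4 * 1024 * 1024))
    demo_path = tmp.name

sequential_bytes = read_with_hint(demo_path, os.POSIX_FADV_SEQUENTIAL)
random_bytes = read_with_hint(demo_path, os.POSIX_FADV_RANDOM)
os.unlink(demo_path)
print(sequential_bytes, random_bytes)
```

Both calls return the same data; the hint only changes how the kernel does readahead. According to the Linux posix_fadvise(2) man page, POSIX_FADV_SEQUENTIAL doubles the readahead window and POSIX_FADV_RANDOM disables readahead.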

Continue reading →

Using fio to read from Linux buffer-cache

Posted on August 12, 2024 (updated August 19, 2024) by gary

Sometimes we want to read from the Linux cache rather than the underlying device using fio. There are a couple of gotchas that might trip you up. Thankfully fio provides the required work-arounds.

TL;DR

To get this to work as expected (reads are serviced from buffer cache) – the best way is to use the option invalidate=0 in the fio file.

Continue reading →

Understanding QEMU devices

Posted on June 10, 2024 by gary

Not sure where I came across this, but it is an excellent description of QEMU (and virtualization in general). I am very much a fan of this style of technical communication as exemplified in this final summary paragraph (the full article is longer):

In summary, even though QEMU was first written as a way of emulating hardware memory maps in order to virtualize a guest OS, it turns out that the fastest virtualization also depends on virtual hardware: a memory map of registers with particular documented side effects that has no bare-metal counterpart. And at the end of the day, all virtualization really means is running a particular set of assembly instructions (the guest OS) to manipulate locations within a giant memory map for causing a particular set of side effects, where QEMU is just a user-space application providing a memory map and mimicking the same side effects you would get when executing those guest instructions on the appropriate bare metal hardware.

https://www.qemu.org/2018/02/09/understanding-qemu-devices/

Using Prometheus and Grafana to monitor a Nutanix Cluster.

Posted on May 17, 2024 (updated June 10, 2024) by gary

Using a small Python script we can liberate data from the “Analysis” page of Prism Element and send it to Prometheus, where we can combine cluster metrics with other data and view them all on some nice Grafana dashboards.

Continue reading →

A Nutanix / Prometheus exporter in bash

Posted on May 3, 2024 (updated May 6, 2024) by gary

Overview

For a fun afternoon project, how about a retro prometheus exporter using Apache/nginx, cgi-bin and bash!?

About prometheus format

A Prometheus exporter simply has to return a page with metric names and metric values in a particular format like below.

ntnx_bash{metric="cluster_read_iops"} 0
ntnx_bash{metric="cluster_write_iops"} 1

When you configure prometheus via prometheus.yml you’re telling prometheus to visit a particular IP:Port over HTTP and ask for a page called metrics. So if the “page” called metrics is a script, the script just has to print data in the expected format and prometheus will accept that as a basic “exporter”. The idea here is to write a very simple exporter in bash that connects to a Nutanix cluster, hits the stats API and returns IOPS data for a given container in the correct format.
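To make the expected format concrete, here is a minimal sketch of the formatting step only, written in Python rather than the bash used in the post. The function name format_metrics and the placeholder values are assumptions; a real exporter would fetch the values from the cluster stats API.

```python
def format_metrics(values, name="ntnx_bash"):
    """Render a dict of {metric label: value} in Prometheus text format."""
    lines = [f'{name}{{metric="{label}"}} {value}' for label, value in values.items()]
    return "\n".join(lines) + "\n"

# Placeholder values standing in for data returned by the stats API.
page = format_metrics({"cluster_read_iops": 0, "cluster_write_iops": 1})
print(page, end="")
```

Whatever produces this text, a CGI script or anything else, only has to write it to the HTTP response body for Prometheus to scrape it.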

Continue reading →

Linux memory monitoring (allocations Vs usage)

Posted on April 18, 2024 (updated April 23, 2024) by gary

How to use some of Linux’s standard tools and how different types of memory usage show up.

Examples of using malloc and writing to memory, with three use-cases for a simple process:

No memory allocation at all (no_malloc.c)
A call to malloc() but memory is not written to (malloc_only.c)
A call to malloc() and then memory is written to the allocated space (malloc_and_write.c)

In each case we run the example with a 64MB allocation so that we can see the usage from standard linux tools.

We do something like this

gary@linux:~/git/unixfun$ ./malloc_and_write 65536
Allocating 65536 KB
Allocating 67108864 bytes
The address of your memory is 0x7fa2829ff010
Hit <return> to exit

top
free
pmap
gdb
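The difference between allocating memory and actually using it can also be seen from inside a process. The sketch below is a Python stand-in for the C examples (it uses an anonymous mmap instead of malloc, and reads VmRSS from /proc/self/status, so it is Linux-only). The names rss_kb, SIZE and region are assumptions for illustration.

```python
import mmap

PAGE = mmap.PAGESIZE
SIZE = 64 * 1024 * 1024  # 64MB, matching the allocation size used in the text

def rss_kb():
    """Return this process's resident set size in KB from /proc/self/status."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found")

rss_start = rss_kb()
# Allocate only: an anonymous private mapping with no pages written yet.
region = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
rss_allocated = rss_kb()
# Write one byte to every page so the kernel must back each page with memory.
for offset in range(0, SIZE, PAGE):
    region[offset] = 1
rss_touched = rss_kb()
region.close()

print(f"after mmap: +{rss_allocated - rss_start} KB, "
      f"after writing: +{rss_touched - rss_allocated} KB")
```

The resident size should barely move after the allocation and grow by roughly the full 64MB only once the pages are written, which is the same effect the tools above show from outside the process.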

Continue reading →

Using iperf multi-stream may not work as expected

Posted on April 1, 2024 by gary

Running iperf with parallel threads

TL;DR – When running iperf with parallel threads/workers the -P option must be specified after the -c <target-IP> option. This is mentioned in the manpage, but some options (-t for instance) work in any order while others (specifically -P for parallel threads) definitely do not, which is a bit confusing.

For example – these two invocations of iperf give very different results

iperf -P 5 -c 10.56.68.97 (-P before -c): yields 20.4 Gbits/sec
iperf -c 10.56.68.97 -P 5 (-P after -c): yields 78.3 Gbits/sec

Continue reading →

mpstat has an option to show utilization per NUMA node

Posted on January 8, 2024 (updated January 10, 2024) by gary

Not sure how long this has been a thing, but I recently discovered that mpstat takes a -N option for “NUMA Node” that works in the same way as -P for “Processor”. e.g. $ mpstat -N 0,1 1 will show stats for NUMA nodes 0 and 1 every 1 second. Just like mpstat -P ALL shows all processors, mpstat -N ALL shows all NUMA nodes (and is easier to type).

The output looks like this

05:09:17 PM NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
05:09:18 PM    0    1.13    0.00    9.30    0.00    0.28    0.15    0.00   31.78    0.00   57.21
05:09:18 PM    1    0.40    0.00    8.03    0.00    0.28    1.05    0.00   31.34    0.00   58.78
^C

Average:    NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       0    0.80    0.00    8.56    0.00    0.27    0.11    0.00   36.49    0.00   53.78
Average:       1    0.49    0.00   10.02    0.00    0.32    1.01    0.00   25.13    0.00   63.03

Using mpstat -N it is quite easy to check how the workload is distributed among the NUMA nodes of a multi-socket machine.
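If you want a single number per node, the Average: rows can be reduced to "percent busy" (100 minus %idle). The following is a small sketch that parses the sample output shown above; the function name busy_per_node is an assumption, and it relies on %idle being the last column as in that output.

```python
SAMPLE = """\
Average:    NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       0    0.80    0.00    8.56    0.00    0.27    0.11    0.00   36.49    0.00   53.78
Average:       1    0.49    0.00   10.02    0.00    0.32    1.01    0.00   25.13    0.00   63.03
"""

def busy_per_node(text):
    """Map NUMA node number -> percent busy (100 - %idle) from 'Average:' rows."""
    busy = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a numbered Average: row.
        if len(fields) < 3 or fields[0] != "Average:" or not fields[1].isdigit():
            continue
        busy[int(fields[1])] = round(100.0 - float(fields[-1]), 2)
    return busy

node_busy = busy_per_node(SAMPLE)
print(node_busy)
```

In this sample node 0 is somewhat busier than node 1, mostly because of its higher %guest time.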

Running the ML-Perf Storage benchmark on Nutanix files.

Posted on September 15, 2023 (updated October 10, 2023) by gary

Some technical notes on our submission to the benchmark committee.

Background

For the past few months engineers from Nutanix have been participating in the MLPerf™ Storage benchmark, which is designed to measure the storage performance required for ML training workloads.

We are pleased with how well our general-purpose file-server has delivered against the demands of this high throughput workload.

Benchmark throughput and dataset

125,000 files
16.7 TB of data, around 30% of the usable capacity (no “short stroking”)
Filesize 57-213MB per file
NFS v4 over Ethernet
5GB/s delivered per compute node from single NFSv4 mountpoint
25GB/s delivered across 5 compute nodes from single NFSv4 mountpoint

The dataset was 125,000 files consuming 16.7 TB, with file sizes ranging from 57 MB to 213 MB. There is no temporal or spatial hotspot (meaning that the entire dataset is read), so there is no opportunity to cache the data in DRAM – the data is being accessed from NVMe flash media. For our benchmark submission we used standard NFSv4 over standard Ethernet, with the same Nutanix file-serving software that already powers everything from VDI users’ home directories to medical images and more. No InfiniBand or special purpose parallel filesystems were harmed in this benchmark submission.
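A quick back-of-the-envelope check of these figures is sketched below. It assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and a perfectly sustained read rate, so the derived numbers are estimates and not benchmark results.

```python
files = 125_000
dataset_bytes = 16.7e12   # 16.7 TB, decimal units assumed
per_node_gb_per_s = 5     # GB/s delivered per compute node
nodes = 5

# Average file size implied by the dataset size and file count.
avg_file_mb = dataset_bytes / files / 1e6
# Aggregate throughput across all compute nodes.
aggregate_gb_per_s = per_node_gb_per_s * nodes
# Time for one full pass over the dataset at the aggregate rate.
full_pass_seconds = dataset_bytes / (aggregate_gb_per_s * 1e9)

print(f"average file size: {avg_file_mb:.1f} MB")
print(f"aggregate throughput: {aggregate_gb_per_s} GB/s")
print(f"one full pass: {full_pass_seconds:.0f} s")
```

The implied average file size sits inside the stated 57-213 MB range, and the per-node and aggregate throughput figures are consistent with each other.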

Continue reading →

Viewing Nutanix cluster metrics in prometheus/grafana

Posted on July 31, 2023 (updated August 1, 2023) by gary

Using Nutanix API with prometheus push-gateway.

Many customers would like to view their cluster metrics alongside existing performance data using Prometheus/Grafana.

Currently Nutanix does not provide a native exporter for Prometheus to use as a datasource. However we can use the prometheus push-gateway and a simple script which pulls from the native APIs to get data into prometheus. From there we can use Grafana or anything that can connect to Prometheus.

The goal is to be able to view cluster metrics alongside other Grafana dashboards. For example show the current Read/Write IOPS that the cluster is delivering on a per container basis. I’m hard-coding IPs and username/passwords in the script which obviously is not production grade, so don’t do that.
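The push side of this approach can be sketched as follows. This is not the script from the post: the gateway address, job name, metric names and values are all hypothetical, and the request is only built, not sent. It relies on the Pushgateway accepting the text exposition format at /metrics/job/<job>.

```python
import urllib.request

def build_push_request(gateway, job, metrics):
    """Build (but do not send) a Pushgateway request for a dict of metrics."""
    # One "name value" line per metric, in the Prometheus text format.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    url = f"http://{gateway}/metrics/job/{job}"
    # PUT replaces all metrics previously pushed for this job grouping.
    return urllib.request.Request(url, data=body.encode(), method="PUT")

# Hypothetical values standing in for data pulled from the cluster API.
req = build_push_request(
    "localhost:9091",
    "nutanix_cluster",
    {"container_read_iops": 1200, "container_write_iops": 300},
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would perform the actual push.
```

Prometheus then scrapes the push-gateway like any other target, so the pushing script can run from cron or a simple loop.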

Continue reading →

Effects of CPU topology on sqlserver guests with AHV.

Posted on February 28, 2023 by gary

VM CPU Topology

The topology (layout) of virtual sockets/CPUs that AHV presents to the guest operating system will usually differ from the physical topology. This is expected because we typically present a subset of all cores to the guest VMs.

Usually it is the total number of vCPU given to the VM that matters, not the specific topology, but in the case of SQLserver running an analytical workload (a TPC-H like workload from HammerDB) the topology passed to the VM does make a difference: between 10% and 20% when measured by the total runtime.

[I think that the reason we see a difference here is that (a) the analytical workloads use hardly any storage bandwidth (I sized the database to fit in memory) and (b) there is probably a lot of cross-talk between the cores/memory as the DB engine issues parallel queries.]

At any rate we see that passing 20 cores as “20 sockets of 1 core” beats the performance of “1 socket with 20 cores” by a wide margin. The physical topology is two sockets with 20 cores each. Thankfully the better performing option is the default.

CPU Topology may make a difference for SQL server running analytical workloads.

Continue reading →

fio versions < 3.3 may show inflated random write performance

Posted on February 15, 2023 by gary

TL;DR

If your storage system implements inline compression, performance results with small IO size random writes with time_based and runtime may be inflated with fio versions < 3.3 due to fio generating unexpectedly compressible data when using fio’s default data pattern. Although unintuitive, performance can often be increased by enabling compression especially if the bottleneck is on the storage media, replication or a combination of both.

fio 2.8.1 vs fio 3.33 data patterns

Therefore, if you are comparing performance results generated using fio versions < 3.3 and fio >= 3.3, the random write performance on the same storage platform may appear reduced with more recent fio versions.

fio-3.3 was released in December 2017, but older fio versions are still in use, particularly on distributions with long term support (LTS). For instance Ubuntu 16, which is supported until 2026, ships with fio-2.2.10.
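One way to sanity-check how compressible a block of test data is: compress it and look at the ratio. The sketch below uses zlib purely as a stand-in (the storage system's inline compressor will behave differently), and the two buffers are synthetic examples, not data generated by fio.

```python
import os
import zlib

def compression_ratio(buf):
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(buf)) / len(buf)

BLOCK = 4096  # a small IO size, chosen for illustration
random_block = os.urandom(BLOCK)   # effectively incompressible
repeating_block = b"\xaa" * BLOCK  # highly compressible

print(f"random:    {compression_ratio(random_block):.2f}")
print(f"repeating: {compression_ratio(repeating_block):.2f}")
```

Running the same check over blocks read back from a file written by each fio version would show whether the data pattern is the reason for a difference in results.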

Continue reading →

Specifying Drive letters with fio for Windows.

Posted on December 29, 2022 by gary

fio on Windows

Download pre-compiled fio binary for Windows

Example fio windows file, single drive

This will create a 1GB file called fiofile on the F:\ Drive in Windows then read the file. Notice that the specification is “Driveletter” “Backslash” “Colon” “Filename”

In fio terms we are “escaping” the : which fio traditionally uses as a file separator.

[global]
bs=1024k
size=1G
time_based
runtime=30
rw=read
direct=1
iodepth=8

[job1]
filename=F\:fiofile
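
If job files are generated by a script, the escaping is easy to get wrong. Here is a tiny sketch of a helper that builds the filename value; the function name fio_windows_filename is hypothetical, and joining several files with an unescaped ":" relies on fio's use of ":" as the file separator mentioned above.

```python
def fio_windows_filename(drive_letter, name):
    r"""Return a fio filename for a file at the root of a drive, e.g. F\:fiofile."""
    # The backslash escapes the colon so fio does not treat it as a separator.
    return f"{drive_letter}\\:{name}"

single = fio_windows_filename("F", "fiofile")
# Several drives in one job: escaped names joined with fio's ":" separator.
multiple = ":".join(fio_windows_filename(d, "fiofile") for d in ("F", "G"))

print(f"filename={single}")
print(f"filename={multiple}")
```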

Continue reading →

Hunting for bandwidth on a consumer NVMe drive

Posted on December 22, 2022 (updated February 15, 2023) by gary

The Samsung SSD 970 EVO 500GB claims a sequential read bandwidth of 3400 MB/s. This is a story of trying to achieve that number.

Continue reading →

Database sizes for HammerDB TPC-C/ SQLserver

Posted on November 17, 2022 (updated December 29, 2022) by gary

The on-disk size for small databases, taken from SQLserver properties immediately after creating the TPC-C like schema in HammerDB and then using Tasks->Shrink->Database.

Warehouse Count	Database size
10	826 MB
100	8,057 MB

How to monitor SQLServer on Windows with Prometheus

Posted on October 17, 2022 (updated June 26, 2024) by gary

TL;DR

Enable SQLServer agent in SSMS
Install the Prometheus Windows exporter from GitHub (the installer is in the Assets section near the bottom of the page)
Install Prometheus scraper/database to your monitoring server/laptop via the appropriate installer
Point a browser to the prometheus server e.g.
:9090
- Add a new target, which will be the Windows exporter installed earlier.
- It will be something like <SQLSERVERIP>:9182/metrics
- Ensure the Target shows “Green”
Check that we can scrape SQLserver transactions. In the search/execute box enter something like this:
rate(windows_mssql_sqlstats_batch_requests[30s])*60
Put the SQLserver under load with something like HammerDB
Hit Execute on the Prometheus server search box and you should see a transaction rate similar to HammerDB
Install Grafana and Point it to the Prometheus server (See multiple examples of how to do this)
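
The same query can be issued without the browser through the Prometheus HTTP API (/api/v1/query). The sketch below only builds the URL; the server address localhost:9090 is an assumption, and the helper name instant_query_url is hypothetical.

```python
from urllib.parse import urlencode

def instant_query_url(server, promql):
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    # urlencode percent-encodes the brackets, parentheses and '*' in the query.
    return f"http://{server}/api/v1/query?" + urlencode({"query": promql})

url = instant_query_url(
    "localhost:9090",
    "rate(windows_mssql_sqlstats_batch_requests[30s])*60",
)
print(url)
```

Fetching that URL (for example with curl) returns JSON, which is handy for checking the scrape from a script before building Grafana dashboards.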

Continue reading →

Generate load on Microsoft SQLserver Windows from HammerDB on Linux

Posted on October 14, 2022 (updated December 29, 2022) by gary

HammerDB on Linux driving load to Windows SQL Server

Often it’s nice to be able to drive Windows applications and databases from Linux, especially if you are more comfortable in a Unix environment. This post will show you how to drive a Microsoft SQL Server database running on a Windows server from a remote Linux machine. In this example I am using Ubuntu 22.04, SQLserver 2019, Windows 11 and HammerDB 4.4

Continue reading →

QCOW 3 Ways

Posted on September 13, 2022 (updated September 14, 2022) by gary

How to mount QCOW images as Linux block devices

guestmount
losetup
nbd

tl;dr

guestmount (requires libguestfs-tools) sudo guestmount -d <vm-name> --ro -i <mountpoint>
qemu-nbd (requires the nbd driver)
- Load the kernel module modprobe nbd max_part=8
- Bind the device to the image qemu-nbd --connect=/dev/nbd0 <vmdiskimage.qcow>
- Assuming partition #1 is the target mount /dev/nbd0p1 /a
loopback mount. Requires converting qcow to raw
- Convert qcow to raw qemu-img convert -f qcow2 -O raw vmdisk.qcow2 vmdisk.raw
- Create a loopback device losetup -f -P vmdisk.raw
- Locate name of loopback device losetup -l | grep vmdisk.raw
- Mount (assuming partition #1 on loopback device 99) mount /dev/loop99p1 /a

Continue reading →

Create a Linux VM with KVM in 6 easy steps

Posted on September 10, 2022 (updated November 5, 2022) by gary

A step-by-step guide to creating a Linux virtual machine on a Linux host with KVM, QEMU, libvirt and Ubuntu cloud images.

Continue reading →