Advanced X-Ray: reducing runtime by re-using VMs.
Date: October 5, 2020 | Categories: X-Ray

Problem: For large datasets, creating the data on-disk can be time consuming.

Consider a cluster where we want to write 2TB per node and replication traffic runs at 10GbE per node. Assuming we have enough storage bandwidth, the throughput is wire bound, which makes our pre-fill stage roughly 2,000s, which is over half an hour.
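As a rough back-of-the-envelope check, treating 10GbE as about 1 GB/s of effective write bandwidth once protocol overhead is accounted for:

10 GbE ≈ 1.25 GB/s raw ≈ 1 GB/s effective per node
2 TB per node ÷ 1 GB/s ≈ 2,000 s ≈ 33 minutes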

Solution: Re-use the VMs across multiple tests.

There are three parts to the solution.

  1. Specify an “id” in the test.yml file
  2. Generate a random number for the workload ID in test.yml
  3. Remove any calls to Cluster Teardown or Cluster Cleanup from the test.yml of every workload that you want to share the VMs across

Example

Firstly we create an ID that ties the set of VMs created in one X-Ray test to subsequent X-Ray tests. In this case I use the number 255, and I will use the same “tag” (id: 255) every time I want to re-use these VMs.

name: server_virt_simulator
display_name: "[Prefill] Server Virtualization simulator 75 per node"
summary: Sustained, fixed rate server virtualization 75 Per node.
id: 255

I also need to generate a random suffix for the workload ID using X-Ray's Jinja templating functionality. You will not see this suffix in the Prism name of the VM; it is used by the worker VMs to identify the different workloads they have been asked to (re)run.

estimated_runtime: {{ _estimated_runtime }}
{% set runid = range(1,9999999999)|random %}
presets:
  small:
    vars:
      _estimated_runtime: 3600
      _node_selector: ":n"
      _iops_expected_value: 40
...
workloads:
  SRV_VIRT_WLOAD {{ runid }}:
    vm_group: SERVA
    config_file: {{ workload_file }}

Now I can run my first workload, which will set up and prefill the VMs for me. Then I can run my experimental workload(s) on the same set of VMs without having to wait for the VMs to be cloned, prefilled and powered on.
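As an illustration, a follow-up test that re-uses the prefilled VMs might look something like the sketch below. The name server_virt_experiment and its display text are made up for this example; the important parts are that it carries the same id: 255 as the prefill test and generates its own random runid, and I assume the same vm_group (SERVA) and workload file as the prefill test.

name: server_virt_experiment
display_name: "[Experiment] Server Virtualization simulator 75 per node"
summary: Experimental workload run against the VMs prefilled by the test above.
id: 255          # same id as the prefill test, so the existing worker VMs are re-used

{% set runid = range(1,9999999999)|random %}
# vm group and preset definitions omitted for brevity; they match the prefill test
workloads:
  SRV_VIRT_WLOAD {{ runid }}:   # fresh random suffix distinguishes this run on the shared VMs
    vm_group: SERVA
    config_file: {{ workload_file }}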

So, why don’t we do this by default?

Normally in X-Ray the VMs are created from scratch every time. This means we never have to remember how many disks are in use, how large they are, or how many CPUs are allocated, because those variables are embedded in the test itself. Put another way, since every aspect of the test is (a) created every time and (b) documented in the test, I never have to keep track of anything about the worker VMs. The worker VMs are ephemeral and exist only for the duration of the test, just like a microservice pattern. This kind of idempotency is the key to making benchmarking-as-code a reality.

…and why you might want to do it despite the drawbacks…

However, particularly during the research phase, a benchmark developer might want to optimize for iteration velocity over test hygiene. That is why re-using X-Ray test VMs is possible, but not the default behavior.

Prism Output

You can observe the change in naming from the Prism UI. The default X-Ray worker VM naming scheme is

__curie_test_<random_number>_<vm_group_name>_<index>

When we want to re-use the VMs created in one test in subsequent tests, we need to tell X-Ray which VMs to use (there may be hundreds of VMs on the system, potentially several of them called __curie_test_<something>).

This is what the VM name looks like with the re-use feature: rather than generating a random number, I specify my own ID (in this case, 255).

With re-use, Prism will show the X-Ray worker VMs with names like
__curie_test_<my_id>_<vm_group_name>_<index>
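
For example, with id: 255 and the vm_group SERVA from the snippets above, the first worker VM would show up in Prism with a name along the lines of:

__curie_test_255_SERVA_0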

Acknowledgements:

Thanks to Bob Allegreti and Bill Eubanks for blazing the trail on this technique.
