Storage bus speeds with example storage endpoints.
||Theoretical Bandwidth (MB/s)
||HBA <-> Single SATA Drive
||HBA <-> Single SAS Drive
||HBA <-> SAS/SATA Fanout
||4 Lane HBA to Breakout (6 SSD)
||HBA <-> SAS/SATA Fanout
||8 Lane HBA to Breakout (12 SSD)
||Single Lane PCIe3
||PCIe <-> SAS HBA or NVMe
||Enough for Single NVMe
||PCIe <-> SAS HBA or NVMe
||Enough for SAS-3 4 Lanes
||PCIe Bus <-> Processor Socket
||Xeon Direct connect to PCIe Bus
All figures here are theoretical maximums for the buses, using rough, easy conversions between bits/s and bytes/s. They are enough to figure out where the throughput bottlenecks are likely to be in a storage system.
- SATA devices contain a single SAS/SATA port (connection), and even when they are connected to a SAS3 HBA, the SATA protocol limits each SSD device to ~600MB/s (single port, 6Gbit)
- SAS devices may be dual ported (two connections to the device from the HBA(s)) – each with a 12Gbit connection giving a potential bandwidth of 2x12Gbit == 2.4Gbyte/s (roughly) per SSD device.
- An NVMe device directly attached to the PCIe bus has access to a bandwidth of 4GB/s by using 4 PCIe lanes – or 8GB/s using 8 PCIe lanes. On current Xeon processors, a single socket attaches to 40 PCIe lanes directly (see diagram below) for a total bandwidth of 40GB/s per socket.
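The rough arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the same easy conversion as the text (about 10 bits on the wire per byte, which matches the 8b/10b encoding used by SATA and SAS-3) and the text's figure of roughly 1GB/s per PCIe3 lane. The function names and the example path are hypothetical, not taken from any tool.

```python
# Rough conversion used in this article: ~10 bits on the wire per byte of payload
# (SATA and SAS-3 use 8b/10b encoding), so 6 Gbit/s is about 600 MB/s.
BITS_PER_BYTE_ON_WIRE = 10

# Assumption taken from the text: ~1 GB/s per PCIe3 lane (4 lanes is about 4 GB/s).
PCIE3_LANE_MB_PER_S = 1000


def link_mb_per_s(gbit_per_s, links=1):
    """Theoretical MB/s for `links` serial links of `gbit_per_s` each."""
    return gbit_per_s * 1000 * links / BITS_PER_BYTE_ON_WIRE


def pcie3_mb_per_s(lanes):
    """Theoretical MB/s for a PCIe3 connection with the given lane count."""
    return lanes * PCIE3_LANE_MB_PER_S


def bottleneck(path):
    """Return the (name, MB/s) pair of the slowest hop in a dict of hops."""
    return min(path.items(), key=lambda hop: hop[1])


# Hypothetical example: 12 SATA SSDs behind an 8-lane SAS-3 HBA in a PCIe3 x8 slot.
example_path = {
    "12 SATA SSDs": 12 * link_mb_per_s(6),
    "8 SAS-3 lanes": link_mb_per_s(12, links=8),
    "PCIe3 x8 slot": pcie3_mb_per_s(8),
}

if __name__ == "__main__":
    for name, mb in example_path.items():
        print(f"{name:>14}: {mb:8.0f} MB/s")
    print("bottleneck:", bottleneck(example_path))
```

In this made-up configuration the SATA drives themselves are the narrowest hop; swapping in dual-ported SAS SSDs would likely move the bottleneck to the PCIe slot.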
LSI SAS PCI Bottlenecks
- I started down the road of coming to grips with all the different buses and lane types after reading this excellent LSI paper. I omitted the SAS-2 figures from this article since modern systems use SAS-3 exclusively.
Intel Processor & PCI connections
It’s good to detect corrupted data. It’s even better to transparently repair that data and return the correct data to the user. Here we will demonstrate how the Nutanix filesystem detects and corrects corruption. Not all systems are created equal in this regard. The topic of corruption detection and remedy was the focus of the excellent Usenix paper Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions. The authors find that many systems that should in theory be able to recover corrupted data do not in fact do so.
Within the guest Virtual Machine
- Start with a Linux VM and write a specific pattern (0xdeadbeef) to /dev/sdg using fio.
- Check that the expected data is written to the virtual disk and generate a SHA1 checksum of the entire disk.
[root@gary]# od -x /dev/sdg
0000000 adde efbe adde efbe adde efbe adde efbe
[root@gary]# sha1sum /dev/sdg
- The “od” command shows us that the entire 1GB disk contains the pattern 0xdeadbeef
- The “sha1sum” command creates a checksum (digest) based on the content of the entire disk.
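The od output can look surprising at first, because the pattern 0xdeadbeef appears as "adde efbe". A minimal Python sketch, assuming a little-endian host (as on x86), shows why: `od -x` prints 16-bit words, so each pair of bytes is displayed swapped. The helper name `od_x_line` is made up for this illustration, and the in-memory buffer stands in for the real disk.

```python
import hashlib
import struct

PATTERN = bytes.fromhex("deadbeef")  # the bytes written to the disk, in order


def od_x_line(data):
    """Format bytes the way `od -x` shows them on a little-endian host:
    as 16-bit words, so each pair of bytes appears swapped."""
    words = struct.unpack("<%dH" % (len(data) // 2), data)
    return " ".join("%04x" % w for w in words)


# A small in-memory stand-in for the disk (the real test used a 1GB device).
disk = bytearray(PATTERN * 1024)
digest_before = hashlib.sha1(disk).hexdigest()

# Zero a single 4-byte word; any change like this alters the SHA1 digest.
corrupted = bytearray(disk)
corrupted[0:4] = b"\x00" * 4
digest_after = hashlib.sha1(corrupted).hexdigest()

if __name__ == "__main__":
    print(od_x_line(bytes(disk[:16])))
    print(digest_before != digest_after)
```

The digest comparison is the same idea as running sha1sum over the whole device before and after: if even one word differed, the two digests would not match.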
Within the Nutanix CVM
- Connect to the Nutanix CVM
- Locate one of the 4MB egroups that back this virtual disk on the node.
- The virtual disk which belongs to the guest vm (/dev/sdg) is represented in the Nutanix cluster as a series of “Egroups” within the Nutanix filesystem.
- Using some knowledge of the internals I can locate the Egroups which make up the vDisk seen by the guest.
- Double check that this is indeed an Egroup belonging to my vDisk by checking that it contains the expected pattern (0xdeadbeef)
nutanix@NTNX $ od -x 10808705.egroup
0000000 adde efbe adde efbe adde efbe adde efbe
- Now simulate a hardware failure and overwrite the egroup with null data
- I do this by reaching underneath the cluster filesystem and deliberately creating corruption, simulating a mis-directed write somewhere in the system.
- If the system does not correct this situation, the user VM will not read 0xdeadbeef as it expects – remember the corruption happened outside of the user VM itself.
nutanix@ $ dd if=/dev/zero of=10846352.egroup bs=1024k count=4
- Use the “dd” command to overwrite the entire 4MB Egroup with zero bytes read from /dev/zero.
Back to the client VM
- We can tell if the correct results are returned by checking the checksums match the pre-corrupted values.
[root@gary-tpc tmp]# sha1sum /dev/sdg
1c763488abb6e1573aa011fd36e5a3e2a09d24b0 /dev/sdg <-- Same SHA1 digest as the "pre corruption" state.
- The checksum matches the original value, showing that the data in the entire vdisk is unchanged.
- However, we did change the vdisk by overwriting one of the Egroups.
- The system has somehow detected and repaired the corruption which I induced.
- Nutanix keeps the checksums at an 8KB granularity as part of our distributed metadata. The system performs the following actions:
- Detects that the checksums stored in metadata no longer match the data on disk.
- The stored checksums were generated against “0xdeadbeef”
- The checksums generated during the read are computed against <NULL>
- The checksums will not match and corrective action is taken.
- Nutanix OS
- Finds the corresponding un-corrupted Egroup on another node
- Copies the uncorrupted Egroup to a new egroup on the local node
- Fixes the metadata to point to the new fixed copy
- Removes the corrupted egroup
- Returns the uncorrupted data to the user
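The detect-and-repair sequence above can be sketched as a toy model in Python. This is not the Nutanix implementation: the choice of CRC32, the in-memory dictionary of replicas, and names such as `read_with_repair` are assumptions made purely for illustration. Only the 8KB checksum granularity and the 4MB egroup size come from the text.

```python
import zlib

SLICE_SIZE = 8 * 1024            # checksum granularity from the text (8KB)
EGROUP_SIZE = 4 * 1024 * 1024    # egroup size from the text (4MB)


def slice_checksums(data):
    """One checksum per 8KB slice (CRC32 is an assumption for this sketch)."""
    return [zlib.crc32(data[i:i + SLICE_SIZE])
            for i in range(0, len(data), SLICE_SIZE)]


def read_with_repair(replicas, local, stored_checksums):
    """Return verified data, repairing the local copy from a good replica.

    replicas: hypothetical dict of node name -> bytearray copy of the egroup.
    """
    if slice_checksums(replicas[local]) == stored_checksums:
        return bytes(replicas[local])
    # Local copy is corrupt: look for a replica that still matches metadata.
    for node, candidate in replicas.items():
        if node != local and slice_checksums(candidate) == stored_checksums:
            replicas[local] = bytearray(candidate)  # replace the corrupt copy
            return bytes(candidate)
    raise IOError("no replica matches the stored checksums")


good = bytes.fromhex("deadbeef") * (EGROUP_SIZE // 4)
metadata = slice_checksums(good)  # checksums recorded at write time
replicas = {"node_a": bytearray(good), "node_b": bytearray(good)}

# Simulate the dd overwrite: zero the whole local egroup behind the system's back.
replicas["node_a"] = bytearray(EGROUP_SIZE)
result = read_with_repair(replicas, "node_a", metadata)

if __name__ == "__main__":
    print(result == good, bytes(replicas["node_a"]) == good)
```

The key design point the sketch tries to capture is that the checksums live in metadata, separate from the data, so a write that bypasses the filesystem cannot update both and the mismatch is caught on the next read.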
Logs from the Nutanix VM
Here are the logs from Nutanix: notice group 10846352 is the one that we deliberately corrupted earlier
E0315 13:22:37.826596 12085 disk_WAL.cc:1684] Marking extent group 10846352 as corrupt reason: kSliceChecksumMismatch
I0315 13:22:37.826755 12083 vdisk_micro_egroup_fixer_op.cc:156] vdisk_id=10808407 operation_id=387450 Starting fixer op on extent group 10846352 reason -1 reconstruction mode 0 (gflag 0) corrupt replica autofix mode (gflag auto) consistency checks 0 start erasure overwrite fixer 0
I0315 13:22:37.829532 12086 vdisk_micro_egroup_fixer_op.cc:801] vdisk_id=10808407 operation_id=387449 Not considering corrupt replica 38 of egroup 10846352
- Data corruption can and does happen (see the above Usenix paper for some of the causes). When designing enterprise storage we have to deal with it.
- Nutanix not only detects the corruption, it corrects it.
- In fact Nutanix OS continually scans the data stored on the cluster and makes sure that the stored data matches the expected checksums.
There are a lot of explanations for the current Meltdown/Spectre crisis, but many did not do a good job of explaining the core issue of how information is leaked from the secret side to the attacker's side. This is my attempt to explain it (mostly to myself, to make sure I got it right).
What is going on here generally?
- Users and the kernel are normally protected from bad-actors via privileged modes, address page tables and the MMU.
- It turns out that code executed speculatively can read any mapped memory, even addresses that would not be readable in the normal program flow.
- Thankfully, the results of illegal reads made by speculatively executed code are not directly accessible to the attacker.
- However, it turns out that we can execute a LOT of code in speculative mode if the pre-conditions are right.
- In fact modern instruction pipelines (and slow memory) allow >100 instructions to be executed while memory reads are resolved.
How does it work?
- The attacker reads the illegal memory using speculative execution, then uses the values read to set data in cache lines that ARE LEGITIMATELY VISIBLE to the attacker. This creates a side channel between the speculatively executed code and the normal user-written code.
- The values in the cache lines are not readable (by user code), but the fact that the cache lines were loaded (or not) *IS* detectable (via timing) since the L3 cache is shared across address spaces.
- First I ensure the cache lines I want to use in this process are empty.
- Then I set up some code that reads an illegal value (using the speculative execution technique) and, depending on whether that value is 0 or !=0, reads some other specific address in the attacker's address space that I know will be cached in cache-line 1. Pretend I execute the second read only if the illegal value is !=0.
- Finally back in normal user code I attempt to read that same address in my “real” user space. And if I get a quick response – I know that the illegal value was !=0, because the only way I get a quick response is if the cache line was loaded during the speculative execution phase.
- It turns out we can encode an entire byte using this method. See below.
- The attacker reads a byte – then by using bit shifting etc. – the attacker encodes all 8 bits in 8 separate cache lines that can then be subsequently read.
- At this point an attacker has read a memory address he was not allowed to, encoded that value in shared cache-lines and then tested the existence or not of values in the cache lines via timing, and thus re-constructs the value encoded in them during the speculative phase.
- This is known as “leakage“.
- Broadly there are two phases in this technique
- The reading of illegal memory in speculative execution phase then encoding the byte in shared cache lines.
- Using the timing of reads to those same cache lines to determine whether they were “set” (loaded, i.e. “1”) or unset (empty, “0”) by the attacker, and so decoding the byte from the set/unset (1/0) cache lines.
- Side channels have been a known phenomenon for years (at least since the 1990s). What’s different now is how easily, and with how low an error rate, attackers are able to read arbitrary memory addresses.
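The two phases can be illustrated with a small simulation in Python. This is emphatically not an exploit and touches no real cache: the `SimulatedCache` class, the latency numbers, and the function names are all hypothetical stand-ins. It only models the logic described above, where one cache line is loaded per set bit and the byte is later recovered purely from access timing.

```python
HIT_CYCLES = 40      # made-up latency for a line already in the cache
MISS_CYCLES = 200    # made-up latency for a line that must be fetched
THRESHOLD = 100      # anything faster than this is treated as a cache hit


class SimulatedCache:
    """Toy model: tracks which of the attacker's cache lines are loaded."""

    def __init__(self):
        self.loaded = set()

    def flush(self):
        self.loaded.clear()

    def touch(self, line):
        self.loaded.add(line)

    def timed_read(self, line):
        """Return a simulated latency; reading also loads the line."""
        latency = HIT_CYCLES if line in self.loaded else MISS_CYCLES
        self.loaded.add(line)
        return latency


def speculative_encode(cache, secret_byte):
    """Phase 1 (stands in for the speculative code): touch line N only if
    bit N of the secret is 1. The secret itself is never returned."""
    for bit in range(8):
        if (secret_byte >> bit) & 1:
            cache.touch(bit)


def timed_decode(cache):
    """Phase 2 (normal user code): rebuild the byte from read timings alone."""
    value = 0
    for bit in range(8):
        if cache.timed_read(bit) < THRESHOLD:
            value |= 1 << bit
    return value


def leak(secret_byte):
    cache = SimulatedCache()
    cache.flush()                       # start with the lines empty
    speculative_encode(cache, secret_byte)
    return timed_decode(cache)


if __name__ == "__main__":
    print(hex(leak(0xA5)))
```

On real hardware the timings are noisy and the attack has to be repeated and averaged, which this toy model leaves out.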
I found these papers to be informative and readable.