Saturday, February 24, 2018

Are FPGAs the answer to HPC's woes?

Executive Summary

Not yet.  I'll demonstrate why no domain scientist would ever want to program in Verilog, then highlight a few promising directions of development that are addressing this fact.

The usual disclaimer also applies: the opinions and conjectures expressed below are mine alone and not those of my employer.  Also I am not a computer scientist, so I probably don't know what I'm talking about.  And even if it seems like I do, remember that I am a storage architect who is wholly unqualified to speak on applications and processor performance.

Premise

We're now in an age where CPU cores aren't getting any faster, and the difficulty of shrinking processes below 10 nm means we can't really pack any more CPU cores on a die.  Where's performance going to come from if we ever want to get to exascale and beyond?

Some vendors are betting on larger and larger vectors--ARM (with its Scalable Vector Extensions) and NEC (with its Aurora coprocessors) are going down this path.  However, algorithms that aren't predominantly dense linear algebra will need very efficient scatter and gather operations that can pack vector registers quickly enough to make doing a single vector operation worthwhile.  For example, gathering eight 64-bit values from different parts of memory to issue an eight-wide (512-bit) vector multiply requires pulling eight different cache lines--that's moving 4096 bits from memory for what amounts to 512 bits of computation.  In order to continue scaling vectors out, CPUs will have to rethink how their vector units interact with memory.  This means either (a) getting a lot more memory bandwidth to support these low flops-per-byte ratios, or (b) packing vectors closer to the memory so that pre-packed vectors can be fetched through the existing memory channels.

Another option to consider is GPUs, which work around the vector-packing issue by implementing massive numbers of registers and giant crossbars to plumb those bytes into arithmetic units.  Even then, though, relying on a crossbar to connect compute and data is difficult to continue scaling; the interconnect industry gave up on this long ago, which is why today's clusters now connect hundreds or thousands of crossbars into larger fat trees, hypercubes, and dragonflies.  GPUs are still using larger and larger crossbars--NVIDIA's V100 GPU is one of the physically largest single-die chips ever made--but there's an economic limit to how large a die can be.

This bleak outlook has begun to drive HPC designers towards thinking about smarter ways to use silicon.  Rather than build a general-purpose processor that can do all multiplication and addition operations at a constant rate, the notion is to bring hardware design closer to the algorithms being implemented.  This isn't a new idea (for example, RIKEN's MDGRAPE and DESRES's Anton are famous examples of purpose-built chips for specific scientific application areas), but this approach historically has been very expensive relative to just using general-purpose processor parts.  Only now are we at a place where special-purpose hardware may be the only way to sustain HPC's performance trajectory.

Given the diversity of applications that run on the modern supercomputer, though, expensive custom chips that only solve one problem aren't very appetizing.  FPGAs are a close compromise, and there has been a growing buzz surrounding the viability of relying on them for mainstream HPC workloads.

Many of us non-computer scientists in the HPC business have only a vague and qualitative notion of how FPGAs can realistically be used to carry out computations, though.  Since there is growing excitement around FPGAs for HPC as exascale approaches, I set out to get my hands dirty and figure out how they might fit into the larger HPC ecosystem.

Crash course in Verilog

Verilog can be very difficult to grasp for people who already know how to program in languages like C or Fortran (like me!).  On the one hand, it looks a bit like C in that it has variables to which values can be assigned, if/then/else controls, for loops, and so on.  However, these similarities are deceptive because Verilog does not execute like C; whereas a C program executes code line by line, one statement after the other, Verilog sort of executes all of the lines at the same time, all the time.

A C program to turn an LED on and off repeatedly might look like:
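Something like the following minimal sketch, where set_led() is a hypothetical stand-in for whatever actually drives the pin wired to the LED:

/* hypothetical helper; on real hardware this would write the LED's output pin */
void set_led(int state) { (void) state; }

int main(void)
{
    while (1) {
        set_led(1);   /* turn the LED on */
        set_led(0);   /* turn the LED off */
    }
    return 0;
}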

where the LED is turned on, then the LED is turned off, then we repeat.

In Verilog, you really have to describe what components your program will have and how they are connected.  At its most basic, the code to blink an LED in Verilog would look more like
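One minimal way to express this (the module and signal names here are just placeholders):

module blinker(
    input  clk,    // clock signal from the FPGA board's clock generator
    output led     // output wire connected to the LED
);
    reg state = 1'b0;

    always @(posedge clk)
        state <= ~state;    // flip the stored state on every rising clock edge

    assign led = state;     // the led wire always carries whatever is stored in state
endmodule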


Whereas C is a procedural language in that you describe a procedure for solving a problem, Verilog is more like a declarative language in that you describe how widgets can be arranged to solve the problem.

This can make tasks that are simple to accomplish in C comparatively awkward in Verilog. Take our LED blinker C code above as an example; if you want to slow down the blinking frequency, you can do something like
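Assuming the same hypothetical set_led() helper as before, the only change is a pair of sleeps:

#include <unistd.h>

/* hypothetical helper; on real hardware this would write the LED's output pin */
void set_led(int state) { (void) state; }

int main(void)
{
    while (1) {
        set_led(1);   /* turn the LED on */
        sleep(1);     /* wait one second */
        set_led(0);   /* turn the LED off */
        sleep(1);     /* wait another second */
    }
    return 0;
}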


Because Verilog is not procedural, there is no simple way to say "wait a second after you turn on the LED before doing something else." Instead, you have to rely on knowing how much time passes between consecutive clock pulses (rising edges of clk).

For example, the DE10-Nano has a 50 MHz clock generator, so a clock pulse arrives every 1/(50 MHz), or 20 nanoseconds, and everything time-based has to be derived from this fundamental tick.  The following Verilog statement:
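(assuming cnt has already been declared as a register somewhere above)

always @(posedge clk)
    cnt <= cnt + 1;    // on every rising edge of clk, add one to cnt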


indicates that, every 20 ns, the cnt register (variable) is incremented by one. To make the LED wait for one second after it is turned on, we need to figure out a way to do nothing for 50,000,000 clock cycles (1 second / 20 nanoseconds). The canonical way to do this is to
  1. create a big register that can store a number up to 50 million
  2. express that this register should be incremented by 1 on every clock cycle
  3. create a logic block that turns on the LED when our register is larger than 50 million
  4. rely on the register eventually overflowing to go back to zero
If we make cnt a 26-bit register, it can count 67,108,864 different values, and our Verilog can look something like
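Here is a sketch of that, with the LED-driving part left as a comment for now since we haven't yet wired up the LED (that's Problem #2 below):

reg [25:0] cnt;    // 26-bit counter; wraps back to zero after 2^26 - 1

always @(posedge clk) begin
    cnt <= cnt + 1;    // tick up once per 20 ns clock cycle
    if (cnt > 50000000) begin
        // turn the LED on (exactly how is Problem #2 below)
    end
    else begin
        // turn the LED off
    end
end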


However, we are still left with two problems:
  1. cnt will overflow back to zero once cnt surpasses 2^26 - 1
  2. We don't yet know how to express how the LED is connected to our FPGA and should be controlled by our circuit
Problem #1 (cnt overflowing) means that the LED will stay in one state for exactly 50,000,000 clock cycles (1 second), but it'll stay in the other state for only 2^26 - 50,000,000 cycles (17,108,864 cycles, or about 0.34 seconds). Not exactly the one second on, one second off that our C code does.

Problem #2 is solved by understanding the following:

  • our LED is external to the FPGA, so it will be at the end of an output wire
  • the other end of that output wire must be connected to something inside our circuit--a register, another wire, or something else

The conceptually simplest solution to this problem is to create another register (variable), this time only one bit wide, in which our LED state will be stored. We can then change the state of this register in our if (cnt > 50000000) block and wire that register to our external LED:
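Putting those pieces together (again, the module and port names are just placeholders):

module blinker(
    input  clk,    // 50 MHz clock
    output led     // output wire running to the external LED
);
    reg [25:0] cnt;
    reg        led_state;    // 1-bit register holding the LED's current state

    always @(posedge clk) begin
        cnt <= cnt + 1;
        if (cnt > 50000000)
            led_state <= 1'b1;    // turn the LED on
        else
            led_state <= 1'b0;    // turn the LED off
    end

    assign led = led_state;       // persistently connect the led output wire to led_state
endmodule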


Note that our assign statement is outside of our always @(posedge clk) block because this assignment--connecting our led output wire to our led_state register--is a persistent declaration, not the assignment of a particular value. We are saying "whatever value is stored in led_state should always be carried to whatever is on the other end of the led wire." Whenever led_state changes, led will simultaneously change as a result.

With this knowledge, we can actually solve Problem #1 now by
  1. only counting up to 50 million and not relying on overflow of cnt to turn the LED on or off, and
  2. overflowing the 1-bit led_state register every 50 million clock cycles
Our Verilog module would look like
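A sketch of that reworked module, keeping the same placeholder names as before:

module blinker(
    input  clk,    // 50 MHz clock
    output led     // output wire running to the external LED
);
    reg [25:0] cnt       = 26'd0;
    reg        led_state = 1'b0;

    always @(posedge clk) begin
        if (cnt == 26'd50000000) begin
            cnt       <= 26'd0;               // reset the counter instead of letting it wrap at 2^26
            led_state <= led_state + 1'b1;    // overflow the 1-bit register, i.e., toggle the LED
        end
        else begin
            cnt <= cnt + 1;                   // otherwise just keep counting 20 ns ticks
        end
    end

    assign led = led_state;
endmodule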


and we accomplish the "hello world" of circuit design:


This Verilog is actually still missing a number of additional pieces and makes very inefficient use of the FPGA's hardware resources. However, it shows how awkward it can be to express a simple, four-line procedural program using a hardware description language like Verilog.

So why bother with FPGAs at all?

It should be clear that solving a scientific problem using a procedural language like C is generally more straightforward than with a declarative language like Verilog. That ease of programming is made possible by a ton of hardware logic that isn't always used, though.

Consider our blinking LED example; because the C program is procedural, it takes one CPU thread to walk through the code in our program. Assuming we're using a 64-core computer, that means we can only blink up to 64 LEDs at once. On the other hand, our Verilog module consumes a tiny number of the programmable logic blocks on an FPGA. When compiled for a $100 hobbyist-grade DE10-Nano FPGA system, it uses only 21 of 41,910 programmable blocks, meaning it can control almost 2,000 LEDs concurrently**. A high-end FPGA would easily support tens of thousands.

The CM2 illuminated an LED whenever an operation was in flight. Blinking the LED in Verilog is easy.  Reproducing the CM2 microarchitecture is a different story.  Image credit to Corestore.
Of course, blinking LEDs haven't been relevant to HPC since the days of Connection Machines, but if you were to replace LED-blinking logic with floating point arithmetic units, the same conclusions apply.  In principle, a single FPGA can perform a huge number of floating-point operations every cycle by giving up its ability to perform many of the tasks that a more general-purpose CPU would be able to do.  And because FPGAs are reprogrammable, they can be quickly configured to have an optimal mix of special-purpose parallel ALUs and general purpose capabilities to suit different application requirements.

However, the fact that the fantastic potential of FPGAs hasn't materialized into widespread adoption is a testament to how difficult it is to bridge the wide chasm between understanding how to solve a physics problem and understanding how to design a microarchitecture.

Where FPGAs fit in HPC today

To date, a few scientific domains have had success in using FPGAs.  For example, FPGAs are widely used to process the raw data coming off of experimental detectors, and vendors like Convey and Edico have built FPGA-based appliances for domains such as bioinformatics.

The success of these FPGA products is due in large part to the fact that the end-user scientists don't ever have to directly interact with the FPGAs.  In the case of experimental detectors, FPGAs are sufficiently close to the detector that the "raw" data that is delivered to the researcher has already been processed by the FPGAs.  Convey and Edico products incorporate their FPGAs into an appliance, and the process of offloading certain tasks to the FPGA is wrapped up in proprietary applications that, to the research scientist, look like any other command-line analysis program.

With all this said, the fact remains that these use cases are all on the fringe of HPC.  They present a black-and-white decision to researchers; to benefit from FPGAs, scientists must completely buy into the applications, algorithms, and software stacks.  Seeing as how these FPGA HPC stacks are often closed-source and proprietary, the benefit of being able to see, modify, and innovate on open-source scientific code often outweighs the speedup benefits of the fast-but-rigid FPGA software ecosystem.

Where FPGAs will fit in HPC tomorrow

The way I see it, there are two things that must happen before FPGAs can become a viable general-purpose technology for accelerating HPC:
  1. Users must be able to integrate FPGA acceleration into their existing applications rather than replace their applications wholesale with proprietary FPGA analogues.
  2. It has to be as easy as f90 -fopenacc or nvcc to build an FPGA-accelerated application, and running the resulting accelerated binary has to be as easy as running an unaccelerated binary.
The first steps towards realizing this have already been made; both Xilinx and Intel/Altera now offer OpenCL runtime environments that allow scientific applications to offload computational kernels to the FPGA.  The Xilinx environment operates much like an OpenCL accelerator, where specific kernels are compiled for the FPGA and loaded as application-specific logic; the Altera environment installs a special OpenCL runtime environment on the FPGA.  However, there are a few challenges:
  • OpenCL tends to be very messy to code in compared to simpler APIs such as OpenACC, OpenMP, CUDA, or HIP.  As a result, not many HPC application developers are investing in OpenCL anymore.
  • Compiling an application for OpenCL on an FPGA still requires going through the entire Xilinx or Altera toolchain.  At present, this is not as simple as f90 -fopenacc or nvcc, and the process of compiling code that targets an FPGA can take orders of magnitude longer than it would for a CPU due to the NP-hard nature of placing and routing across all the programmable blocks.
  • The FPGA OpenCL stacks are not yet as polished and scientist-friendly as their CPU and GPU counterparts; performance analysis and debugging generally still have to be done at the circuit level, which is untenable for domain scientists.
Fortunately, these issues are under very active development, and the story surrounding FPGAs for HPC applications improves on a month-by-month basis.  We're still years from FPGAs becoming a viable option for accelerating scientific applications in a general sense, but when that day comes, I predict that programming in Verilog for FPGAs will seem as exotic as programming in assembly does for CPUs.

Rather, applications will likely rely on large collections of pre-compiled FPGA IP blocks (often called FPGA overlays) that map to common compute kernels.  It will then be the responsibility of compilers to identify places in the application source code where these logic blocks should be used to offload certain loops.  Since it's unlikely that a magic compiler will be able to identify these loops on its own, users will still have to rely on OpenMP, OpenACC, or some other API to provide hints at compile time.  Common high-level functions, such as those provided by LAPACK, will probably also be provided by FPGA vendors as pre-compiled overlays that are hand-tuned.
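The kind of hint I have in mind already exists for GPUs today; a directive-annotated loop looks something like the sketch below, and an FPGA-aware compiler could, in principle, map the same loop onto a pre-compiled overlay instead of GPU threads:

/* A plain C loop with an OpenACC hint.  Today this offloads to a GPU when built
 * with something like gcc -fopenacc or pgcc -acc; the hope is that an FPGA
 * overlay would someday be just another target for the same directive. */
void scale_vector(int n, double a, const double *restrict x, double *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];
}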

Concluding Thoughts

We're still years away from FPGAs being a viable option for mainstream HPC, and as such, I don't anticipate them being the key technology that will underpin the world's first exascale systems.  Until the FPGA software ecosystem and toolchain mature to a point where domain scientists never have to look at a line of Verilog, FPGAs will remain an accelerator technology at the fringes of HPC.

However, there is definitely a path for FPGAs to become mainstream, and forward progress is being made.  Today's clunky OpenCL implementations are already being followed up by research into providing OpenMP-based FPGA acceleration, and proofs of concept demonstrating OpenACC-based FPGA acceleration have shown promising levels of performance portability.  On the hardware side, FPGAs are also approaching first-class citizenship with Intel planning to ship Xeons with integrated FPGAs in 2H2018 and OpenPOWER beginning to ship Xilinx FPGAs with OpenCAPI-based coherence links for POWER9.

The momentum is building, and the growing urgency surrounding post-Moore computing technology is driving investments and demand from both public and private sectors.  FPGAs won't be the end-all solution that gets us to exascale, nor will they be the silver bullet that gets us beyond Moore's Law computing, but they will definitely play an increasingly important role in HPC over the next five to ten years.

If you've gotten this far and are interested in more information, I strongly encourage you to check out FPGAs for Supercomputing: The Why and How, presented by Hal Finkel, Kazutomo Yoshii, and Franck Cappello at ASCAC.  It provides more insight into the application motifs that FPGAs can accelerate, and a deeper architectural treatment of FPGAs as understood by real computer scientists.

** This is not really true.  Such a design would be limited by the number of physical pins coming out of the FPGA; in reality, output pins would have to be multiplexed, and additional logic to drive this multiplexing would take up FPGA real estate.  But you get the point.

Thursday, August 3, 2017

Understanding I/O on the mid-2017 iMac

My wife recently bought me a brand new mid-2017 iMac to replace my ailing, nine-year-old HP desktop.  Back when I got the HP, I was just starting to learn about how computers really worked and didn't really understand much about how the CPU connected to all of the other ports that came off the motherboard--everything that sat between the SATA ports and the CPU itself was a no-man's land of mystery to me.

Between then and now though, I've somehow gone from being a poor graduate student doing molecular simulation to a supercomputer I/O architect.  Combined with the fact that my new iMac had a bunch of magical new ports that I didn't understand (USB-C ports that can tunnel PCIe, USB 3.1, and Thunderbolt??), I figured I'd sit down and see if I could actually figure out exactly how the I/O subsystem on this latest Kaby Lake iMac was wired up.

I'll start out by saying that the odds were in my favor--over the last decade, the I/O subsystem of modern computers has gotten a lot simpler as more of the critical components (like the memory controllers and PCIe controllers) have moved on-chip.  As CPUs become more tightly integrated, individual CPU cores, system memory, and PCIe peripherals can all talk to each other without having to cross a bunch of proprietary middlemen like in days past.  Having to understand how the front-side bus clock is related to the memory channel frequency all gets swept under the rug that is the on-chip network, and I/O (that is, moving data between system memory and stuff outside of the CPU) is a lot easier.

With all that said, let's cut to the chase.  Here's a block diagram showing exactly how my iMac is plumbed, complete with bridges to external interfaces (like PCIe, SATA, and so on) and the bandwidths connecting them all:




Aside from the AMD Radeon GPU, just about every I/O device and interface hangs off of the Platform Controller Hub (PCH) through a DMI 3.0 connection.  When I first saw this, I was a bit surprised by how little I understood; PCIe makes sense since that is the way almost all modern CPUs (and their memory) talk to the outside world, but I'd never given the PCH a second thought, and I didn't even know what DMI was.

As with any complex system though, the first step towards figuring out how it all works is to break it down into simpler components.  Here's what I figured out.

Understanding the PCH

In the HPC world, all of the performance-critical I/O devices (such as InfiniBand channel adapters, NICs, SSDs, and GPUs) are directly attached to the PCIe controller on the CPU.  By comparison, the PCH is almost a non-entity in HPC nodes since all it does is provide low-level administration interfaces like a USB and VGA port for crash carts.  It had never occurred to me that desktops, which are usually optimized for universality over performance, would depend so heavily on the rinky-dink PCH.

Taking a closer look at the PCIe devices that talk to the Sunrise Point PCH:



we can see that the PCH chip provides PCIe devices that act as

  • a USB 3.0 controller
  • a SATA controller
  • a HECI controller (which acts as an SMBus controller)
  • a LPC controller (which acts as an ISA controller)
  • a PCI bridge (0000:00:1b) (to which the NVMe drive, not a real PCI device, is attached)
  • a PCIe bridge (0000:00:1c) that breaks out three PCIe root ports
Logically speaking, these PCIe devices are all directly attached to the same PCIe bus (domain #0000, bus #00; abbreviated 0000:00) as the CPU itself (that is, the host bridge device #00, or 0000:00:00).  However, we know that the PCH, by definition, is not integrated directly into the on-chip network of the CPU (that is, the ring that allows each core to maintain cache coherence with its neighbors).  So how can this be?  Shouldn't there be a bridge that connects the CPU's bus (0000:00) to a different bus on the PCH?

Clearly the answer is no, and this is a result of Intel's proprietary DMI interface which connects the CPU's on-chip network to the PCH in a way that is transparent to the operating system.  Exactly how DMI works is still opaque to me, but it acts like an invisible PCIe bridge that glues together physically separate PCIe buses into a single logical bus.  The major limitation to DMI as implemented on Kaby Lake is that it only has the bandwidth to support four lanes of PCIe Gen 3.

Given that DMI can only support the traffic of a 4x PCIe 3.0 device, there is an interesting corollary: the NVMe device, which attaches to the PCH via a 4x PCIe 3.0 link itself, can theoretically saturate the DMI link.  In such a case, all other I/O traffic (such as that coming from the SATA-attached hard drive and the gigabit NIC) is either choked out by the NVMe device or competes with it for bandwidth.  In practice, very few NVMe devices can actually saturate a PCIe 3.0 4x link though, so unless you replace the iMac's NVMe device with an Optane SSD, this shouldn't be an issue.

Understanding Alpine Ridge

The other mystery component in the I/O subsystem is the Thunderbolt 3 controller (DSL6540), called Alpine Ridge.  These are curious devices that I still admittedly don't understand fully (they play no role in HPC) because, among other magical properties, they can tunnel PCIe to external devices.  For example, the Thunderbolt-to-Ethernet adapters widely available for MacBooks are actually fully fledged PCIe NICs, wrapped in neat white plastic packages, that tunnel PCIe signaling over a cable.  In addition, they can somehow deliver this PCIe signaling, DisplayPort, and USB 3.1 through a single self-configuring physical interface.

It turns out that being able to run multiple protocols over a single cable is a feature of the USB-C physical specification, which is a completely separate standard from USB 3.1.  However, the PCIe magic that happens inside Alpine Ridge is a result of an integrated PCIe switch which looks like this:



The Alpine Ridge PCIe switch connects up to the PCH with a single PCIe 3.0 4x link and provides four downstream 4x ports for peripherals.  If you read the product literature for Alpine Ridge, it advertises two of these 4x ports for external connectivity; the remaining two 4x ports are internally wired up to two other controllers:

  • an Intel 15d4 USB 3.1 controller.  Since USB 3.1 runs at 10 Gbit/sec, this 15d4 USB controller  should support at least two USB 3.1 ports that can talk to the upstream PCH at full speed
  • a Thunderbolt NHI controller.  According to a developer document from Apple, NHI is the native host interface for Thunderbolt and is therefore the true heart of Alpine Ridge.
The presence of the NHI on the PCIe switch is itself kind of interesting; it's not a peripheral device so much as a bridge that allows non-PCIe peripherals to speak native Thunderbolt and still get to the CPU memory via PCIe.  For example, Alpine Ridge also has a DisplayPort interface, and it's likely that DisplayPort signals enter the PCIe subsystem through this NHI controller.

Although Alpine Ridge delivers some impressive I/O and connectivity options, it has some pretty critical architectural qualities that limit its overall performance in a desktop.  Notably,

  • Apple recently added support for external GPUs that connect to MacBooks through Thunderbolt 3.  While this sounds really awesome in the sense that you could turn a laptop into a gaming computer on demand, note that the best bandwidth you can get between an external GPU and the system memory is about 4 GB/sec, or the performance of a single PCIe 3.0 4x link.  This pales in comparison to the 16 GB/sec bandwidth available to the AMD Radeon which is directly attached to the CPU's PCIe controller in the iMac.
  • Except in the cases where Thunderbolt-attached peripherals are talking to each other via DMA, they all appear to compete with each other for access to the host memory through the single PCIe 4x upstream link.  4 GB/sec is a lot of bandwidth for most peripherals, but this does mean that an external GPU and a USB 3.1 external SSD or a 4K display will be degrading each other's performance.
In addition, Thunderbolt 3 advertises 40 Gbit/sec performance, but PCIe 3.0 4x only provides 32 Gbit/sec.  Thus, it doesn't look like you can actually get 40 Gbit/sec from Thunderbolt all the way to system memory under any conditions; the peak Thunderbolt performance is only available between Thunderbolt peripherals.
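For what it's worth, the arithmetic behind that statement is simple: PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding, so a 4x link tops out at roughly

    4 lanes × 8 GT/s × (128/130) ≈ 31.5 Gbit/s ≈ 3.9 GB/s

before any protocol overheads, which is where the "about 4 GB/sec" figures above come from.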

Overall Performance Implications

The way I/O in the iMac is connected definitely introduces a lot of performance bottlenecks that would make this a pretty scary building block for a supercomputer.  The fact that the Alpine Ridge's PCIe switch has a 4:1 taper to the PCH, and the PCH then further tapers all of its peripherals to a single 4x link to the CPU, introduces a lot of cases where performance of one component (for example, the NVMe SSD) can depend on what another device (for example, a USB 3.1 peripheral) is doing.  The only component which does not compromise on performance is the Radeon GPU, which has a direct connection to the CPU and its memory; this is how all I/O devices in typical HPC nodes are connected.

With all that being said, the iMac's I/O subsystem is a great design for its intended use.  It effectively trades peak I/O performance for extreme I/O flexibility; whereas a typical HPC node would ensure enough bandwidth to operate an InfiniBand adapter at full speed while simultaneously transferring data to a GPU, it wouldn't support plugging in a USB 3.1 hard drive or a 4K monitor.

Plugging USB 3 hard drives into an HPC node is surprisingly annoying.  I've had to do this for bioinformaticians, and it involves installing a discrete PCIe USB 3 controller alongside high-bandwidth network controllers.

Curiously, as I/O becomes an increasingly prominent bottleneck in HPC though, we are beginning to see very high-performance and exotic I/O devices entering the market.  For example, IBM's BlueLink  is able to carry a variety of protocols at extreme speeds directly into the CPU, and NVLink over BlueLink is a key technology enabling scaled-out GPU nodes in the OpenPOWER ecosystem.  Similarly, sophisticated PCIe switches are now proliferating to meet the extreme on-node bandwidth requirements of NVMe storage nodes.

Ultimately though, PCH and Thunderbolt aren't positioned well to become HPC technologies.  If nothing else, I hope this breakdown helps illustrate how performance, flexibility, and cost drive the system design decisions that make desktops quite different from what you'd see in the datacenter.

Appendix: Deciphering the PCIe Topology

Figuring out everything I needed to write this up involved a little bit of anguish.  For the interested reader, here's exactly how I dissected my iMac to figure out how its I/O subsystem was plumbed.

Foremost, I had to boot my iMac into Linux to get access to dmidecode and lspci since I don't actually know how to get at all the detailed device information from macOS.  From this,

ubuntu@ubuntu:~$ lspci -t -v
-[0000:00]-+-00.0  Intel Corporation Device 591f
           +-01.0-[01]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480]
           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
           +-14.0  Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller
           +-16.0  Intel Corporation Sunrise Point-H CSME HECI #1
           +-17.0  Intel Corporation Sunrise Point-H SATA controller [AHCI mode]
           +-1b.0-[02]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
           +-1c.0-[03]----00.0  Broadcom Limited BCM43602 802.11ac Wireless LAN SoC
           +-1c.1-[04]--+-00.0  Broadcom Limited NetXtreme BCM57766 Gigabit Ethernet PCIe
           |            \-00.1  Broadcom Limited BCM57765/57785 SDXC/MMC Card Reader
...

we see a couple of notable things right away:

  • there's a single PCIe domain, numbered 0000
  • everything branches off of PCIe bus number 00
  • there are a bunch of PCIe bridges hanging off of bus 00 (which connect to bus numbers 01, 02, etc.)
  • there are a bunch of PCIe devices hanging off both bus 00 and the other buses such as device 0000:00:14 (a USB 3.0 controller) and device 0000:01:00 (the AMD/ATI GPU)
  • at least one device (the GPU) has multiple PCIe functions (0000:01:00.0, a video output, and 0000:01:00.1 an HDMI audio output)

But lspci -t -v actually doesn't list everything that we know about.  For example, we know that there are bridges that connect bus 00 to the other buses, but we need to use lspci -vD to actually see the information those bridges provide to the OS:

ubuntu@ubuntu:~$ lspci -vD
0000:00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
DeviceName: SATA
Subsystem: Apple Inc. Device 0180
        ...
0000:00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        ...
Kernel driver in use: pcieport
0000:00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31) (prog-if 30 [XHCI])
Subsystem: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller
        ...
Kernel driver in use: xhci_hcd
0000:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev c0) (prog-if 00 [VGA controller])
Subsystem: Apple Inc. Ellesmere [Radeon RX 470/480]
        ...
Kernel driver in use: amdgpu
0000:01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
        ...
Kernel driver in use: snd_hda_intel
This tells us more useful information:

  • Device 0000:00:00 is the PCIe host bridge--this is the endpoint that all PCIe devices use to talk to the CPU and, by extension, system memory (since the system memory controller lives on the same on-chip network that the PCIe controller and the CPU cores do)
  • The PCIe bridge connecting bus 00 and bus 01 (0000:00:01) is integrated into the PCIe controller on the CPU.  In addition, the PCI ID for this bridge is the same as the one used on Intel Skylake processors--not surprising, since Kaby Lake is an optimization (not re-architecture) of Skylake.
  • The two PCIe functions on the GPU--0000:01:00.0 and 0000:01:00.1--are indeed a video interface (as evidenced by the amdgpu driver) and an audio interface (snd_hda_intel driver).  Their bus id (01) also indicates that they are directly attached to the Kaby Lake processor's PCIe controller--and therefore enjoy the lowest latency and highest bandwidth available to system memory.
Finally, the Linux kernel's sysfs interface provides a very straightforward view of every PCIe device's connectivity by presenting them as symlinks:

ubuntu@ubuntu:/sys/bus/pci/devices$ ls -l
... 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
... 0000:00:01.0 -> ../../../devices/pci0000:00/0000:00:01.0
... 0000:00:14.0 -> ../../../devices/pci0000:00/0000:00:14.0
... 0000:00:16.0 -> ../../../devices/pci0000:00/0000:00:16.0
... 0000:00:17.0 -> ../../../devices/pci0000:00/0000:00:17.0
... 0000:00:1b.0 -> ../../../devices/pci0000:00/0000:00:1b.0
... 0000:00:1c.0 -> ../../../devices/pci0000:00/0000:00:1c.0
... 0000:00:1c.1 -> ../../../devices/pci0000:00/0000:00:1c.1
... 0000:00:1c.4 -> ../../../devices/pci0000:00/0000:00:1c.4
... 0000:00:1f.0 -> ../../../devices/pci0000:00/0000:00:1f.0
... 0000:00:1f.2 -> ../../../devices/pci0000:00/0000:00:1f.2
... 0000:00:1f.3 -> ../../../devices/pci0000:00/0000:00:1f.3
... 0000:00:1f.4 -> ../../../devices/pci0000:00/0000:00:1f.4
... 0000:01:00.0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0
... 0000:01:00.1 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.1
... 0000:02:00.0 -> ../../../devices/pci0000:00/0000:00:1b.0/0000:02:00.0
... 0000:03:00.0 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:03:00.0
... 0000:04:00.0 -> ../../../devices/pci0000:00/0000:00:1c.1/0000:04:00.0
... 0000:04:00.1 -> ../../../devices/pci0000:00/0000:00:1c.1/0000:04:00.1
... 0000:05:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0
... 0000:06:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0
... 0000:06:01.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:01.0
... 0000:06:02.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0
... 0000:06:04.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:04.0
... 0000:07:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0/0000:07:00.0
... 0000:08:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0/0000:08:00.0

This topology, combined with the lspci outputs above, reveals that most of the I/O peripherals either are provided directly by or hang off of the Sunrise Point chip via its root ports (0000:00:1b.0 and 0000:00:1c.{0,1,4}).  There is another fan-out of PCIe ports hanging off of the Alpine Ridge chip (0000:05:00.0, which bridges out to buses 06, 07, and 08), and what's not shown are the Native Thunderbolt (NHI) connections, such as DisplayPort, on the other side of the Alpine Ridge.  Although I haven't looked very hard, I did not find a way to enumerate these Thunderbolt NHI devices.

There remain a few other open mysteries to me as well; for example, lspci -vv reveals the PCIe lane width of most PCIe-attached devices, but it does not obviously display the maximum lane width for each connection.  Furthermore, the USB, HECI, SATA, and LPC bridges hanging off the Sunrise Point do not list a lane width at all, so I still don't know exactly what level of bandwidth is available to these bridges.

If anyone knows more about how to peel back the onion on some of these bridges, or if I'm missing any important I/O connections between the CPU, PCH, or Alpine Ridge that are not enumerated via PCIe, please do let me know!  I'd love to share the knowledge and make this more accurate if possible.

Saturday, May 27, 2017

A less-biased look at tape versus disks

Executive Summary

Tape isn't dead despite what object store vendors may tell you, and it still plays an important role in both small- and large-scale storage environments.  Disk-based object stores certainly have eroded some of the areas where tape has historically been the obvious choice, but in the many circumstances where low latency is not required and high cost cannot be tolerated, tape remains a great option.

This post is a technical breakdown of some of the misconceptions surrounding the future of tape in the era of disk-based object stores as expressed in a recent blog post from an object store vendor's chief marketing officer.  Please note that the opinions stated below are mine alone and not a reflection of my employer or the organizations and companies mentioned.  I also have no direct financial interests in any tape, disk, or object store vendors or technologies.

Introduction

IBM 701 tape drive--what many people picture when they hear about tape-based storage.  It's really not still like this, I promise.
Scality, an object store software vendor whose product relies on hard disk-based (HDD-based) storage, recently posted a marketing blog post claiming that tape is finally going to die and disk is the way of the future.  While I don't often rise to the bait of marketing material, tape takes a lot more flak than it deserves simply because of how old a technology it is.  There is no denying that tape is old--it actually precedes the first computers by decades, and digital tape recording goes back to the early 1950s.  Like it or not though, tape technology is about as up-to-date as HDD technology (more on this later), and you're likely still relying on tape on a regular basis.  For example, Google relies on tape to archive your everyday data, including Gmail, because, in terms of cost per bit and power consumption, tape will continue to beat disk for years to come.  So in the interests of sticking up for tape, both its good and its bad, let's walk through Scality's blog post, authored by their chief of marketing Paul Turner, and tell the other side of the story.

1. Declining Tape Revenues

Mr. Turner starts by pointing out that "As far back as 2010, The Register reported a 25% decline in tape drive and media sales."  This decrease is undeniably true:

Market trends for LTO tape, 2008-2015.  Data from the Santa Clara Consulting Group, presented at MSST 2016 by Bob Fontana (IBM)

Although tape revenue has been decreasing, an increasing amount of data is landing on tape.  How can these seemingly contradictory trends be reconciled?

The reality is that the tape industry at large is not technologically limited in the way that CPUs, flash storage, or even spinning disk are.  Rather, the technology that underlies both the magnetic tape media and the drive heads that read and write this media is actually lifted over from the HDD industry.  That is, the hottest tape drives on the market today are using technology that the HDD industry figured out years ago.  As such, even if HDD innovation completely halted overnight, the tape industry would still be able to release new products for at least one or two more technology generations.

This is all to say that the rate at which new tape technologies reach market is not limited by the rate of innovation in the underlying storage technology.  Tape vendors simply lift HDD innovations over into new tape products when it becomes optimally profitable to do so, so declining tape revenues mean that the cadence of the tape technology refresh will stretch out.  While this certainly widens the gap between HDD and tape and suggests a slow down-ramping of tape as a storage medium, you cannot simply extrapolate these market trends in tape down to zero.  The tape industry just doesn't work like that.

2. The Shrinking Tape Vendor Ecosystem

Mr. Turner goes on to cite an article published in The Register about Oracle's EOL of the StorageTek line of enterprise tape:
"While this falls short of a definitive end-of-life statement, it certainly casts serious doubt on the product’s future. In fairness, we’ll note that StreamLine is a legacy product family originally designed and built for mainframes. Oracle continues to promote the open LTO tape format, which is supported by products from IBM, HPE, Quantum, and SpectraLogic."
To be fair, Mr. Turner deserves credit for pointing out that StorageTek (which was being EOL'ed) and LTO are different tape technologies, and Oracle continues to support LTO.  But let's be clear here--the enterprise (aka mainframe) tape market has been roughly only 10% of the global tape market by exabytes shipped, and even then, IBM and Oracle have been the only vendors in this space.  Oracle's exit from the enterprise tape market is roughly analogous to Intel recently EOL'ing Itanium with the 9700-series Kittson chips in that a boutique product is being phased out in favor of a product that hits a much wider market.

3. The Decreasing Cost of Disk

Mr. Turner goes on to cite a Network Computing article:
"In its own evaluation of storage trends, including the increasing prevalence of cloud backup and archiving, Network Computing concludes that “…tape finally appears on the way to extinction.” As evidence, they cite the declining price of hard disks,"
Hard disk prices decrease on a cost per bit basis, but there are a few facts that temper the impact of this trend:

Point #1: HDDs include both the media and the drive that reads the media.  This makes the performance of HDDs scale a lot more quickly than tape, but it also means HDDs have a price floor of around $40 per device.  The cost of the read heads, voice coil, and drive controller is not decreasing.  When compared to the tape cartridges of today (whose cost floor is limited by the magnetic tape media itself) or the archival-quality flash of tomorrow (think of how cheaply thumb drives can be manufactured), HDD costs don't scale very well.  And while one can envision shipping magnetic disk platters that rely on external drives to drive the cost per bit down, such a solution would look an awful lot like a tape archive.

Point #2: The technology that underpins the bit density of hard drives has been rapidly decelerating.  The ultra high-density HDDs of today seem to have maxed out at around 1 terabit per square inch using perpendicular magnetic recording (PMR) technology, so HDD vendors are just cramming more and more platters into individual drives.  As an example, Seagate's recently unveiled 12 TB PMR drives contain an astounding eight platters and sixteen drive heads; their previous 10 TB PMR drives contained seven platters, and their 6 TB PMR drives contained five platters.  Notice a trend?

There are truly new technologies that could radically change the cost-per-bit trajectory for hard drives, including shingled magnetic recording (SMR), heat-assisted magnetic recording (HAMR), and bit-patterned media (BPM).  However, SMR's severe performance limitations for non-sequential writes make it a harder sell as a wholesale replacement for tape.  HAMR and BPM hold much more universal promise, but they simply don't exist as products yet and therefore don't compete with tape.  Furthermore, considering our previous discussion of how tape technology evolves, the tape industry has the option to adopt these very same technologies to drive down the cost-per-bit of tape by a commensurate amount.

4. The Decreasing Cost of Cloud

Mr. Turner continues citing the Network Computing article, making the bold claim that two other signs of the end of tape are
"...the ever-greater affordability of cloud storage,"
This is deceptive.  Cloud providers are not charitable organizations; their decreasing costs are a direct reflection of the decreasing cost per bit of media, and those savings are realized irrespective of whether the media is hosted by a cloud provider or on-premise.  To be clear, the big cloud providers are definitely also reducing their costs by improving their efficiencies at scale; however, these savings are transferred to their customers only to the extent that they can be price competitive with each other.  My guess, which is admittedly uneducated, is that most of these cost savings are going to shareholders, not customers.
"and the fact that cloud is labor-free."
Let's be real here--labor is never "free" in the context of data management.  It is true that you don't need to pay technicians to swap tapes or disks in your datacenter if you have no tape (or no datacenter).  However, it's a bit insulting to presume that the only labor done by storage engineers is replacing media.  Storage requires babysitting regardless of whether it lives in the cloud or on-premise, and regardless of whether it is backed by tape or disk.  It needs to be integrated with the rest of a company's infrastructure and operations, and this is where the principal opex of storage should be spent.  Any company that is actually having to scale personnel linearly with storage is doing something terribly wrong, and making the choice to migrate to the cloud to save opex is likely putting a band-aid over a much bigger wound.

Finally, this cloud-versus-tape argument conflates disk as a technology with cloud as a business model.  There's nothing preventing tape from existing in the cloud; in fact, the Oracle Cloud does exactly this and hosts archival data in StorageTek archives at absolute rock-bottom prices--$0.001/GB per month, which shakes out to $1,000 per month to host a petabyte of archive.  Amazon Glacier also offers a tape-like performance and cost balance relative to its disk-based offerings.  The fact that you don't have to see the tapes in the cloud doesn't mean they don't exist and aren't providing you value.

5. The Performance of Archival Disk over Tape

The next argument posed by Mr. Turner is the same one that people have been using to beat up on tape for decades:
"...spotlighting a tape deficit that’s even more critical than price: namely, serial--and glacially slow--access to data."
This was a convincing argument back in the 1980s, but to be frank, it's really tired at this point.  If you are buying tape for low latency, you are doing something wrong.

As I discussed above, tape's benefits lie in its
  1. rock-bottom cost per bit, achievable because it uses older magnetic recording technology and does not package the drive machinery with the media like disk does, and
  2. total cost of ownership, which is due in large part to the fact that it does not draw power when data is at rest.
I would argue that if 
  1. you don't care about buying the cheapest bits possible (for example, if the cost of learning how to manage tape outweighs the cost benefits of tape at your scale), or
  2. you don't care about keeping power bills low (for example, if your university foots the power bill)
there are definitely better options for mass storage than tape.  Furthermore, if you need to access any bit of your data at nearline speeds, you should definitely be buying nearline storage media.  Tape is absolutely not nearline, and it would just be the wrong tool for the job.

However, tape remains the obvious choice in cases where data needs to be archived or a second copy has to be retained--think keeping an offline second copy of business-critical data, or shipping cold data off to an offline archive.
In both cases--offline second copy and offline archive--storing data in nearline storage often just doesn't make economic sense since the data is not being frequently accessed.

However, it is critical to point out that there are scales at which using tape does not make great sense. Let's break these scales out and look at each:

At small scales, where the number of cartridges is on the same order as the number of drives (e.g., a single drive with a handful of cartridges), tape is not too difficult to manage.  At these scales, such as those which might be found in a small business' IT department, performing offline backups of financials to tape is a lot less expensive than continually buying external USB drives and juggling them.

At large scales where the number of cartridges is far larger than the number of drives (e.g., in a data-driven enterprise or large-scale scientific computing complex), tape is also not too difficult to manage.  The up-front cost of tape library infrastructure and robotics is amortized by the annual cost of media, and sophisticated data management software (more on this below!) prevents humans from having to juggle tapes manually.

At medium scales, tape can be painful.  If the cost of libraries and robotics is difficult to justify when compared to the cost of the media (and therefore has a significant impact on the net $/GB of tape), you wind up having to pay people to do the job of robots in managing tapes.  This is a dangerous way to operate, as you are tickling the upper limits of how far you can scale people and you have to carefully consider how much runway you've got before you are better off buying robotics, disks, or cloud-based resources.

6. The Usability of Archival Disk over Tape

The Scality post then begins to paint with broad strokes:
"To access data from a disk-based archive, you simply search the index, click on the object or file you want, and presto, it’s yours.  By contrast, pulling a specific file from tape is akin to pulling teeth. First, you physically comb through a pile of cartridges, either at a remote site or by having them trucked to you."
The mistake that Mr. Turner makes here is conflating disk media with archival software.  Tape archives come with archival software just like disk archives do.  For example, HPSS indexes metadata from objects stored on tape in a DB2 database.  There's no "pulling teeth" to "identify a cartridge that seems to contain what you're looking for" and no "manually scroll[ing] through to pinpoint and retrieve the data."

Data management software systems including HPSS, IBM's Spectrum Protect, Cray's TAS, and SGI's DMF all provide features that can make your tape archive look an awful lot like an object store if you want them.  The logical semantics of storing data on disks versus tape are identical--you put some objects into an archive, and you get some objects out later.  The only difference is the latency of retrieving data on a tape.

That said, these archival software solutions also allow you to use both tape and disk together to ameliorate the latency hit of retrieving warmer data from the archive based on heuristics, data management policies, or manual intervention.  In fact, they provide S3 interfaces too, so you can make your tapes and disk-based object stores all look like one archive--imagine that!

What this all boils down to is that the perceived usability of tape is a function of the software on top of it, not the fact that it's tape and not disk.

7. Disks Enable Magical Business Intelligence

The Scality post tries to drive the last nail in the coffin of tape by conjuring up tales of great insight enabled by disk:
"...mountains of historical data are a treasure trove of hidden gems—patterns and trends of purchasing activity, customer preferences, and user behavior that marketing, sales, and product development can use to create smarter strategies and forecasts."
and
"Using disk-based storage, you can retrieve haystacks of core data on-demand, load it into analytic engines, and emerge with proverbial “needles” of undiscovered business insight."
which is to imply that tape is keeping your company stupid, and migrating to disk will propel you into a world of deep new insights:

Those of us doing statistical analysis on a daily basis keep this xkcd comic taped to our doors and pinned to our cubes.  We hear it all the time.

This is not to say that the technological sentiment expressed by Mr. Turner is wrong; if you have specific analyses you would like to perform over massive quantities of data on a regular basis, hosting that data in offline tape is a poor idea.  But if you plan on storing your large archive on disk because you might want to jump on the machine learning bandwagon someday, realize that you may be trading significant, guaranteed savings on media for a very poorly defined opportunity cost.  This tradeoff may be worth the risk in some early-stage, fast-moving startups, but it is unappetizing in more conservative organizations.

I also have to point out that "[g]one are the days when data was retained only for compliance and auditing" is being quite dramatic and disconnected from the realities of data and lifecycle management.  A few anecdotes:

  • Compliance: The United States Department of Energy and the National Science Foundation both have very specific guidance regarding the retention and management of data generated during federally funded research.  At the same time, extra funding is generally not provided to help support this data management, so eating the total cost of ownership of storing such data on disk over tape can be very difficult to justify when there is no funding to maintain compliance, let alone perform open-ended analytics on such data.
  • Auditing: Keeping second copies of data critical to business continuity is often a basic requirement in demonstrating due diligence.  In data-driven companies and enterprises, it can be difficult to rationalize keeping the second archival copy of such data nearline.  Again, it comes down to figuring out the total cost of ownership.
That said, the sentiment expressed by Mr. Turner is not wrong, and there are a variety of cases where keeping archival data nearline has clear benefits:
  • Cloud providers host user data on disk because they cannot predict when a user may want to look at an e-mail they received in 2008.  While it may cost more in media, power, and cooling to keep all users' e-mails nearline, being able to deliver desktop-like latency to users in a scalable way can drive significantly larger returns.  The technological details driving this use case have been documented in a fantastic whitepaper from Google.
  • Applying realtime analytics to e-commerce is a massive industry that is only enabled by keeping customer data nearline.  Cutting through the buzz and marketing surrounding this space, it's pretty darned neat that companies like Amazon, Netflix, and Pandora can actually suggest things to me that I might actually want to buy or consume.  These sorts of analytics could not happen if my purchase history was archived to tape.

Tape's like New Jersey - Not Really That Bad

Mr. Turner turns out to be the Chief Marketing Officer of Scality, a company that relies on disk to sell its product.  The greatest amount of irony, though, comes from the following statement of his:
"...Iron Mountain opines that tape is best. This is hardly a surprising conclusion from a provider of offsite tape archive services. It just happens to be incorrect."
Takeoff from Newark Liberty International Airport--what many people picture when they think of New Jersey.  It's really not all like this, I promise.
I suppose I shouldn't have been surprised that a provider of disk-dependent archival storage would conclude that tape is dead and disks are the future, and I shouldn't have risen to the bait.  But, like my home state of New Jersey, tape is a great punching bag for people with only a cursory knowledge of it.  Just as Newark Airport shapes most people's opinions of New Jersey, old images of reel-to-reel PDP-11s and audio cassettes make it easy to trash tape as a digital storage medium.  And just as I will always feel unduly compelled to stick up for my home state, I can't help but fact-check people who want to beat up on tape.

The reality is that tape really isn't that terrible, and there are plenty of aspects to it that make it a great storage technology.  Like everything in computing, understanding its strengths (its really low total cost) and weaknesses (its high access latency) is the best way to figure out if the costs of deploying or maintaining a tape-based archive make it a better alternative to disk-based archives.  For very small-scale or large-scale offline data archive, tape can be very cost effective.  As the Scality blog points out though, if you're somewhere in between, or if you need low-latency access to all of your data for analytics or serving user data, disk-based object storage may be a better value overall.

Many of Mr. Turner's points, if boiled down to their objective kernels, are not wrong.  Tape is on a slow decline in terms of revenue, and this may stretch out the cadence of new tape technologies hitting the market.  However, there will always be demand for high-endurance, low-cost, offline archive no matter how good object stores become, and I have a difficult time envisioning a way in which tape completely implodes in the next ten years.  It may be that, just as spinning disk is rapidly disappearing from home PCs, tape becomes even more of a boutique technology that primarily exists as the invisible backing store for a cloud-based archival solution.  I just don't buy into the doom and gloom, and I'll bet blog posts heralding the doom of tape will keep coming for years to come.