
Hardware Platform Layer
Choosing the right hardware technology to support a given machine learning application is another challenge for
platform design. While CPUs can be used for deep learning, they are scalar multiplication engines by nature and
poorly suited to the higher-order tensor operations (vectors, matrices, and beyond) common to deep learning.
Machine learning platforms therefore typically incorporate some form of accelerator technology: GPU, FPGA, or ASIC.
But even at that level there are trade-offs to consider, particularly concerning how operations are distributed across
multiple accelerators and how that distribution affects scaling. These considerations are described below.
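The scalar-versus-tensor mismatch can be made concrete with a small sketch (pure Python, purely illustrative): even a tiny matrix multiply decomposes into M × N × K scalar multiply-accumulates, which a scalar engine must grind through one at a time but which tensor-oriented accelerators execute in parallel.

```python
def matmul_scalar(a, b):
    """Naive matrix multiply over nested lists, counting scalar
    multiply-accumulate (MAC) operations -- the unit of work a
    scalar engine performs one at a time."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    macs = 0
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]  # one scalar MAC
                macs += 1
    return c, macs

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c, macs = matmul_scalar(a, b)  # a 2x2 * 2x2 product costs 2*2*2 = 8 MACs
```

Scaling the same arithmetic to the dimensions typical of deep learning layers (thousands on a side) is what makes massively parallel accelerators attractive.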
GPUs
GPUs have been the cornerstone of deep learning's growth in recent years because of the powerful parallel
compute capability of their relatively large number of independent logic cores. The different models by which
data is exchanged between GPUs are a differentiating feature when considering platform design.
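As an illustration of one widely used exchange model, the following is a minimal pure-Python simulation of ring all-reduce, a common pattern for summing gradients across GPUs. The "devices" here are plain lists and no GPUs are involved; this is a sketch of the algorithm, not of any vendor's implementation.

```python
def ring_allreduce(buffers):
    """Simulate ring all-reduce: after reduce-scatter and all-gather
    phases, every 'device' holds the element-wise sum of all inputs.
    Requires buffer length divisible by the device count."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0
    chunk = size // n
    data = [list(b) for b in buffers]

    def span(c):  # index range of chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: in each of n-1 steps, every device forwards one
    # chunk to its ring neighbour, which accumulates it. Sends are
    # snapshotted first to model simultaneous exchange.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            c = (i - step) % n
            sends.append((c, [data[i][k] for k in span(c)]))
        for i in range(n):
            c, payload = sends[i]
            dst = (i + 1) % n
            for idx, k in enumerate(span(c)):
                data[dst][k] += payload[idx]

    # All-gather: each device circulates its fully reduced chunk so
    # every device ends with the complete sum.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            c = (i + 1 - step) % n
            sends.append((c, [data[i][k] for k in span(c)]))
        for i in range(n):
            c, payload = sends[i]
            dst = (i + 1) % n
            for idx, k in enumerate(span(c)):
                data[dst][k] = payload[idx]
    return data
```

Each device sends and receives only 2 × (n−1)/n of the buffer in total, which is why this pattern scales well when direct device-to-device links are available.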
FPGAs & ASICs
Though GPUs dominate the deep learning market today, technology vendors from across the globe are lining up
to take aim at specific soft spots in that dominance. The latest FPGA and ASIC technology delivers new levels of
component-level performance-per-dollar, performance-per-watt, and small-batch efficiency, and will yield
competitive alternatives to current deep learning hardware in 2018 and beyond.
PCIe-based Accelerators
Using PCI-Express accelerators for machine learning has become popular for a number of previously discussed
reasons; one primary benefit is the ability to 'scale up' by using multiple accelerators in the same server.
The challenge in using more than one accelerator effectively is data exchange between the cards. The latency
and bandwidth limitations of routing data back through the host CPU's PCIe root complex, for example, can impose
a performance penalty large enough to negate the multi-accelerator benefit, as shown in Figure 4 below.
Modern non-blocking PCIe switches, as in Figure 5 below, can address this challenge by allowing the PCIe
accelerators to exchange data directly without passing through the host root complex, provided the framework
supports this type of communication path.
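The benefit of the direct path can be sketched with a toy latency model. All numbers below are illustrative assumptions, not measurements: a card-to-card copy routed through the host crosses an extra hop (the root complex) and pays its latency, while a peer-to-peer copy through a non-blocking switch does not.

```python
def copy_time_us(size_bytes, bw_gbps, hop_latency_us, hops):
    """Toy model: total copy time = per-hop latency for each hop
    crossed, plus serialization time at the link bandwidth.
    (Illustrative only; real PCIe behaviour is more complex.)"""
    return hops * hop_latency_us + size_bytes / (bw_gbps * 1e3)

MSG = 4 * 1024 * 1024  # assumed 4 MiB gradient shard

# Assumed figures: ~12 GB/s effective link bandwidth, ~2 us per hop.
via_host = copy_time_us(MSG, bw_gbps=12.0, hop_latency_us=2.0, hops=2)
via_switch = copy_time_us(MSG, bw_gbps=12.0, hop_latency_us=2.0, hops=1)
```

For large transfers the serialization term dominates, but for the many small, frequent exchanges typical of synchronized training the accumulated hop latency becomes significant.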
Figure 4: Acceleration using multiple GPUs tied to a single
CPU can result in performance bottlenecks.
Figure 5: Acceleration using a non-blocking PCIe switch.
Here again, balance is key. As accelerators are added to the switch, the host bandwidth between the switch and
the (single host) CPU eventually becomes the new bottleneck. Unfortunately, because neural networks, data sets,
and frameworks vary so widely, this crossover point is a moving target and very difficult to predict.
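A back-of-envelope model makes the uplink bottleneck visible. The bandwidth figure below is an approximate usable rate for a PCIe 3.0 x16 link and is an assumption for illustration; the point is simply that host-bound traffic from all cards behind the switch shares one uplink.

```python
PCIE3_X16_GBPS = 15.75  # approx. usable PCIe 3.0 x16 bandwidth, GB/s (assumed)

def per_card_host_bandwidth(n_cards, uplink_gbps=PCIE3_X16_GBPS):
    """Host-bound traffic from all cards behind the switch shares a
    single uplink, so per-card share falls as 1/n. Peer-to-peer
    traffic through the switch is not counted against the uplink."""
    return uplink_gbps / n_cards

for n in (1, 2, 4, 8):
    print(n, "cards:", round(per_card_host_bandwidth(n), 2), "GB/s each")
```

Where the 1/n host share actually starts to dominate depends on how much of a given workload's traffic is peer-to-peer versus host-bound, which is exactly why the crossover point is so hard to predict in advance.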
© 2018 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries