
Hardware Platform Layer
Choosing the right hardware technology to support a given machine learning application is another challenge for
platform design. While CPUs can be used for deep learning, they are scalar multiplication engines by nature and
poorly suited to the higher-order tensor operations (vectors, matrices, and beyond) common to deep learning.
Machine learning platforms therefore typically incorporate some form of accelerator technology: GPU, FPGA, or ASIC.
But even at that level there are trade-offs to consider, particularly concerning how operations are distributed across
multiple accelerators and how that distribution affects scaling. These considerations are described below.
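The scalar-versus-tensor mismatch can be made concrete with a small sketch (pure Python, purely illustrative): even a tiny matrix multiply decomposes into M × N × K scalar multiply-accumulates, which a scalar engine must grind through one at a time but which tensor-oriented accelerators execute in parallel.

```python
def matmul_scalar(a, b):
    """Naive matrix multiply over nested lists, counting scalar
    multiply-accumulate (MAC) operations -- the unit of work a
    scalar engine performs one at a time."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    macs = 0
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]  # one scalar MAC
                macs += 1
    return c, macs

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c, macs = matmul_scalar(a, b)  # a 2x2 * 2x2 product costs 2*2*2 = 8 MACs
```

Scaling the same arithmetic to the dimensions typical of deep learning layers (thousands on a side) is what makes massively parallel accelerators attractive.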
GPUs
GPUs have been the cornerstone of deep learning's growth in recent years because of the powerful parallel
compute capability of their relatively large number of independent logic cores. The different models by which
data is exchanged between GPUs are a differentiating feature when considering platform design.
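As an illustration of one widely used exchange model, the following is a minimal pure-Python simulation of ring all-reduce, a common pattern for summing gradients across GPUs. The "devices" here are plain lists and no GPUs are involved; this is a sketch of the algorithm, not of any vendor's implementation.

```python
def ring_allreduce(buffers):
    """Simulate ring all-reduce: after reduce-scatter and all-gather
    phases, every 'device' holds the element-wise sum of all inputs.
    Requires buffer length divisible by the device count."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0
    chunk = size // n
    data = [list(b) for b in buffers]

    def span(c):  # index range of chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: in each of n-1 steps, every device forwards one
    # chunk to its ring neighbour, which accumulates it. Sends are
    # snapshotted first to model simultaneous exchange.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            c = (i - step) % n
            sends.append((c, [data[i][k] for k in span(c)]))
        for i in range(n):
            c, payload = sends[i]
            dst = (i + 1) % n
            for idx, k in enumerate(span(c)):
                data[dst][k] += payload[idx]

    # All-gather: each device circulates its fully reduced chunk so
    # every device ends with the complete sum.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            c = (i + 1 - step) % n
            sends.append((c, [data[i][k] for k in span(c)]))
        for i in range(n):
            c, payload = sends[i]
            dst = (i + 1) % n
            for idx, k in enumerate(span(c)):
                data[dst][k] = payload[idx]
    return data
```

Each device sends and receives only 2 × (n−1)/n of the buffer in total, which is why this pattern scales well when direct device-to-device links are available.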
FPGAs & ASICs
Though GPUs dominate the deep learning market today, technology vendors from across the globe are lining up
to take aim at specific soft spots in that dominance. The latest FPGA and ASIC technology delivers new levels of
component-level performance-per-dollar, performance-per-watt, and small-batch efficiency, and will yield
competitive alternatives to current deep learning hardware in 2018 and beyond.
PCIe-based Accelerators
Using PCI-Express accelerators for machine learning has become popular for a number of previously discussed
reasons; one primary benefit is the ability to 'scale up' by using multiple accelerators in the same server.
The challenge in using more than one accelerator effectively is data exchange between the cards. The latency
and bandwidth limitations of routing data back through the host CPU's PCIe root complex, for example, can impose
a performance penalty large enough to negate the multi-accelerator benefit, as shown in Figure 4 below.
Modern non-blocking PCIe switches, as in Figure 5 below, can address this challenge by allowing the PCIe
accelerators to exchange data directly without passing through the host root complex, provided the framework
supports this type of communication path.
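The benefit of the direct path can be sketched with a toy latency model. All numbers below are illustrative assumptions, not measurements: a card-to-card copy routed through the host crosses an extra hop (the root complex) and pays its latency, while a peer-to-peer copy through a non-blocking switch does not.

```python
def copy_time_us(size_bytes, bw_gbps, hop_latency_us, hops):
    """Toy model: total copy time = per-hop latency for each hop
    crossed, plus serialization time at the link bandwidth.
    (Illustrative only; real PCIe behaviour is more complex.)"""
    return hops * hop_latency_us + size_bytes / (bw_gbps * 1e3)

MSG = 4 * 1024 * 1024  # assumed 4 MiB gradient shard

# Assumed figures: ~12 GB/s effective link bandwidth, ~2 us per hop.
via_host = copy_time_us(MSG, bw_gbps=12.0, hop_latency_us=2.0, hops=2)
via_switch = copy_time_us(MSG, bw_gbps=12.0, hop_latency_us=2.0, hops=1)
```

For large transfers the serialization term dominates, but for the many small, frequent exchanges typical of synchronized training the accumulated hop latency becomes significant.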
Figure 4: Acceleration using multiple GPUs tied to a single
CPU can result in performance bottlenecks.
Figure 5: Acceleration using a non-blocking PCIe switch.
Here again, balance is key. As accelerators are added to the switch, the host bandwidth between the switch and
the (single host) CPU eventually becomes the new bottleneck. Unfortunately, because neural networks, data sets,
and frameworks vary so widely, this crossover point is a moving target and very difficult to predict.
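A back-of-envelope model makes the uplink bottleneck visible. The bandwidth figure below is an approximate usable rate for a PCIe 3.0 x16 link and is an assumption for illustration; the point is simply that host-bound traffic from all cards behind the switch shares one uplink.

```python
PCIE3_X16_GBPS = 15.75  # approx. usable PCIe 3.0 x16 bandwidth, GB/s (assumed)

def per_card_host_bandwidth(n_cards, uplink_gbps=PCIE3_X16_GBPS):
    """Host-bound traffic from all cards behind the switch shares a
    single uplink, so per-card share falls as 1/n. Peer-to-peer
    traffic through the switch is not counted against the uplink."""
    return uplink_gbps / n_cards

for n in (1, 2, 4, 8):
    print(n, "cards:", round(per_card_host_bandwidth(n), 2), "GB/s each")
```

Where the 1/n host share actually starts to dominate depends on how much of a given workload's traffic is peer-to-peer versus host-bound, which is exactly why the crossover point is so hard to predict in advance.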
© 2018 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries