Administrator Guide

Copyright © 2019 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries
The ResNet-50 model has 150,528 input neurons, 1,000 output neurons, and 50 layers, totaling
3.8 billion operations. With recent improvements in the OpenVINO SDK, the Intel PAC with Arria
10 FPGA can comfortably run this ResNet-50 model at higher performance than previously
published numbers.
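The input-neuron count is consistent with the 224 x 224 RGB input commonly used for ResNet-50. The text gives only the total, so the shape below is an assumption. A minimal check of the arithmetic:

```python
# Assumed input shape: 224 x 224 pixels with 3 color channels.
# The shape is not stated in the text; only the total (150,528) is.
HEIGHT, WIDTH, CHANNELS = 224, 224, 3


def input_neurons(height: int, width: int, channels: int) -> int:
    """Return the number of input values the network receives per image."""
    return height * width * channels


total = input_neurons(HEIGHT, WIDTH, CHANNELS)
print(total)
```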
Results
We set up the Intel PAC on a Dell EMC PowerEdge R740 server with two Intel Xeon Gold 6130 CPUs
@ 2.10 GHz running CentOS Linux (release 7.6). We present the FPGA performance in
comparison with the CPUs, noting the throughput, latency, and energy efficiency achieved on
both devices.
Throughput
Throughput measures how many images are processed per second. Here, we denote this measure
as frames per second (FPS). Images can be processed in sets of one or more; the number of
images in a set is referred to as the batch size.
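The FPS definition above can be sketched as a small timing helper. This is an illustration only, not the benchmark harness used for the results here: `frames_per_second`, `measure_fps`, and the `infer_batch` callable are hypothetical names standing in for whatever inference call is being measured.

```python
import time
from typing import Callable, Sequence


def frames_per_second(num_images: int, elapsed_seconds: float) -> float:
    """Throughput in FPS: images processed divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_images / elapsed_seconds


def measure_fps(infer_batch: Callable[[Sequence], object],
                batch: Sequence,
                iterations: int = 10) -> float:
    """Time `iterations` calls of a (hypothetical) inference function on one batch.

    The batch size is len(batch); every call processes the whole batch.
    """
    start = time.perf_counter()
    for _ in range(iterations):
        infer_batch(batch)
    elapsed = time.perf_counter() - start
    return frames_per_second(len(batch) * iterations, elapsed)
```

For example, a batch of 64 images processed 10 times in 2 seconds gives `frames_per_second(640, 2.0)`, or 320 FPS. A real measurement would also typically discard a few warm-up iterations before timing.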
Figure 4 shows the throughput of the FPGA and the CPU for batch sizes of 1, 16, and 64 across
different CPU configurations, i.e., thread counts. The FPGA throughput is constant for a given
batch size regardless of the thread count, indicating that the FPGA provides deterministic
performance. In practice, only 1 CPU thread (for offload) is necessary to achieve maximum
throughput with the FPGA; the remaining CPU threads are free to perform other tasks.
By contrast, as indicated by the dotted horizontal lines, the CPU surpasses the FPGA in FPS
only when both the batch size and the thread count reach 64. Even then, doubling the CPU
threads from 32 to 64 raises throughput by a mere 11%. For brevity, we considered only thread
counts that are increasing powers of 2. Curious readers are encouraged to try intermediate
thread counts to find the point at which CPU and FPGA performance are equal.
Figure 4: Throughput. In dynamic systems, where the number of available cores may be
unknown, the FPGA will provide deterministic performance since only 1 thread is necessary.
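To put the 11% figure in perspective, the sketch below converts a throughput gain into a scaling efficiency, i.e., the achieved speedup divided by the ideal speedup for the added threads. The function names are hypothetical, and the baseline of 100.0 is a normalized placeholder rather than a measured FPS value; only the 11% gain and the 32 and 64 thread counts come from the text.

```python
def percent_gain(fps_before: float, fps_after: float) -> float:
    """Relative throughput increase, in percent."""
    return (fps_after - fps_before) / fps_before * 100.0


def scaling_efficiency(fps_before: float, fps_after: float,
                       threads_before: int, threads_after: int) -> float:
    """Achieved speedup divided by ideal (linear) speedup; 1.0 is perfect scaling."""
    speedup = fps_after / fps_before
    ideal = threads_after / threads_before
    return speedup / ideal


# Normalized placeholder: baseline throughput set to 100.0 (arbitrary units),
# so an 11% increase corresponds to 111.0. These are not measured FPS values.
baseline, doubled = 100.0, 111.0

gain = percent_gain(baseline, doubled)
efficiency = scaling_efficiency(baseline, doubled, 32, 64)
print(f"gain: {gain:.1f}%, scaling efficiency: {efficiency:.3f}")
```

Under these assumptions, doubling the threads delivers a bit more than half of the throughput that perfectly linear scaling would give.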