Figure 1. Processor model performance comparison relative to Broadwell
The difference in performance between the Gold 6148 and Platinum 8168 SKUs is around 5%. These results show that, for this
workload and version of Intel Caffe, the higher-end Platinum SKUs offer little additional performance over the Gold
CPUs. The KNL processor model tested performs very similarly to the Platinum models.
Multi-node Performance and Scaling
The multi-node runs were conducted on the HPC Innovation Lab's Zenith cluster, which is a Top500-ranked cluster (#292 on the Nov
2017 list). Zenith contains over 324 Skylake nodes and 160 KNL nodes, configured as listed in Table 1. The system uses Intel's Omni-Path
Architecture for its high-speed interconnect. The Omni-Path network consists of a single 768-port director switch, with all nodes
directly connected, providing a fully non-blocking fabric.
Scaling Caffe beyond a single node requires additional software; we used the Intel Machine Learning Scaling Library (MLSL). MLSL
provides an interface for common deep learning communication patterns, built on top of Intel MPI. It supports various high-speed
interconnects, and its API can be used by multiple frameworks.
The performance numbers on Zenith were obtained using /dev/shm, as in the single-node tests. The KNL multi-node tests
used a Dell EMC NFS Storage Solution (NSS), an optimized NFS solution. Batch sizes were reduced as node count increased to
keep the total batch size at or below 8k, within the bounds of this particular model. As node count increases, the
total batch size across all nodes in the test increases as well (assuming the per-node batch size is held constant). Very large
batch sizes complicate the gradient descent algorithm used to optimize the model, causing accuracy to suffer. Facebook has published
work on getting distributed training methods to scale to 8k batch sizes.
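The batch-size accounting described above can be sketched as a small calculation. The 8k cap comes from the text; the default per-node batch size of 128 used here is an illustrative assumption, not a value reported from the tests:

```python
# Sketch of constraining the per-node batch size as node count grows,
# so that (nodes * per-node batch) never exceeds the ~8k total.
# Assumption: a default per-node batch size of 128; the actual values
# used in the tests are not stated here.
MAX_TOTAL_BATCH = 8192
PER_NODE_DEFAULT = 128

def per_node_batch(nodes: int,
                   max_total: int = MAX_TOTAL_BATCH,
                   default: int = PER_NODE_DEFAULT) -> int:
    """Largest per-node batch (up to `default`) keeping nodes * batch <= max_total."""
    return min(default, max_total // nodes)

if __name__ == "__main__":
    for nodes in (1, 16, 64, 128, 256):
        b = per_node_batch(nodes)
        # Total batch grows with node count until the 8k cap forces
        # the per-node batch size down.
        print(f"{nodes:4d} nodes: {b:4d} per node, {nodes * b:5d} total")
```

At 64 nodes the default per-node batch still fits under the cap; beyond that, the per-node batch must shrink so the global batch stays in the regime where accuracy is known to hold.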