Figure 1. Processor model performance comparison relative to Broadwell
The difference in performance between the Gold 6148 and Platinum 8168 SKUs is around 5%. These results show that, for this
workload and version of Intel Caffe, the higher-end Platinum SKUs offer little additional performance over the Gold
CPUs. The KNL processor model tested performs very similarly to the Platinum models.
Multi-node Performance and Scaling
The multi-node runs were conducted on the HPC Innovation Lab's Zenith cluster, which is a Top500-ranked cluster (#292 on the Nov
2017 list). Zenith contains over 324 Skylake nodes and 160 KNL nodes, configured as listed in Table 1. The system uses Intel's Omni-Path
Architecture for its high-speed interconnect. The Omni-Path network consists of a single 768-port director switch, with all nodes
directly connected, providing a fully non-blocking fabric.
Scaling Caffe beyond a single node requires additional software; we used the Intel Machine Learning Scaling Library (MLSL). MLSL
provides an interface for common deep learning communication patterns, built on top of Intel MPI. It supports various high-speed
interconnects, and its API can be used by multiple frameworks.
The performance numbers on Zenith were obtained using /dev/shm, as in the single-node tests. The KNL multi-node tests
used a Dell EMC NFS Storage Solution (NSS), an optimized NFS solution. Batch sizes were reduced as node count increased to
keep the total batch size at or below 8k, within the bounds of this particular model. As node count increases, the
total batch size across all nodes in the test increases as well (assuming the per-node batch size is held constant). Very large
batch sizes complicate the gradient descent algorithm used to optimize the model, causing accuracy to suffer. Facebook has published
work on getting distributed training methods to scale to 8k batch sizes.
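The batch-size accounting described above can be sketched as a small calculation. The 8k cap comes from the text; the default per-node batch size of 128 used here is an illustrative assumption, not a value reported from the tests:

```python
# Sketch of constraining the per-node batch size as node count grows,
# so that (nodes * per-node batch) never exceeds the ~8k total.
# Assumption: a default per-node batch size of 128; the actual values
# used in the tests are not stated here.
MAX_TOTAL_BATCH = 8192
PER_NODE_DEFAULT = 128

def per_node_batch(nodes: int,
                   max_total: int = MAX_TOTAL_BATCH,
                   default: int = PER_NODE_DEFAULT) -> int:
    """Largest per-node batch (up to `default`) keeping nodes * batch <= max_total."""
    return min(default, max_total // nodes)

if __name__ == "__main__":
    for nodes in (1, 16, 64, 128, 256):
        b = per_node_batch(nodes)
        # Total batch grows with node count until the 8k cap forces
        # the per-node batch size down.
        print(f"{nodes:4d} nodes: {b:4d} per node, {nodes * b:5d} total")
```

At 64 nodes the default per-node batch still fits under the cap; beyond that, the per-node batch must shrink so the global batch stays in the regime where accuracy is known to hold.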