Figure 2. Scaling on Zenith with Gold 6148 processors using /dev/shm as the storage
Figure 2 shows the results of our scalability tests on Skylake. When scaling from 1 node to 128 nodes, speedup is within 90% of perfect
scaling. Above that, scaling drops off more rapidly, falling to 83% and 76% of perfect for 256 and 314 nodes respectively. This is most
likely due to a combination of factors. The first is decreasing node batch size: individual nodes tend to offer the best performance with
larger batch sizes, but to keep the overall batch below 8k, the node batch size must be reduced as nodes are added, so each node runs at a
suboptimal batch size. The second is communication overhead: by default, Intel Caffe's multi-node weight updates use MPI collectives at
the end of each batch to distribute the model weight data to all nodes. This allows each node to 'see' the data from all other nodes
without having to process the other nodes' images in the batch, and it is why training time improves when using multiple nodes rather
than simply training hundreds of individual models. Communication patterns and overhead are an area we plan to investigate in the
future.
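
To illustrate the two effects discussed above, the following sketch (not Intel Caffe's actual implementation; the 8k cap and variable names are taken from this discussion, and the gradient buffer is a stand-in) shows how the per-node batch shrinks as nodes are added and how an end-of-batch MPI allreduce lets every node incorporate the contributions of all other nodes without processing their images.

    # Illustrative sketch only; Intel Caffe's multi-node code differs in detail.
    import numpy as np
    from mpi4py import MPI

    GLOBAL_BATCH_CAP = 8192          # keep the overall batch below ~8k images
    comm = MPI.COMM_WORLD
    nodes = comm.Get_size()

    # Per-node batch shrinks as nodes are added, which can push each node
    # below its most efficient batch size.
    node_batch = GLOBAL_BATCH_CAP // nodes

    # Stand-in for the gradients computed from this node's slice of the batch.
    local_grads = np.random.randn(1000).astype(np.float32)

    # Collective at the end of the batch: every node receives the sum of all
    # nodes' gradients, so each 'sees' the whole batch without processing the
    # other nodes' images itself.
    global_grads = np.empty_like(local_grads)
    comm.Allreduce(local_grads, global_grads, op=MPI.SUM)
    global_grads /= nodes            # average before applying the weight update

As the node count grows, this collective touches every node at every batch boundary, which is one reason the communication cost becomes more visible at 256 nodes and beyond.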