Figure 2. Scaling on Zenith with Gold 6148 processors using /dev/shm as the storage
Figure 2 shows the results of our scalability tests on Skylake. When scaling from 1 node to 128 nodes, speedup is within 90% of perfect
scaling. Above that, scaling drops off more rapidly, falling to 83% and 76% of perfect for 256 and 314 nodes respectively. This is most
likely due to a combination of factors. The first is decreasing node batch size: individual nodes tend to offer the best performance with
larger batch sizes, but to keep the overall batch below 8k, the node batch size must be reduced as nodes are added, so each node runs at a
suboptimal batch size. The second is communication overhead: by default, Intel Caffe's multi-node weight updates use MPI collectives at
the end of each batch to distribute the model weight data to all nodes. This allows each node to 'see' the data from all other nodes
without having to process the other nodes' images in the batch, and it is why training time improves when using multiple nodes rather
than simply training hundreds of individual models. Communication patterns and overhead are an area we plan to investigate in the
future.
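
To illustrate the two effects discussed above, the following sketch (not Intel Caffe's actual implementation; the 8k cap and variable names are taken from this discussion, and the gradient buffer is a stand-in) shows how the per-node batch shrinks as nodes are added and how an end-of-batch MPI allreduce lets every node incorporate the contributions of all other nodes without processing their images.

    # Illustrative sketch only; Intel Caffe's multi-node code differs in detail.
    import numpy as np
    from mpi4py import MPI

    GLOBAL_BATCH_CAP = 8192          # keep the overall batch below ~8k images
    comm = MPI.COMM_WORLD
    nodes = comm.Get_size()

    # Per-node batch shrinks as nodes are added, which can push each node
    # below its most efficient batch size.
    node_batch = GLOBAL_BATCH_CAP // nodes

    # Stand-in for the gradients computed from this node's slice of the batch.
    local_grads = np.random.randn(1000).astype(np.float32)

    # Collective at the end of the batch: every node receives the sum of all
    # nodes' gradients, so each 'sees' the whole batch without processing the
    # other nodes' images itself.
    global_grads = np.empty_like(local_grads)
    comm.Allreduce(local_grads, global_grads, op=MPI.SUM)
    global_grads /= nodes            # average before applying the weight update

As the node count grows, this collective touches every node at every batch boundary, which is one reason the communication cost becomes more visible at 256 nodes and beyond.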