Administrator Guide

17 Deep Learning Performance Scale-Out
Conclusion
The performance with TF 1.14 across the models was only slightly better (~1%-8%)
than with TF 1.10. However, TF 1.14 with XLA boosted performance by up to
~46% across the ResNet-50, Inception-v3 and Inception-v4 models.
For the ResNet-50 model, performance improved by up to ~3% with TF 1.14 alone, and
by up to ~46% with TF 1.14 and XLA enabled. ResNet-50 with batch size 256 scaled better
(1.46X) than ResNet-50 with batch size 128 (1.35X).
The configuration with the highest throughput (img/sec) was ResNet-50 with batch size 256
trained with distributed Horovod + TF 1.14 + XLA enabled.
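As a rough illustration of what this winning configuration looks like in code, the sketch below shows the typical TF 1.x session-config fragment for enabling XLA JIT compilation together with the standard Horovod setup calls. This is a minimal, hedged sketch assuming TF 1.14 and Horovod are installed; the learning rate value and optimizer choice are placeholders, not the values used in the benchmarks.

```python
# Sketch: Horovod + XLA setup for TF 1.x (assumes tensorflow==1.14 and
# horovod are installed; lr/momentum values below are illustrative only).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU

config = tf.ConfigProto()
# Pin each process to its local GPU.
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Enable XLA JIT compilation for the session.
config.graph_options.optimizer_options.global_jit_level = \
    tf.OptimizerOptions.ON_1

# Scale the learning rate by the number of workers, then wrap the
# optimizer so gradients are averaged across workers via allreduce.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

# Broadcast initial variable states from rank 0 to all other workers.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```

Launched with `horovodrun` (or `mpirun`) across the nodes, this is the general pattern behind the "distributed Horovod + TF 1.14 + XLA enabled" runs described above.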
The Dell EMC PowerEdge C4140 with the NVIDIA 4x NVLink architecture scales relatively well
when using the Uber Horovod distributed training library and Mellanox InfiniBand as the high-
speed link between nodes. For ResNet-50 with batch size 256, it scaled ~3.9X within the
node and ~6.9X across nodes.
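These speedups can be turned into scaling efficiencies with a one-line calculation. The sketch below assumes 4 GPUs per C4140 node and 8 GPUs for the two-node run; those GPU counts are an assumption inferred from the 4x NVLink configuration, not stated explicitly here.

```python
# Scaling efficiency = measured speedup / ideal linear speedup.
# Assumes 4 GPUs per C4140 node and 8 GPUs across two nodes.
def scaling_efficiency(speedup, n_gpus):
    """Fraction of ideal linear scaling achieved."""
    return speedup / n_gpus

intra_node = scaling_efficiency(3.9, 4)    # ~0.975 within one node
across_nodes = scaling_efficiency(6.9, 8)  # ~0.86 across two nodes
```

Under these assumptions, intra-node scaling is near-ideal (~97.5%), with the expected drop (~86%) once traffic crosses the InfiniBand link.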
On average, the long tests trained to accuracy convergence ~1.3X faster using
distributed Horovod + TF 1.14 + XLA enabled.
Performance improvements continue to be added at the GPU, library, and framework
levels. We are continually looking at how we can improve our performance results by
experimenting with different hyperparameters.
Running TensorFlow in multi-GPU/multi-node mode with Horovod distributed training and XLA
support improves model performance and reduces training time, allowing customers to do
more with no additional hardware investment.