Administrator Guide

17 Deep Learning Performance Scale-Out
Conclusion
The performance with TF 1.14 across the models was only slightly better (~1%-8%)
than with TF 1.10. However, TF 1.14 with XLA boosted performance by up to
~46% across the ResNet-50, Inception-v3 and Inception-v4 models.
For the ResNet-50 model, performance improved by up to ~3% with TF 1.14 alone, and
by up to ~46% with TF 1.14 and XLA enabled. ResNet-50 with batch size 256 scaled better
(1.46X) than ResNet-50 with batch size 128 (1.35X).
The configuration with the highest throughput (img/sec) was ResNet-50 with batch size 256
trained with distributed Horovod + TF 1.14 + XLA enabled.
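As a rough illustration of what this winning configuration looks like in code, the sketch below shows the typical TF 1.x session-config fragment for enabling XLA JIT compilation together with the standard Horovod setup calls. This is a minimal, hedged sketch assuming TF 1.14 and Horovod are installed; the learning rate value and optimizer choice are placeholders, not the values used in the benchmarks.

```python
# Sketch: Horovod + XLA setup for TF 1.x (assumes tensorflow==1.14 and
# horovod are installed; lr/momentum values below are illustrative only).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU

config = tf.ConfigProto()
# Pin each process to its local GPU.
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Enable XLA JIT compilation for the session.
config.graph_options.optimizer_options.global_jit_level = \
    tf.OptimizerOptions.ON_1

# Scale the learning rate by the number of workers, then wrap the
# optimizer so gradients are averaged across workers via allreduce.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

# Broadcast initial variable states from rank 0 to all other workers.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```

Launched with `horovodrun` (or `mpirun`) across the nodes, this is the general pattern behind the "distributed Horovod + TF 1.14 + XLA enabled" runs described above.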
The Dell EMC PowerEdge C4140 with the NVIDIA 4x NVLink architecture scales relatively well
when using the Uber Horovod distributed training library and Mellanox InfiniBand as the high-
speed link between nodes. For ResNet-50 with batch size 256, it scaled ~3.9X within the
node and ~6.9X across nodes.
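These speedups can be turned into scaling efficiencies with a one-line calculation. The sketch below assumes 4 GPUs per C4140 node and 8 GPUs for the two-node run; those GPU counts are an assumption inferred from the 4x NVLink configuration, not stated explicitly here.

```python
# Scaling efficiency = measured speedup / ideal linear speedup.
# Assumes 4 GPUs per C4140 node and 8 GPUs across two nodes.
def scaling_efficiency(speedup, n_gpus):
    """Fraction of ideal linear scaling achieved."""
    return speedup / n_gpus

intra_node = scaling_efficiency(3.9, 4)    # ~0.975 within one node
across_nodes = scaling_efficiency(6.9, 8)  # ~0.86 across two nodes
```

Under these assumptions, intra-node scaling is near-ideal (~97.5%), with the expected drop (~86%) once traffic crosses the InfiniBand link.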
On average, the long tests trained to accuracy convergence ~1.3X faster using
distributed Horovod + TF 1.14 + XLA enabled.
Performance improvements continue to be added at the GPU, library, and framework
levels. We are continually looking at how we can improve our performance results by
experimenting with different hyperparameters.
Running TensorFlow in multi-GPU/multi-node mode with Horovod distributed training and XLA
support improves model performance and reduces training time, allowing customers to do
more with no additional hardware investment.