Administrator Guide
16 Deep Learning Performance Scale-Out
The benchmarks shown in Figure 14 were run on two PowerEdge C4140 servers, each with 4x V100 GPUs, connected by Mellanox ConnectX-5 network adapters at 100 Gbit/s over IPoIB. In distributed mode with Horovod, the Dell EMC configuration achieved 85% scaling efficiency for ResNet-50 with batch size 256 relative to ideal (linear) performance; it also achieved 95% scaling efficiency relative to a test run by the TensorFlow team in 2018 on a GCP (Google Cloud) VM (virtual machine) instance with 8x V100 GPUs and batch size 364 [5].
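Scaling efficiency here is the ratio of measured multi-node throughput to the ideal of linear scaling (single-node throughput times node count). A minimal sketch of that calculation follows; the throughput numbers are hypothetical placeholders chosen only to illustrate the 85% figure, not the measured values from Figure 14.

```python
def scaling_efficiency(multi_node_throughput, single_node_throughput, num_nodes):
    """Ratio of measured multi-node throughput to ideal linear scaling."""
    ideal = single_node_throughput * num_nodes
    return multi_node_throughput / ideal

# Hypothetical images/sec values for illustration only (not the Figure 14 data):
# one node sustains 3000 img/s, two nodes together sustain 5100 img/s.
eff = scaling_efficiency(5100.0, 3000.0, num_nodes=2)
print(f"{eff:.0%}")  # → 85%
```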
Performance Results - Long Tests Accuracy Convergence
Our final tests measured the total training time to accuracy convergence with the latest
TensorFlow version.
In this section we include all the batch sizes tested in our previous paper and compare
them with ResNet-50 using batch size 256.
Figure 15 shows the total training time when running ResNet-50 with different batch
sizes under both TensorFlow versions (TF 1.10 vs. TF 1.14 with XLA enabled).
On average, TF 1.14 with XLA was ~1.3X faster than our previous tests.
Figure 15: Multi Node PowerEdge C4140-M - ResNet-50’s Long Training for Accuracy Conv.
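The text does not state how XLA was enabled for the TF 1.14 runs, so the following is an assumption: the standard mechanism in TF 1.14 is XLA auto-clustering via the `TF_XLA_FLAGS` environment variable, which must be set before TensorFlow is imported.

```python
import os

# Assumption: XLA auto-clustering enabled through TF_XLA_FLAGS, the standard
# mechanism in TF 1.14; the benchmark's actual configuration is not documented
# here. The variable must be set before `import tensorflow` takes effect.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=2"

print(os.environ["TF_XLA_FLAGS"])  # → --tf_xla_auto_jit=2
```

With auto-clustering on, TensorFlow fuses eligible subgraphs into XLA-compiled kernels, which is consistent with the ~1.3X speedup reported above.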