Administrator Guide
16 Deep Learning Performance Scale-Out
The benchmarks shown in Figure 14 were run on two PowerEdge C4140 servers, each with 4x V100 GPUs, connected by Mellanox ConnectX-5 network adapters at 100 Gbit/s over IPoIB. In distributed mode with Horovod, the Dell EMC configuration achieved 85% scaling efficiency for ResNet-50 with batch size 256 relative to ideal (linear) performance; it also achieved 95% scaling efficiency relative to a test run by the TensorFlow team in 2018 on a GCP (Google Cloud) VM (virtual machine) instance with 8x V100 GPUs and batch size 364 [5].
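Scaling efficiency here is the ratio of measured multi-node throughput to the ideal of linear scaling (single-node throughput times node count). A minimal sketch of that calculation follows; the throughput numbers are hypothetical placeholders chosen only to illustrate the 85% figure, not the measured values from Figure 14.

```python
def scaling_efficiency(multi_node_throughput, single_node_throughput, num_nodes):
    """Ratio of measured multi-node throughput to ideal linear scaling."""
    ideal = single_node_throughput * num_nodes
    return multi_node_throughput / ideal

# Hypothetical images/sec values for illustration only (not the Figure 14 data):
# one node sustains 3000 img/s, two nodes together sustain 5100 img/s.
eff = scaling_efficiency(5100.0, 3000.0, num_nodes=2)
print(f"{eff:.0%}")  # → 85%
```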
Performance Results - Long Tests Accuracy Convergence
Our final tests measured the total training time to accuracy convergence with the latest
TensorFlow version.
In this section we include all the batch sizes tested in our previous paper and compare
them with ResNet-50 using batch size 256.
Figure 15 shows the total training time when running ResNet-50 with different batch
sizes under both TensorFlow versions (TF 1.10 vs. TF 1.14 with XLA enabled).
On average, TF 1.14 with XLA was ~1.3X faster than our previous tests.
Figure 15: Multi Node PowerEdge C4140-M - ResNet-50’s Long Training for Accuracy Conv.
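The text does not state how XLA was enabled for the TF 1.14 runs, so the following is an assumption: the standard mechanism in TF 1.14 is XLA auto-clustering via the `TF_XLA_FLAGS` environment variable, which must be set before TensorFlow is imported.

```python
import os

# Assumption: XLA auto-clustering enabled through TF_XLA_FLAGS, the standard
# mechanism in TF 1.14; the benchmark's actual configuration is not documented
# here. The variable must be set before `import tensorflow` takes effect.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=2"

print(os.environ["TF_XLA_FLAGS"])  # → --tf_xla_auto_jit=2
```

With auto-clustering on, TensorFlow fuses eligible subgraphs into XLA-compiled kernels, which is consistent with the ~1.3X speedup reported above.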