
Deep Learning Performance: Scale-up vs Scale-out
Architectures & Technologies Dell EMC | Infrastructure Solutions Group
Figure 14 shows how, with GPUDirect RDMA, GPU memory is accessed directly over the network instead of the data being copied multiple times across system components. This feature is reflected directly in the throughput performance of the server.
Figure 14: NVIDIA GPUDirect RDMA connection. Source: https://www.sc-asia.org
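As a rough sketch of how one might verify that GPUDirect RDMA is available on a node before running the benchmarks (these commands assume an InfiniBand setup with the `nv_peer_mem` kernel module installed; `train.py` is a placeholder for the actual training script):

```shell
# Inspect GPU/NIC topology: GPUDirect RDMA works best when the GPU and the
# InfiniBand HCA share the same PCIe switch or root complex (shown as
# PIX/PXB in the matrix, rather than crossing the QPI/SYS path).
nvidia-smi topo -m

# Check that the nv_peer_mem kernel module, required for GPUDirect RDMA
# over InfiniBand, is loaded.
lsmod | grep nv_peer_mem

# Run with NCCL debug output to confirm in the logs that the GPUDirect
# (GDR) transport path is actually selected at runtime.
NCCL_DEBUG=INFO mpirun -np 4 python train.py
```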
6.2 Evaluation Platform Setup
Table 4 shows the software stack configuration used to build the test environment.
| Software Stack | PowerEdge Servers | Non-Dell EMC Servers |
|---|---|---|
| OS | Ubuntu 16.04.4 LTS | Ubuntu 16.04.3 LTS |
| Kernel | GNU/Linux 4.4.0-128-generic x86_64 | GNU/Linux 4.4.0-130-generic x86_64 |
| NVIDIA driver | 396.26 for all servers; 390.46 for R740-P40 | 384.145 |
| Open MPI | 3.0.1 | 3.0.0 |
| CUDA | 9.1.85 | 9.0.176 |
| cuDNN | 7.1.3.16 | 7.1.4 |
| NCCL | 2.2.15 | 2.2.13 |
| Docker Container | NVIDIA TensorFlow Docker | NVIDIA TensorFlow Docker |
| Container Image - Single Node | tensorflow/tensorflow:nightly-gpu-py3 | nvcr.io/nvidia/tensorflow:18.06-py3 |
| Container Image - Multi Node | Horovod: latest | n/a |
| Benchmark scripts | tf_cnn_benchmarks | tf_cnn_benchmarks |
| Test Date - V1 | April-June 2018 | July 2018 |
| Test Date - V2 | Jan 2019 | n/a |

Table 4: OS & Driver Configurations
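As a sketch of how the single-node environment in Table 4 could be reproduced (the `docker run` flags and the `tf_cnn_benchmarks` options shown are illustrative examples, not the exact configuration used in the tests):

```shell
# Pull the single-node image listed in Table 4 for the PowerEdge servers.
docker pull tensorflow/tensorflow:nightly-gpu-py3

# Fetch the benchmark scripts referenced in Table 4.
git clone https://github.com/tensorflow/benchmarks.git

# Launch tf_cnn_benchmarks inside the container with GPU access.
# Model, batch size, and GPU count are example values only.
docker run --runtime=nvidia --rm -v "$PWD/benchmarks:/benchmarks" \
  tensorflow/tensorflow:nightly-gpu-py3 \
  python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --model=resnet50 --num_gpus=4 --batch_size=128
```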