
Deep Learning Performance: Scale-up vs Scale-out
Architectures & Technologies Dell EMC | Infrastructure Solutions Group
Figure 14 shows how, with GPUDirect RDMA, GPU memory is accessed directly over the network instead of the data being copied multiple times across system components. This feature is reflected directly in the throughput performance of the server.
Figure 14: NVIDIA GPUDirect RDMA connection. Source: https://www.sc-asia.org
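As a rough sketch of how one might verify that GPUDirect RDMA is available on a node before running the benchmarks (these commands assume an InfiniBand setup with the `nv_peer_mem` kernel module installed; `train.py` is a placeholder for the actual training script):

```shell
# Inspect GPU/NIC topology: GPUDirect RDMA works best when the GPU and the
# InfiniBand HCA share the same PCIe switch or root complex (shown as
# PIX/PXB in the matrix, rather than crossing the QPI/SYS path).
nvidia-smi topo -m

# Check that the nv_peer_mem kernel module, required for GPUDirect RDMA
# over InfiniBand, is loaded.
lsmod | grep nv_peer_mem

# Run with NCCL debug output to confirm in the logs that the GPUDirect
# (GDR) transport path is actually selected at runtime.
NCCL_DEBUG=INFO mpirun -np 4 python train.py
```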
6.2 Evaluation Platform Setup
Table 4 shows the software stack configuration used to build the test environment.
| Software Stack | PowerEdge Servers | Non-Dell EMC Servers |
|---|---|---|
| OS | Ubuntu 16.04.4 LTS | Ubuntu 16.04.3 LTS |
| Kernel | GNU/Linux 4.4.0-128-generic x86_64 | GNU/Linux 4.4.0-130-generic x86_64 |
| NVIDIA driver | 396.26 for all servers; 390.46 for R740-P40 | 384.145 |
| Open MPI | 3.0.1 | 3.0.0 |
| CUDA | 9.1.85 | 9.0.176 |
| cuDNN | 7.1.3.16 | 7.1.4 |
| NCCL | 2.2.15 | 2.2.13 |
| Docker Container | NVIDIA TensorFlow Docker | NVIDIA TensorFlow Docker |
| Container Image - Single Node | tensorflow/tensorflow:nightly-gpu-py3 | nvcr.io/nvidia/tensorflow:18.06-py3 |
| Container Image - Multi Node | Horovod: latest | n/a |
| Benchmark scripts | tf_cnn_benchmarks | tf_cnn_benchmarks |
| Test Date - V1 | April-June 2018 | July 2018 |
| Test Date - V2 | Jan 2019 | n/a |

Table 4: OS & Driver Configurations
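As a sketch of how the single-node environment in Table 4 could be reproduced (the `docker run` flags and the `tf_cnn_benchmarks` options shown are illustrative examples, not the exact configuration used in the tests):

```shell
# Pull the single-node image listed in Table 4 for the PowerEdge servers.
docker pull tensorflow/tensorflow:nightly-gpu-py3

# Fetch the benchmark scripts referenced in Table 4.
git clone https://github.com/tensorflow/benchmarks.git

# Launch tf_cnn_benchmarks inside the container with GPU access.
# Model, batch size, and GPU count are example values only.
docker run --runtime=nvidia --rm -v "$PWD/benchmarks:/benchmarks" \
  tensorflow/tensorflow:nightly-gpu-py3 \
  python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --model=resnet50 --num_gpus=4 --batch_size=128
```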