Administrator Guide

Deep Learning Performance Scale-Out
Appendix: Reproducibility
This section walks through the setup requirements for the distributed Dell EMC system
and the execution of the benchmarks. Perform the following steps on both servers:
Update the Linux kernel
Install the Linux kernel headers
Install Mellanox OFED on the local host
Set up passwordless SSH
Configure IP over InfiniBand (IPoIB)
Install CUDA with the NVIDIA driver
Install the CUDA Toolkit
Download and install GPUDirect RDMA on the local host
Check that the GPUDirect kernel module is properly loaded
Install Docker CE and the NVIDIA runtime
Build Horovod in Docker with MLNX OFED support
Check the configuration status on each server (nvidia-smi topo -m && ifconfig && ibstat && ibv_devinfo -v && ofed_info -s)
Pull the benchmark directory onto the local host
Mount the NFS drive with the ImageNet data
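The configuration check in the list above can be scripted so that a missing tool or an unloaded module is reported before any benchmark is started. The following is a minimal sketch, not part of the original procedure: the function names `check_tools` and `module_loaded` are hypothetical, and the module name `nv_peer_mem` used in the usage note is an assumption about what the GPUDirect RDMA package installs on this system.

```shell
# Report which of the named command-line tools are missing from PATH.
# Prints one "missing: <tool>" line per absent tool and returns non-zero
# if at least one tool is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Succeed if the named kernel module appears in an lsmod-style listing
# read from standard input (the module name is the first column).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}
```

On each server this could be used as `check_tools nvidia-smi ifconfig ibstat ibv_devinfo ofed_info` followed by `lsmod | module_loaded nv_peer_mem`, with the module name adjusted to whatever the installed GPUDirect package actually provides.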
Run the system as follows:
On the secondary node (run this first):
$ sudo docker run --gpus all -it --network=host -v /root/.ssh:/root/.ssh \
    --cap-add=IPC_LOCK -v /home/dell/imagenet_tfrecords/:/data/ \
    -v /home/dell/benchmarks/:/benchmarks -v /etc/localtime:/etc/localtime:ro \
    --privileged horovod:latest-mlnxofed_gpudirect-tf1.14_cuda10.0 \
    bash -c "/usr/sbin/sshd -p 50000; sleep infinity"
On the primary node:
$ sudo docker run --gpus all -it --network=host -v /root/.ssh:/root/.ssh \
    --cap-add=IPC_LOCK -v /home/dell/imagenet_tfrecords/:/data/ \
    -v /home/dell/benchmarks/:/benchmarks -v /etc/localtime:/etc/localtime:ro \
    --privileged horovod:latest-mlnxofed_gpudirect-tf1.14_cuda10.0
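Because the secondary container runs sshd on port 50000, it can be useful to confirm from inside the primary container that a non-interactive login succeeds before starting a multi-node run. The sketch below is an illustration only: the function name `check_ssh_nodes` and the `SSH_PORT` and `SSH_BIN` variables are hypothetical, and it assumes the mounted /root/.ssh keys already allow passwordless login.

```shell
# Port on which the secondary container's sshd listens (taken from the
# docker run command above) and the ssh client to invoke. Both can be
# overridden from the environment.
SSH_PORT="${SSH_PORT:-50000}"
SSH_BIN="${SSH_BIN:-ssh}"

# Attempt a non-interactive login to each host and report the result.
# Returns non-zero if at least one host could not be reached.
check_ssh_nodes() {
    local failed=0 host
    for host in "$@"; do
        if "$SSH_BIN" -p "$SSH_PORT" -o BatchMode=yes -o ConnectTimeout=5 \
               "$host" true </dev/null >/dev/null 2>&1; then
            echo "ok: $host"
        else
            echo "unreachable: $host"
            failed=1
        fi
    done
    return "$failed"
}
```

For the two-node setup in this appendix, running `check_ssh_nodes 192.168.11.2` in the primary container would test the secondary node's IPoIB address. `BatchMode=yes` makes ssh fail instead of prompting for a password, so a misconfigured key shows up as "unreachable".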
Running the benchmark in single-node mode with 4 GPUs:
$ python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --device=gpu --data_format=NCHW --optimizer=sgd --distortions=false \
    --use_fp16=True --local_parameter_device=gpu --variable_update=replicated \
    --all_reduce_spec=nccl --data_dir=/data/train --data_name=imagenet \
    --model=ResNet-50 --batch_size=256 --num_gpus=4 --xla=true
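To compare runs, the throughput figure can be pulled out of a saved benchmark log. The sketch below assumes the script's final summary contains a line of the form `total images/sec: <number>`; the function name `total_images_per_sec` is hypothetical, and the log format should be checked against the actual output of the benchmark version in use.

```shell
# Print the value from the last "total images/sec:" line of a benchmark
# log read from standard input. Prints nothing and fails if the log
# contains no such line.
total_images_per_sec() {
    awk -F': *' '
        /^total images\/sec:/ { value = $2 }
        END { if (value == "") exit 1; print value }
    '
}
```

With the benchmark output captured to a file (for example by appending `| tee run.log` to the command), `total_images_per_sec < run.log` would print the reported throughput.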
Running the benchmark in multi-node mode with 8 GPUs:
$ mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 --allow-run-as-root \
    -x NCCL_NET_GDR_LEVEL=3 -x NCCL_DEBUG_SUBSYS=NET -x NCCL_IB_DISABLE=0 \
    -mca btl_tcp_if_include ib0 -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=INFO \
    -x HOROVOD_MPI_THREADS_DISABLE=1 --bind-to none --map-by slot \
    --mca plm_rsh_args "-p 50000" \
    python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --device=gpu --data_format=NCHW --optimizer=sgd --distortions=false \
    --use_fp16=True --local_parameter_device=gpu --variable_update=horovod \
    --horovod_device=gpu --datasets_num_private_threads=4 --data_dir=/data/train \
    --data_name=imagenet --display_every=10 --model=ResNet-50 --batch_size=256 \
    --xla=True
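In the mpirun command above, `-np 8` and `-H 192.168.11.1:4,192.168.11.2:4` must stay consistent: one process per GPU, with each host contributing its GPU count as slots. The following sketch derives both values from a list of hosts, assuming every node has the same number of GPUs; the function names `build_hostlist` and `total_procs` are hypothetical helpers, not part of Open MPI or Horovod.

```shell
# Build the mpirun -H argument ("host:slots,host:slots,...") for hosts
# that each contribute the same number of GPUs (one process per GPU).
# Usage: build_hostlist <gpus_per_node> <host>...
build_hostlist() {
    local slots="$1" list="" host
    shift
    for host in "$@"; do
        list="${list:+$list,}$host:$slots"
    done
    echo "$list"
}

# Total number of MPI processes: GPUs per node times the number of hosts.
# Usage: total_procs <gpus_per_node> <host>...
total_procs() {
    local slots="$1"
    shift
    echo $(( slots * $# ))
}
```

For this appendix's two servers, `build_hostlist 4 192.168.11.1 192.168.11.2` prints `192.168.11.1:4,192.168.11.2:4` and `total_procs 4 192.168.11.1 192.168.11.2` prints `8`, matching the values passed to `-H` and `-np`.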
© 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC and other trademarks are trademarks of Dell Inc. or its
subsidiaries. Other trademarks may be trademarks of their respective owners.