
Deep Learning Performance Scale-Out

Appendix: Reproducibility

This section walks through the setup requirements for the distributed Dell EMC system and the execution of the benchmarks. Perform the following steps on both servers:

• Update Kernel on Linux

• Install Kernel Headers on Linux

• Install Mellanox OFED on the local host

• Set up passwordless SSH

• Configure the IP over InfiniBand (IPoIB)

• Install CUDA with NVIDIA driver

• Install CUDA Toolkit

• Download and install GPUDirect RDMA on the local host

• Check that the GPUDirect kernel module is properly loaded

• Install Docker CE and the NVIDIA runtime

• Build Horovod in Docker with MLNX OFED support

• Check the configuration status on each server (nvidia-smi topo -m && ifconfig && ibstat && ibv_devinfo -v && ofed_info -s)

• Pull the benchmark directory onto the local host

• Mount the NFS drive with the ImageNet data
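The status commands in the checklist above can be wrapped so that a missing tool is reported instead of silently stopping the `&&` chain. The following is a minimal sketch, not part of the original procedure; `run_check` and `check_all` are hypothetical helper names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: run_check and check_all are hypothetical helpers,
# not part of any Dell, NVIDIA, or Mellanox tooling.

# Run one status command if its tool is installed; report it otherwise.
run_check() {
    local tool="$1"
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "MISSING: $tool"
        return 1
    fi
    echo "=== $* ==="
    "$@"
}

# Run every status command from the checklist; return non-zero if any failed.
check_all() {
    local failed=0
    run_check nvidia-smi topo -m || failed=1
    run_check ifconfig           || failed=1
    run_check ibstat             || failed=1
    run_check ibv_devinfo -v     || failed=1
    run_check ofed_info -s       || failed=1
    return "$failed"
}
```

Calling `check_all` on each server runs all five commands even when one of them fails, which makes it easier to see every missing component in a single pass.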

Run the system as follows:

On the secondary node (run this first):

$ sudo docker run --gpus all -it --network=host \
    -v /root/.ssh:/root/.ssh --cap-add=IPC_LOCK \
    -v /home/dell/imagenet_tfrecords/:/data/ \
    -v /home/dell/benchmarks/:/benchmarks \
    -v /etc/localtime:/etc/localtime:ro --privileged \
    horovod:latest-mlnxofed_gpudirect-tf1.14_cuda10.0 \
    bash -c "/usr/sbin/sshd -p 50000; sleep infinity"

On the primary node:

$ sudo docker run --gpus all -it --network=host \
    -v /root/.ssh:/root/.ssh --cap-add=IPC_LOCK \
    -v /home/dell/imagenet_tfrecords/:/data/ \
    -v /home/dell/benchmarks/:/benchmarks \
    -v /etc/localtime:/etc/localtime:ro --privileged \
    horovod:latest-mlnxofed_gpudirect-tf1.14_cuda10.0
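The secondary container is started first because its sshd must be listening on port 50000 before mpirun on the primary node tries to reach it. One way to confirm this from the primary node is sketched below. `wait_for_port` is a hypothetical helper, and the sketch assumes bash (for its `/dev/tcp` feature) and the coreutils `timeout` command are available inside the container.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_port is a hypothetical helper name.
# It polls a TCP port once per second until it accepts a connection
# or the retry budget (default 30 attempts) is exhausted.
wait_for_port() {
    local host="$1" port="$2" retries="${3:-30}"
    local i
    for ((i = 0; i < retries; i++)); do
        # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
        # timeout caps each attempt at 2 seconds in case the host is unreachable.
        if timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
            return 0
        fi
        sleep 1
    done
    return 1
}
```

For example, `wait_for_port 192.168.11.2 50000 && echo "secondary sshd is reachable"` could be run before launching the multi-node benchmark. The address 192.168.11.2 is assumed here to be the secondary node; substitute whichever address that node actually uses.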

• Running the benchmark in single-node mode with 4 GPUs:

$ python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --device=gpu --data_format=NCHW --optimizer=sgd --distortions=false \
    --use_fp16=True --local_parameter_device=gpu \
    --variable_update=replicated --all_reduce_spec=nccl \
    --data_dir=/data/train --data_name=imagenet \
    --model=ResNet-50 --batch_size=256 --num_gpus=4 --xla=true

• Running the benchmark in multi-node mode with 8 GPUs:

$ mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 --allow-run-as-root \
    -x NCCL_NET_GDR_LEVEL=3 -x NCCL_DEBUG_SUBSYS=NET -x NCCL_IB_DISABLE=0 \
    -mca btl_tcp_if_include ib0 -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=INFO \
    -x HOROVOD_MPI_THREADS_DISABLE=1 --bind-to none --map-by slot \
    --mca plm_rsh_args "-p 50000" \
    python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --device=gpu --data_format=NCHW --optimizer=sgd --distortions=false \
    --use_fp16=True --local_parameter_device=gpu --variable_update=horovod \
    --horovod_device=gpu --datasets_num_private_threads=4 \
    --data_dir=/data/train --data_name=imagenet --display_every=10 \
    --model=ResNet-50 --batch_size=256 --xla=True
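In the mpirun command, the process count (`-np 8`) is the number of hosts multiplied by the slots per host (4 GPUs each), and the `-H` list repeats that slot count after each address. The sketch below shows how these two arguments could be derived for a different node count. `mpi_host_args` is a hypothetical helper name, and the sketch assumes every node has the same number of GPUs.

```shell
#!/usr/bin/env bash
# Sketch only: mpi_host_args is a hypothetical helper name.
# Usage: mpi_host_args GPUS_PER_NODE HOST [HOST...]
# Prints the "-np" and "-H" arguments for mpirun, one process per GPU.
mpi_host_args() {
    local gpus_per_node="$1"; shift
    local hosts=("$@")
    local total=$(( gpus_per_node * ${#hosts[@]} ))
    local spec="" h
    for h in "${hosts[@]}"; do
        # Append "host:slots", comma-separated after the first entry.
        spec+="${spec:+,}${h}:${gpus_per_node}"
    done
    echo "-np ${total} -H ${spec}"
}

# The two-node, 4-GPU-per-node configuration used in this appendix:
mpi_host_args 4 192.168.11.1 192.168.11.2
# prints: -np 8 -H 192.168.11.1:4,192.168.11.2:4
```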
