
As in our previous deep learning blog, we use the three most popular deep learning frameworks: NVIDIA’s fork of Caffe (NV-Caffe),
MXNet and TensorFlow. Both NV-Caffe and MXNet have been optimized for V100. TensorFlow does not yet have an official release that
supports V100, so we applied patches obtained from the TensorFlow developers to optimize it for V100 in these tests as well. For
the dataset, we again use the ILSVRC 2012 dataset, which contains 1,281,167 training images and 50,000 validation images. For the
neural network, we chose ResNet-50 because it is computationally intensive. To get the best performance, we used the CUDA 9 RC
compiler and the cuDNN library in all three frameworks, since both are optimized for V100. The testing platform is Dell EMC’s
PowerEdge C4130 server.
The C4130 server supports multiple configurations; we evaluated the PCIe GPUs (V100-PCIe and P100-PCIe) in configuration G and the
SXM2 GPUs (V100-SXM2 and P100-SXM2) in configuration K. The differences between configuration G and configuration K are shown in
Figure 1. There are two main differences: in configuration G, two x16 PCIe links connect the two CPUs to the four GPUs, whereas in
configuration K a single x16 PCIe link connects one CPU to all four GPUs; and the GPUs communicate with each other over PCIe in
configuration G but over NVLink in configuration K. The remaining hardware and software details are listed in Table 2.
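
Which interconnect a node actually exposes can be checked from the operating system before running any benchmark. The snippet below is a minimal sketch (assuming only that the nvidia-smi utility installed with the GPU driver is on the PATH); on a configuration K system the topology matrix reports NVLink (NV#) connections between GPU pairs, while on configuration G it reports PCIe paths only.

# Minimal sketch: print the GPU interconnect topology reported by the driver.
# Assumes the nvidia-smi utility (shipped with the NVIDIA driver) is on the PATH.
import subprocess

# "nvidia-smi topo -m" prints a matrix describing how each GPU pair is
# connected, e.g. NV1/NV2 for NVLink and PIX/PHB/SYS for PCIe paths.
print(subprocess.check_output(["nvidia-smi", "topo", "-m"]).decode())
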
Figure 1: Comparison between configuration G and configuration K
Table 2: The hardware configuration and software details
Platform: PowerEdge C4130 (configuration G and configuration K)
CPU: 2 x Intel Xeon E5-2690 v4 @ 2.6 GHz (Broadwell)
Memory: 256 GB DDR4 @ 2400 MHz
Disk: 9 TB HDD
GPU: V100-PCIe, V100-SXM2, P100-PCIe, P100-SXM2

Software and Firmware
Operating System: RHEL 7.3 x86_64
Linux Kernel: 3.10.0-514.26.2.el7.x86_64
BIOS: 2.4.2
CUDA compiler and GPU driver: CUDA 9.0-RC (driver 384.59)
NCCL: 2.0
Python: 2.7.5

Deep Learning Libraries and Frameworks
cuDNN: 7.0
TensorRT: 3.0.0
NV-Caffe: 0.16.3
MXNet: 0.11.0
TensorFlow: 1.2.1-rc1
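
Before collecting any numbers, it is worth confirming at runtime that each framework sees the versions and the GPUs listed in Table 2. A minimal sketch (assuming the TensorFlow and MXNet Python packages from the table are installed) is:

# Minimal sketch: confirm framework versions and GPU visibility at runtime.
# Assumes the TensorFlow and MXNet builds listed in Table 2 are installed.
from __future__ import print_function
import tensorflow as tf
import mxnet as mx
from tensorflow.python.client import device_lib

print("TensorFlow:", tf.__version__)   # expected: 1.2.1-rc1
print("MXNet:", mx.__version__)        # expected: 0.11.0

gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]
print("Visible GPUs:", gpus)           # a four-GPU C4130 should list four devices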
[Figure 1, configuration G panel: 2 CPU / 4 GPU, 2 virtual switches, 2 GPUs per CPU; each CPU connects over x16 to a PCIe Gen3 96-lane switch serving two GPUs (GPU1/GPU2 and GPU3/GPU4) plus a low-profile slot.]
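
The practical effect of the PCIe-versus-NVLink difference between the two configurations can be approximated with a simple device-to-device copy timing. The sketch below uses MXNet’s NDArray API to copy a tensor from GPU 0 to GPU 1 and reports the achieved bandwidth; the transfer size is an arbitrary choice, whether the copy takes a direct peer-to-peer path depends on the driver and framework build, and the result is not a measured number from this study.

# Minimal sketch: rough GPU0 -> GPU1 copy bandwidth using MXNet NDArrays.
# Not the benchmark used in this study; results depend on the actual topology.
import time
import mxnet as mx

size_mb = 256
n = size_mb * 1024 * 1024 // 4                  # number of float32 elements
src = mx.nd.ones((n,), ctx=mx.gpu(0))
mx.nd.waitall()                                 # finish allocation before timing

start = time.time()
for _ in range(10):
    dst = src.copyto(mx.gpu(1))                 # device-to-device copy
mx.nd.waitall()                                 # wait for the async copies to finish
elapsed = time.time() - start

print("~%.1f GB/s" % (10 * size_mb / 1024.0 / elapsed))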