White Papers
Ready Solutions Engineering Test Results
Figures 1, 2 and 3 show the Resnet50 performance and speedup on multiple V100 GPUs with Caffe2, MXNet and
TensorFlow, respectively. We draw the following conclusions from these results:
Overall, Resnet50 performance scales well across multiple V100 GPUs within one node. With 3 V100 GPUs, relative to a single V100:
o Caffe2 achieved speedups of 2.61x and 2.65x in FP32 and FP16 mode, respectively.
o MXNet achieved speedups of 2.87x and 2.82x in FP32 and FP16 mode, respectively.
o Horovod+TensorFlow achieved a speedup of 2.12x in FP32 mode (FP16 support is still under development).
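The speedups quoted above can be reproduced from the throughput values, as the following minimal sketch shows. It assumes that speedup is the N-GPU throughput divided by the single-GPU throughput, and it uses the images/sec data labels of Figures 1 and 2. The function names and the "scaling efficiency" metric (speedup divided by GPU count) are illustrative additions, not part of the original test harness.

```python
# Images/sec for 1, 2 and 3 V100 GPUs, read from the data labels of Figures 1 and 2.
throughput = {
    ("Caffe2", "FP32"): [290, 539, 756],
    ("Caffe2", "FP16"): [527, 1014, 1394],
    ("MXNet", "FP32"): [351, 686, 1006],
    ("MXNet", "FP16"): [653, 1303, 1839],
}


def speedup(series):
    """Throughput on N GPUs relative to throughput on 1 GPU (assumed definition)."""
    base = series[0]
    return [round(x / base, 2) for x in series]


def scaling_efficiency(series):
    """Speedup divided by GPU count; 1.0 would be perfect linear scaling.

    This metric is an illustrative addition, not a figure reported in the text.
    """
    base = series[0]
    return [round(x / (base * n), 2) for n, x in enumerate(series, start=1)]


if __name__ == "__main__":
    for (framework, mode), series in throughput.items():
        print(framework, mode, speedup(series), scaling_efficiency(series))
```

Under these assumptions the computed 3-GPU speedups match the values in the list (2.61x and 2.65x for Caffe2, 2.87x and 2.82x for MXNet). The TensorFlow series is omitted because its throughput values are not listed in this section.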
FP16 mode is around 80%-90% faster than FP32 mode for both Caffe2 and MXNet. TensorFlow does not yet
support FP16, so we will test its FP16 performance once this feature is available.
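The 80%-90% range can be checked against the per-GPU-count throughput values with a short calculation. This is a sketch only: it assumes "faster" means the relative increase in images/sec, (FP16 / FP32 - 1) x 100, and the function name is hypothetical.

```python
# Images/sec for 1, 2 and 3 V100 GPUs, read from the data labels of Figures 1 and 2.
fp32 = {"Caffe2": [290, 539, 756], "MXNet": [351, 686, 1006]}
fp16 = {"Caffe2": [527, 1014, 1394], "MXNet": [653, 1303, 1839]}


def fp16_gain_percent(framework):
    """Relative FP16 throughput gain over FP32, in percent, for each GPU count."""
    return [
        round((half / single - 1.0) * 100.0, 1)
        for single, half in zip(fp32[framework], fp16[framework])
    ]


if __name__ == "__main__":
    for framework in fp32:
        print(framework, fp16_gain_percent(framework))
```

With this definition, every Caffe2 and MXNet configuration falls between 80% and 90%.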
Figure 1: Caffe2: Performance and speedup of V100
Caffe2 Resnet50 (axes: Images/sec and Speedup)
                1 V100   2 V100   3 V100
perf-FP32          290      539      756
perf-FP16          527     1014     1394
speedup-FP32         1     1.86     2.61
speedup-FP16         1     1.92     2.65

Figure 2: MXNet: Performance and speedup of V100
MXNet Resnet50 (axes: Images/sec and Speedup)
                1 V100   2 V100   3 V100
perf-FP32          351      686     1006
perf-FP16          653     1303     1839
speedup-FP32         1     1.95     2.87
speedup-FP16         1     2.00     2.82