White Papers
Dell - Internal Use - Confidential
Figure 2: The performance of inference with GoogLeNet on P40 and P4
Figure 3: P40 vs P4 for AlexNet with different batch sizes
In our previous blog, we compared inference performance using FP32 and INT8 and concluded that INT8 is ~3x faster than FP32. In this study, we also compare the accuracy of the two precisions to verify that INT8 achieves accuracy comparable to FP32. We used the latest TensorRT 2.1 GA release for this benchmarking. To make the INT8 data encode as much of the information in the FP32 data as possible, TensorRT applies a calibration method that converts FP32 to INT8 while minimizing the loss of information. More details on this calibration method can be found in the presentation “8-bit
[Chart: “P40 vs P4 for GoogLeNet on R740 (batch_size=128)”; x-axis: 1 to 4 GPUs; axes: Relative Images/sec and Relative Images/sec/Watt; series: Perf - P40, Perf - P4, Perf/Watt - P40, Perf/Watt - P4.]
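The calibration step described above can be illustrated conceptually. The sketch below is NOT the TensorRT API; it shows the underlying idea of symmetric linear quantization with a saturation threshold. TensorRT's entropy calibration selects a per-tensor threshold T such that mapping FP32 values into the INT8 range [-127, 127] loses as little information as possible; the `threshold` parameter here is a stand-in for that calibrated value.

```python
def quantize_int8(values, threshold):
    """Map FP32 values to INT8 with scale = threshold / 127.

    Values outside [-threshold, threshold] saturate at -127/127,
    mirroring how calibrated quantization clamps outliers.
    """
    scale = threshold / 127.0
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-127, min(127, q))  # saturate beyond the threshold
        out.append(int(q))
    return out

def dequantize_int8(qvalues, threshold):
    """Recover approximate FP32 values from the INT8 encoding."""
    scale = threshold / 127.0
    return [q * scale for q in qvalues]

# Sample activations; 3.0 lies outside the threshold and saturates.
fp32 = [0.0, 0.6, -1.1, 3.0]
q = quantize_int8(fp32, threshold=2.0)
approx = dequantize_int8(q, threshold=2.0)
```

Values within the threshold round-trip with small error, while outliers are clamped; entropy calibration chooses the threshold that best trades clipping error against rounding error across the tensor's observed distribution.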
[Chart: “P40 vs P4 for AlexNet (with 1 GPU)”; x-axis: batch size from 1 to 4096; axes: Relative Images/sec and Relative Images/sec/Watt; series: P40, P4, Perf/Watt-P40, Perf/Watt-P4.]