
[Chart for Figure 5: P40 vs M40 for AlexNet with TensorRT (batch size = 128). Images/sec (higher is better) by operations mode: M40 FP32: 3200; P40 FP32: 5198; P40 INT8: 16292]
Figure 5: Inference performance comparison between P40 and M40
Deep learning inference can be applied in different scenarios. Some scenarios require a large batch size, while others require no batching at all (i.e., a batch size of 1). We therefore also measured how performance varies with batch size; the results are shown in Figure 6. Note that the purpose here is not to compare the performance of GoogLeNet and AlexNet; rather, it is to examine how performance changes with batch size for each neural network. It can be seen that without batching the inference performance is very low, because the GPU is not given enough work to keep it busy. The larger the batch size, the higher the inference throughput, although the gains diminish: AlexNet, for example, speeds up about 20x from batch size 1 to 128, but only about 10% more from 128 to 4096. At a batch size of 4096, GoogLeNet failed to run because the GPU memory it requires exceeds the memory available on the GPU. AlexNet was still able to run because it is a less complex network than GoogLeNet and therefore needs less GPU memory. The largest usable batch size is thus limited only by GPU memory.
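
To make the batch-size sweep concrete, the following is a minimal Python sketch of the measurement methodology. The helper names (measure_throughput, run_batch) are hypothetical, and the placeholder workload merely stands in for a real TensorRT execution context (in the TensorRT releases of this era, the maximum batch size is fixed when the engine is built); this is a sketch under those assumptions, not the benchmark harness actually used here.

import time

def measure_throughput(run_batch, batch_size, n_batches=50):
    """Return throughput in images/sec for one batch size."""
    run_batch(batch_size)                  # warm-up pass, not timed
    start = time.perf_counter()
    for _ in range(n_batches):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    return batch_size * n_batches / elapsed

if __name__ == "__main__":
    # Placeholder workload so the sketch runs anywhere; a real benchmark
    # would invoke the TensorRT execution context here, on an engine
    # built with a max batch size of at least the value being tested.
    def run_batch(batch_size):
        sum(i * i for i in range(batch_size * 100))

    for bs in (1, 32, 64, 128, 256, 512, 1024, 2048, 4096):
        ips = measure_throughput(run_batch, bs)
        print(f"batch size {bs:4d}: {ips:12.1f} images/sec")

Timing only repeated inference calls (after a warm-up pass) keeps one-time costs such as engine construction out of the images/sec figure, which is what makes the per-batch-size comparison in Figure 6 meaningful.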
Figure 6: Inference performance with different batch sizes
[Chart data for Figure 6: 1x P40, INT8, AlexNet and GoogLeNet. Images/sec (higher is better) by batch size:
Batch size   1     32      64      128     256     512     1024    2048    4096
AlexNet      791   11464   14438   16292   17101   17502   17768   17951   18011
GoogLeNet    594   5417    6059    6410    6630    6763    6822    6842    (out of GPU memory)]