Table 1: Hardware configuration and software details

Platform: PowerEdge C4130 (configuration G)
Processor: 2 x Intel Xeon CPU E5-2690 v4 @ 2.6 GHz (Broadwell)
Memory: 256 GB DDR4 @ 2400 MHz
Disk: 400 GB SSD
GPU: 4 x Tesla P40 with 24 GB GPU memory

Software and Firmware:
Operating System: Ubuntu 14.04
BIOS: 2.3.3
CUDA and driver version: 8.0.44 (driver 375.20)
TensorRT version: 2.0 EA
Table 2: Comparison between Tesla M40 and P40

                   Tesla M40    Tesla P40
INT8 (TIOP/s)      N/A          47.0
FP32 (TFLOP/s)     6.8          11.8
Performance Evaluation
In this section, we present the inference performance of TensorRT on GoogLeNet and AlexNet. We also implemented the benchmark with MPI so that it can be run on multiple P40 GPUs within a single node. We then compare the performance of the P40 with that of the M40, and lastly we show the performance impact of different batch sizes.
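
As a minimal sketch of the multi-GPU pattern (not the paper's actual benchmark code), the example below launches one MPI rank per GPU: each rank pins itself to a device with cudaSetDevice, runs its own timed inference loop, and rank 0 sums the per-rank throughputs with MPI_Reduce. The variable localImagesPerSec is a hypothetical placeholder for whatever throughput each rank measures.

```cpp
// Minimal sketch: one MPI rank per GPU within a node.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Bind this rank to one of the GPUs in the node (e.g., 4x P40).
    int numGpus = 0;
    cudaGetDeviceCount(&numGpus);
    cudaSetDevice(rank % numGpus);

    // ... build the TensorRT engine and time the inference loop here ...
    double localImagesPerSec = 0.0;  // hypothetical: measured by this rank

    // Aggregate throughput across all ranks/GPUs on rank 0.
    double totalImagesPerSec = 0.0;
    MPI_Reduce(&localImagesPerSec, &totalImagesPerSec, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        printf("Aggregate throughput: %.1f images/sec\n", totalImagesPerSec);
    }

    MPI_Finalize();
    return 0;
}
```

Launched with, for example, mpirun -np 4, this maps one rank to each of the four P40s in the C4130.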
Figure 1 shows the inference performance with the TensorRT library for both GoogLeNet and AlexNet. We
can see that INT8 mode is ~3x faster than FP32 for both neural networks. This is expected, since the
theoretical speedup of INT8 over FP32 on the P40 is 4x (47.0 TIOP/s vs. 11.8 TFLOP/s in Table 2), and
that peak assumes only multiplications are performed with no other overhead. In practice there are
kernel launches, occupancy limits, data movement, and math other than multiplications, so the observed
speedup is reduced to about 3x.
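
The INT8 results depend on building the engine in INT8 mode. The snippet below is a minimal sketch assuming the TensorRT 2.x C++ builder API (setInt8Mode and setInt8Calibrator), not the paper's benchmark code; buildInt8Engine is an illustrative name, and the calibrator argument stands for a user-supplied implementation of nvinfer1::IInt8Calibrator that TensorRT queries during calibration to choose quantization scale factors.

```cpp
// Sketch: building an INT8 engine with the TensorRT 2.x builder API.
#include "NvInfer.h"

nvinfer1::ICudaEngine* buildInt8Engine(nvinfer1::IBuilder& builder,
                                       nvinfer1::INetworkDefinition& network,
                                       nvinfer1::IInt8Calibrator* calibrator,
                                       int maxBatchSize) {
    builder.setMaxBatchSize(maxBatchSize);    // largest batch the engine will serve
    builder.setMaxWorkspaceSize(1 << 30);     // scratch space available to layer kernels
    builder.setInt8Mode(true);                // request INT8 kernels (supported on P40)
    builder.setInt8Calibrator(calibrator);    // calibration data -> per-tensor scales
    return builder.buildCudaEngine(network);  // falls back to FP32 where INT8 is unavailable
}
```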