White Papers
Dell - Internal Use - Confidential
Introduction to NVIDIA TensorRT
NVIDIA TensorRT™, previously called GIE (GPU Inference Engine), is a high-performance deep learning
inference engine for production deployment of deep learning applications that maximizes inference
throughput and efficiency. TensorRT lets users take advantage of the fast reduced-precision
instructions available in Pascal GPUs. TensorRT v2 supports the new INT8 operations available on
both the P40 and P4 GPUs, and to the best of our knowledge it is the only inference library that
supports INT8 to date.
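To make the reduced-precision idea concrete, the sketch below shows a minimal symmetric INT8 quantization scheme. This is an illustration only: the per-tensor scale choice and the round-and-clip mapping are simplified assumptions, not TensorRT's actual calibration algorithm.

```python
def quantize_int8(x, scale):
    # Map a float value into the signed 8-bit range [-127, 127].
    q = round(x / scale)
    return max(-127, min(127, q))

def dequantize(q, scale):
    # Recover an approximate float value from the INT8 code.
    return q * scale

weights = [0.5, -1.2, 0.03, 0.9]
# Simplified per-tensor scale: map the largest magnitude to 127.
scale = max(abs(w) for w in weights) / 127.0
q = [quantize_int8(w, scale) for w in weights]
approx = [dequantize(v, scale) for v in q]
print(q)       # 8-bit codes
print(approx)  # reconstructed values, each within scale/2 of the original
```

The point of the exercise is that INT8 arithmetic trades a small, bounded reconstruction error for much higher instruction throughput and lower memory traffic on hardware that supports it.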
Testing Methodology
This blog quantifies the performance of deep learning inference using NVIDIA TensorRT on a
PowerEdge R740 server, which supports up to three Tesla P40 GPUs or four Tesla P4 GPUs. Table 2
shows the hardware and software details. The inference benchmark we used was giexec from the
TensorRT sample code. This sample uses synthetic images, filled with random non-zero values to
simulate real images. Two classic neural networks were tested: AlexNet (the 2012 ImageNet winner)
and GoogLeNet (the 2014 ImageNet winner), which is much deeper and more complex than AlexNet.
We measured inference performance in images/sec, i.e., the number of images that can be
processed per second.
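As a concrete illustration of the metric, the snippet below computes images/sec from the batch size, the number of batches processed, and the elapsed time. The numbers are hypothetical, chosen only to show the arithmetic, not measured results from this study.

```python
def images_per_sec(batch_size, num_batches, elapsed_s):
    # Throughput = total images processed / total wall-clock seconds.
    return batch_size * num_batches / elapsed_s

# Hypothetical run: 100 batches of 128 images completed in 4.0 seconds.
print(images_per_sec(128, 100, 4.0))  # -> 3200.0 images/sec
```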
Table 2: Hardware configuration and software details

Platform: PowerEdge R740
Processor: 2x Intel Xeon Gold 6150
Memory: 192 GB DDR4 @ 2667 MHz
Disk: 400 GB SSD
Shared storage: 9 TB NFS through IPoIB on EDR InfiniBand
GPU: 3x Tesla P40 (24 GB GPU memory each), or 4x Tesla P4 (8 GB GPU memory each)

Software and Firmware
Operating System: RHEL 7.2
BIOS: 0.58 (beta version)
CUDA and driver version: 8.0.44 (375.20)
NVIDIA TensorRT version: 2.0 EA and 2.1 GA
Performance Evaluation
In this section, we present the inference performance of NVIDIA TensorRT on GoogLeNet and
AlexNet. We also implemented the benchmark with MPI so that it can run on multiple GPUs within a
server. Figure 1 and Figure 2 show the inference performance of AlexNet and GoogLeNet on up to
three P40s and four P4s in one R740 server; a batch size of 128 was used in both figures. The power
consumption of each configuration was also measured, and the energy efficiency of the configurations is