White Papers
Dell - Internal Use - Confidential
Introduction to NVIDIA TensorRT
NVIDIA TensorRT™, previously called GIE (GPU Inference Engine), is a high-performance deep learning
inference engine for production deployment of deep learning applications that maximizes inference
throughput and efficiency. TensorRT lets users take advantage of the fast reduced-precision
instructions available in Pascal GPUs. TensorRT v2 supports the new INT8 operations available on
both the P40 and P4 GPUs, and to the best of our knowledge it is the only inference library that
supports INT8 to date.
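To make the reduced-precision idea concrete, the sketch below shows a minimal symmetric INT8 quantization scheme. This is an illustration only: the per-tensor scale choice and the round-and-clip mapping are simplified assumptions, not TensorRT's actual calibration algorithm.

```python
def quantize_int8(x, scale):
    # Map a float value into the signed 8-bit range [-127, 127].
    q = round(x / scale)
    return max(-127, min(127, q))

def dequantize(q, scale):
    # Recover an approximate float value from the INT8 code.
    return q * scale

weights = [0.5, -1.2, 0.03, 0.9]
# Simplified per-tensor scale: map the largest magnitude to 127.
scale = max(abs(w) for w in weights) / 127.0
q = [quantize_int8(w, scale) for w in weights]
approx = [dequantize(v, scale) for v in q]
print(q)       # 8-bit codes
print(approx)  # reconstructed values, each within scale/2 of the original
```

The point of the exercise is that INT8 arithmetic trades a small, bounded reconstruction error for much higher instruction throughput and lower memory traffic on hardware that supports it.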
Testing Methodology
This blog quantifies the performance of deep learning inference using NVIDIA TensorRT on a
PowerEdge R740 server, which supports up to three Tesla P40 GPUs or four Tesla P4 GPUs. Table 2
shows the hardware and software details. The inference benchmark we used was giexec from the
TensorRT sample code. This sample uses synthetic images, filled with random non-zero values to
simulate real images. Two classic neural networks were tested: AlexNet (the 2012 ImageNet winner)
and GoogLeNet (the 2014 ImageNet winner), which is much deeper and more complex than AlexNet.
We measured inference performance in images/sec, i.e., the number of images that can be
processed per second.
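As a concrete illustration of the metric, the snippet below computes images/sec from the batch size, the number of batches processed, and the elapsed time. The numbers are hypothetical, chosen only to show the arithmetic, not measured results from this study.

```python
def images_per_sec(batch_size, num_batches, elapsed_s):
    # Throughput = total images processed / total wall-clock seconds.
    return batch_size * num_batches / elapsed_s

# Hypothetical run: 100 batches of 128 images completed in 4.0 seconds.
print(images_per_sec(128, 100, 4.0))  # -> 3200.0 images/sec
```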
Table 2: Hardware configuration and software details

Platform: PowerEdge R740
Processor: 2x Intel Xeon Gold 6150
Memory: 192 GB DDR4 @ 2667 MHz
Disk: 400 GB SSD
Shared storage: 9 TB NFS through IPoIB on EDR InfiniBand
GPU: 3x Tesla P40 (24 GB GPU memory each), or 4x Tesla P4 (8 GB GPU memory each)

Software and Firmware
Operating System: RHEL 7.2
BIOS: 0.58 (beta version)
CUDA and driver version: 8.0.44 (375.20)
NVIDIA TensorRT version: 2.0 EA and 2.1 GA
Performance Evaluation
In this section, we present the inference performance of NVIDIA TensorRT on GoogLeNet and
AlexNet. We also implemented the benchmark with MPI so that it can run on multiple GPUs within a
server. Figure 1 and Figure 2 show the inference performance of AlexNet and GoogLeNet on up to
three P40s and four P4s in one R740 server; a batch size of 128 was used in both figures. The power
consumption of each configuration was also measured, and the energy efficiency of the configurations is