CheXNet Inference with Nvidia T4 on Dell EMC PowerEdge R7425
1 Background & Definitions
Deploying AI applications in production often requires high throughput at low
latency. Models are generally trained in 32-bit floating point (fp32) precision but
need to be deployed for inference at a lower precision without a significant loss of accuracy.
Using a lower-bit precision such as 8-bit integer (int8) yields higher throughput because of its
lower memory and compute requirements. To address this, Nvidia developed the TensorRT inference
optimization tool: it minimizes the loss of accuracy when quantizing trained model weights to int8,
and during int8 computation of the activations it generates an inference graph with optimal
fp32-to-int8 scaling factors. We walk through the inference optimization process with a custom
model, covering the key components of this project described in the sections
below. See Figure 1.
Figure 1: Inference Implementation
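As a concrete illustration of this workflow, the sketch below uses the TF-TRT integration
(TensorFlow 2.x) to convert an fp32 SavedModel into an int8-optimized inference graph. The model
path, input shape, and calibration generator are illustrative assumptions, not artifacts of the
original project; in practice the calibration batches would come from representative chest X-ray
images rather than random data.

# Minimal TF-TRT sketch (TensorFlow 2.x): convert an fp32 SavedModel to an
# int8-optimized inference graph. Paths and calibration data are placeholders.
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

def calibration_input_fn():
    # Yield a few representative batches so TF-TRT can compute the
    # fp32 -> int8 scaling factors for the activations.
    for _ in range(8):
        yield (np.random.random_sample((1, 224, 224, 3)).astype(np.float32),)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="chexnet_savedmodel",   # fp32 model (placeholder path)
    precision_mode=trt.TrtPrecisionMode.INT8,
    use_calibration=True)

converter.convert(calibration_input_fn=calibration_input_fn)  # runs int8 calibration
converter.save("chexnet_savedmodel_trt_int8")                 # optimized SavedModel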
Deep learning
Deep Learning (DL) is a subfield of Artificial Intelligence (AI) and Machine Learning (ML) based
on methods that learn data representations. Deep learning architectures such as convolutional
neural networks (CNN) and recurrent neural networks (RNN), among others, have been
successfully applied to computer vision, speech recognition, and machine translation,
producing results comparable to those of human experts.
TensorFlow
Nvidia TensorRT
CheXNet Model
Dell EMC PowerEdge R7425