CheXNet Inference with Nvidia T4 on Dell EMC PowerEdge R7425
1 Background & Definitions
Deploying AI applications in production often requires high throughput at low
latency. Models are generally trained in 32-bit floating point (fp32) precision but
need to be deployed for inference at a lower precision without a significant loss of accuracy.
Using a lower-bit precision such as 8-bit integer (int8) yields higher throughput because of its
lower memory and compute requirements. To address this, Nvidia developed the TensorRT inference
optimization tool: it minimizes the loss of accuracy when quantizing trained model weights to int8,
and during int8 computation of the activations it generates an inference graph with optimal
fp32-to-int8 scaling factors. We walk through the inference optimization process with a custom
model, covering the key components of this project described in the sections
below. See Figure 1.
Figure 1: Inference Implementation
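As a concrete illustration of this workflow, the sketch below uses the TF-TRT integration
(TensorFlow 2.x) to convert an fp32 SavedModel into an int8-optimized inference graph. The model
path, input shape, and calibration generator are illustrative assumptions, not artifacts of the
original project; in practice the calibration batches would come from representative chest X-ray
images rather than random data.

# Minimal TF-TRT sketch (TensorFlow 2.x): convert an fp32 SavedModel to an
# int8-optimized inference graph. Paths and calibration data are placeholders.
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

def calibration_input_fn():
    # Yield a few representative batches so TF-TRT can compute the
    # fp32 -> int8 scaling factors for the activations.
    for _ in range(8):
        yield (np.random.random_sample((1, 224, 224, 3)).astype(np.float32),)

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="chexnet_savedmodel",   # fp32 model (placeholder path)
    precision_mode=trt.TrtPrecisionMode.INT8,
    use_calibration=True)

converter.convert(calibration_input_fn=calibration_input_fn)  # runs int8 calibration
converter.save("chexnet_savedmodel_trt_int8")                 # optimized SavedModel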
Deep learning
Deep Learning (DL) is a subfield of Artificial Intelligence (AI) and Machine Learning (ML) based
on methods that learn data representations. Deep learning architectures such as convolutional
neural networks (CNN) and recurrent neural networks (RNN), among others, have been
successfully applied to computer vision, speech recognition, and machine translation,
producing results comparable to those of human experts.
TensorFlow
Nvidia TensorRT
CheXNet Model
Dell EMC PowerEdge R7425