CheXNet – Inference with Nvidia T4 on Dell EMC PowerEdge R7425
• Using TF-TRT-FP32 with TensorRT™ (batch size=2) instead of Native TensorFlow FP32 without TensorRT™ improved throughput by ~92% (272 vs 142) while keeping the ~7ms latency target (see the conversion sketch after this list).
• Using TF-TRT-FP16 with TensorRT™ (batch size=4) improved throughput by ~362% (656 vs 142) and also reduced latency by ~11% (6.3ms versus 7.1ms).
• Using TF-TRT-INT8 (batch size=8) delivered the largest throughput gain while keeping the ~7ms latency target: a speedup of ~802% (1281 vs 142).
• Comparing the TF-TRT-INT8 integration against the Native TensorRT-INT8 C++ API (batch size=8), the native API was slightly faster, by ~7% (1371 vs 1281).
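As a rough illustration of the conversion step behind these results, the sketch below selects a TF-TRT precision mode (FP32, FP16, or INT8) when converting a SavedModel. It assumes a recent TensorFlow 2.x build with TensorRT support; the model paths, batch size, input shape, and calibration loop are illustrative assumptions, and the exact TensorFlow version and conversion API used for these experiments may differ.

# Sketch: converting the CheXNet SavedModel with TF-TRT at a chosen precision.
# Paths, batch size, and input shape below are illustrative assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

SAVED_MODEL_DIR = "chexnet_fp32_saved_model"   # assumed path to the native FP32 model
OUTPUT_DIR = "chexnet_tftrt_int8"              # assumed output path
PRECISION = trt.TrtPrecisionMode.INT8          # FP32, FP16, or INT8

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=SAVED_MODEL_DIR,
    precision_mode=PRECISION,
)

if PRECISION == trt.TrtPrecisionMode.INT8:
    # INT8 requires a calibration pass over representative inputs so TensorRT
    # can choose quantization ranges; random data is used here only as a stand-in.
    def calibration_input_fn():
        for _ in range(8):
            yield (tf.constant(np.random.random((8, 224, 224, 3)).astype(np.float32)),)
    converter.convert(calibration_input_fn=calibration_input_fn)
else:
    converter.convert()

converter.save(OUTPUT_DIR)

For FP32 and FP16 no calibration is needed; the precision mode alone determines the TensorRT engines that TF-TRT builds for the supported subgraphs.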
It is important to highlight that other implementation factors can also affect end-to-end inference speed when these models are deployed into production; model optimization is only one of those factors, and we have demonstrated here how to apply it.
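For reference, throughput and latency figures such as those above are typically obtained by timing repeated batched inference on the converted model after a warm-up phase. The following is a minimal timing sketch under assumed paths, batch size, input shape, and signature input name; it is not the harness used for the published numbers.

# Sketch: timing batched inference on a converted TF-TRT SavedModel.
import time
import numpy as np
import tensorflow as tf

BATCH = 8                                              # assumed batch size
model = tf.saved_model.load("chexnet_tftrt_int8")      # assumed path from the conversion step
infer = model.signatures["serving_default"]
images = tf.constant(np.random.random((BATCH, 224, 224, 3)).astype(np.float32))

# The input name "input_1" is an assumption; check infer.structured_input_signature
# for the name actually exported with the model.
for _ in range(20):                                    # warm-up: builds/loads the TRT engines
    infer(input_1=images)

latencies = []
for _ in range(200):
    start = time.perf_counter()
    out = infer(input_1=images)
    _ = list(out.values())[0].numpy()                  # pull result to host so GPU work is finished
    latencies.append(time.perf_counter() - start)

mean_latency = sum(latencies) / len(latencies)
print(f"mean latency: {mean_latency * 1e3:.1f} ms, "
      f"throughput: {BATCH / mean_latency:.0f} images/s")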