CheXNet – Inference with Nvidia T4 on Dell EMC PowerEdge R7425
• Using TF-TRT-FP32 with TensorRT™ (batch size=2) instead of Native TensorFlow FP32 without TensorRT™ improved throughput by ~92% (272 vs 142) while keeping the ~7ms latency target (see the conversion sketch after this list).
• Using TF-TRT-FP16 with TensorRT™ (batch size=4) improved throughput by ~362% (656 vs 142) and also reduced latency by ~11% (6.3ms versus 7.1ms).
• Using TF-TRT-INT8 (batch size=8) delivered the largest throughput gain while keeping the ~7ms latency target: a speedup of ~802% (1281 vs 142).
• Comparing the TF-TRT-INT8 integration against the Native TensorRT-INT8 C++ API (batch size=8), the native API was slightly faster, by ~7% (1371 vs 1281).
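As a rough illustration of the conversion step behind these results, the sketch below selects a TF-TRT precision mode (FP32, FP16, or INT8) when converting a SavedModel. It assumes a recent TensorFlow 2.x build with TensorRT support; the model paths, batch size, input shape, and calibration loop are illustrative assumptions, and the exact TensorFlow version and conversion API used for these experiments may differ.

# Sketch: converting the CheXNet SavedModel with TF-TRT at a chosen precision.
# Paths, batch size, and input shape below are illustrative assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

SAVED_MODEL_DIR = "chexnet_fp32_saved_model"   # assumed path to the native FP32 model
OUTPUT_DIR = "chexnet_tftrt_int8"              # assumed output path
PRECISION = trt.TrtPrecisionMode.INT8          # FP32, FP16, or INT8

converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=SAVED_MODEL_DIR,
    precision_mode=PRECISION,
)

if PRECISION == trt.TrtPrecisionMode.INT8:
    # INT8 requires a calibration pass over representative inputs so TensorRT
    # can choose quantization ranges; random data is used here only as a stand-in.
    def calibration_input_fn():
        for _ in range(8):
            yield (tf.constant(np.random.random((8, 224, 224, 3)).astype(np.float32)),)
    converter.convert(calibration_input_fn=calibration_input_fn)
else:
    converter.convert()

converter.save(OUTPUT_DIR)

For FP32 and FP16 no calibration is needed; the precision mode alone determines the TensorRT engines that TF-TRT builds for the supported subgraphs.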
It is important to highlight that other implementation factors can also affect end-to-end inference speed when these models are deployed into production; model optimization is only one of those factors, and we have demonstrated here how to apply it.
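For reference, throughput and latency figures such as those above are typically obtained by timing repeated batched inference on the converted model after a warm-up phase. The following is a minimal timing sketch under assumed paths, batch size, input shape, and signature input name; it is not the harness used for the published numbers.

# Sketch: timing batched inference on a converted TF-TRT SavedModel.
import time
import numpy as np
import tensorflow as tf

BATCH = 8                                              # assumed batch size
model = tf.saved_model.load("chexnet_tftrt_int8")      # assumed path from the conversion step
infer = model.signatures["serving_default"]
images = tf.constant(np.random.random((BATCH, 224, 224, 3)).astype(np.float32))

# The input name "input_1" is an assumption; check infer.structured_input_signature
# for the name actually exported with the model.
for _ in range(20):                                    # warm-up: builds/loads the TRT engines
    infer(input_1=images)

latencies = []
for _ in range(200):
    start = time.perf_counter()
    out = infer(input_1=images)
    _ = list(out.values())[0].numpy()                  # pull result to host so GPU work is finished
    latencies.append(time.perf_counter() - start)

mean_latency = sum(latencies) / len(latencies)
print(f"mean latency: {mean_latency * 1e3:.1f} ms, "
      f"throughput: {BATCH / mean_latency:.0f} images/s")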