Command line to execute the Native TensorRT™ C++ API benchmark:
./trtexec
--uff=/home/dell/chest-x-ray/output_convert_to_uff/chexnet_frozen_graph_1541777429.uff
--output=chexnet_sigmoid_tensor
--uffInput=input_tensor,3,256,256
--iterations=40 --int8 --batch=1
--device=0
--avgRuns=100
Where:
--uff: UFF file location
--output: output tensor name
--uffInput: input tensor name and its dimensions for the UFF parser (in CHW format)
--iterations: run N iterations
--int8: run in INT8 precision mode
--batch: set the batch size
--device: set the CUDA device to N
--avgRuns: set avgRuns to N; performance is measured as the average over N runs
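As a minimal sketch of driving this benchmark programmatically, the Python snippet below wraps the same trtexec invocation and sweeps the batch size. The UFF path and tensor names are copied from the command above; the batch-size sweep itself is our assumption for illustration, not part of the paper's procedure.

import subprocess

# Assumed sweep over batch sizes; the path and tensor names come from the
# trtexec command documented above.
UFF = ("/home/dell/chest-x-ray/output_convert_to_uff/"
       "chexnet_frozen_graph_1541777429.uff")

for batch in (1, 2, 4, 8):
    cmd = [
        "./trtexec",
        f"--uff={UFF}",
        "--output=chexnet_sigmoid_tensor",
        "--uffInput=input_tensor,3,256,256",
        "--iterations=40",
        "--int8",
        f"--batch={batch}",
        "--device=0",
        "--avgRuns=100",
    ]
    # Capture stdout so the per-run averages can be parsed afterwards.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(f"batch={batch}")
    print(result.stdout)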
Sample script output:
Average over 100 runs is 1.4675 ms (host walltime is 1.57855 ms, 99% percentile time is 1.54624).
Average over 100 runs is 1.48153 ms (host walltime is 1.59364 ms, 99% percentile time is 1.5831).
Average over 100 runs is 1.4899 ms (host walltime is 1.6021 ms, 99% percentile time is 1.58061).
Average over 100 runs is 1.47487 ms (host walltime is 1.58658 ms, 99% percentile time is 1.56506).
Average over 100 runs is 1.47848 ms (host walltime is 1.59125 ms, 99% percentile time is 1.56266).
Average over 100 runs is 1.48204 ms (host walltime is 1.59392 ms, 99% percentile time is 1.57078).
Average over 100 runs is 1.48219 ms (host walltime is 1.59398 ms, 99% percentile time is 1.5673).
󰇡


󰇢 󰇡


󰇛

󰇜
󰇢  󰇡

󰇢  
Latency (msec): 1.48219
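The latency value above matches the final "Average over 100 runs" line in the sample output. As a hedged sketch (not part of the paper's tooling), the snippet below parses such lines with a regular expression and reports the last average; the helper name and usage are our own.

import re

# Hedged sketch: parse trtexec "Average over N runs" lines like the
# sample output above.
LINE_RE = re.compile(
    r"Average over (\d+) runs is ([0-9.]+) ms "
    r"\(host walltime is ([0-9.]+) ms, 99% percentile time is ([0-9.]+)\)"
)

def parse_averages(output: str) -> list:
    """Return every per-run average (ms) found in trtexec output."""
    return [float(m.group(2)) for m in LINE_RE.finditer(output)]

sample = ("Average over 100 runs is 1.48219 ms (host walltime is "
          "1.59398 ms, 99% percentile time is 1.5673).")
averages = parse_averages(sample)
print(f"Latency (msec): {averages[-1]}")  # -> Latency (msec): 1.48219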
In Figure 18 we observed that CheXNet inference optimized with the native TRT5 C++ API
performed ~2X faster than with the TF-TRT integration API. This advantage appeared only
at batch sizes 1 and 2; the margin of the TRT5 C++ API over the TF-TRT API gradually
decreased as the batch size increased. We are still working with the Nvidia developer
group to determine the expected performance of both API implementations.
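To make the speedup arithmetic concrete, here is a small worked example; the latency values are placeholders invented for illustration, not the measured data behind Figure 18.

# Illustrative only: placeholder latencies in ms, not Figure 18's data.
# Speedup is the ratio of TF-TRT latency to native TRT5 C++ API latency.
tf_trt_ms = {1: 3.0, 2: 3.2, 4: 4.8, 8: 8.5}  # hypothetical
trt5_ms = {1: 1.5, 2: 1.6, 4: 3.4, 8: 7.2}    # hypothetical

for batch in sorted(trt5_ms):
    speedup = tf_trt_ms[batch] / trt5_ms[batch]
    print(f"batch={batch}: TRT5 C++ API is {speedup:.2f}x faster")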
Further, Figure 19 shows the latency curves of the TRT5 C++ API versus the TF-TRT
API; lower latency is better, and the TRT5 C++ API achieved the lower latency.
5.7 CheXNet Inference Throughput with TensorRT™ at ~7ms Latency Target
The ~7ms latency target is critical, mainly for real-time applications. In this section we
selected all configurations that performed at that latency target; see Table 8 below for the
selected tests. We also included the TensorFlow-FP32-CPU-only inference as a reference,
since its latency was ~115ms.
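As a hedged sketch of the selection described above, the snippet below filters hypothetical (configuration, batch size, latency) entries to the ~7ms target and derives throughput as batch size divided by latency. The entries are placeholders, not the contents of Table 8.

# Hypothetical entries, not Table 8; the CPU row is kept as a reference.
configs = [
    ("TRT5 C++ API, INT8", 32, 6.8),
    ("TF-TRT, INT8", 16, 7.1),
    ("TensorFlow FP32, CPU only", 1, 115.0),
]

TARGET_MS = 7.0
for name, batch, latency_ms in configs:
    throughput = batch / (latency_ms / 1000.0)  # images per second
    tag = "meets ~7ms target" if latency_ms <= TARGET_MS * 1.1 else "reference"
    print(f"{name}: batch={batch}, latency={latency_ms} ms, "
          f"throughput={throughput:.0f} img/s ({tag})")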