Command line to execute the Native TensorRT™ C++ API benchmark:
./trtexec
--uff=/home/dell/chest-x-ray/output_convert_to_uff/chexnet_frozen_graph_1541777429.uff
--output=chexnet_sigmoid_tensor
--uffInput=input_tensor,3,256,256
--iterations=40 --int8 --batch=1
--device=0
--avgRuns=100
Where:
--uff: UFF file location
--output: output tensor name
--uffInput: input tensor name and its dimensions for the UFF parser (in CHW format)
--iterations: run N iterations
--int8: run in INT8 precision mode
--batch: set the batch size
--device: set the CUDA device to N
--avgRuns: set avgRuns to N; performance is measured as the average over N runs
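As a minimal sketch of driving this benchmark programmatically, the Python snippet below wraps the same trtexec invocation and sweeps the batch size. The UFF path and tensor names are copied from the command above; the batch-size sweep itself is our assumption for illustration, not part of the paper's procedure.

import subprocess

# Assumed sweep over batch sizes; the path and tensor names come from the
# trtexec command documented above.
UFF = ("/home/dell/chest-x-ray/output_convert_to_uff/"
       "chexnet_frozen_graph_1541777429.uff")

for batch in (1, 2, 4, 8):
    cmd = [
        "./trtexec",
        f"--uff={UFF}",
        "--output=chexnet_sigmoid_tensor",
        "--uffInput=input_tensor,3,256,256",
        "--iterations=40",
        "--int8",
        f"--batch={batch}",
        "--device=0",
        "--avgRuns=100",
    ]
    # Capture stdout so the per-run averages can be parsed afterwards.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(f"batch={batch}")
    print(result.stdout)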
Sample script output:
Average over 100 runs is 1.4675 ms (host walltime is 1.57855 ms, 99% percentile time is 1.54624).
Average over 100 runs is 1.48153 ms (host walltime is 1.59364 ms, 99% percentile time is 1.5831).
Average over 100 runs is 1.4899 ms (host walltime is 1.6021 ms, 99% percentile time is 1.58061).
Average over 100 runs is 1.47487 ms (host walltime is 1.58658 ms, 99% percentile time is 1.56506).
Average over 100 runs is 1.47848 ms (host walltime is 1.59125 ms, 99% percentile time is 1.56266).
Average over 100 runs is 1.48204 ms (host walltime is 1.59392 ms, 99% percentile time is 1.57078).
Average over 100 runs is 1.48219 ms (host walltime is 1.59398 ms, 99% percentile time is 1.5673).
󰇡


󰇢 󰇡


󰇛

󰇜
󰇢  󰇡

󰇢  
Latency (msec): 1.48219
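The latency value above matches the final "Average over 100 runs" line in the sample output. As a hedged sketch (not part of the paper's tooling), the snippet below parses such lines with a regular expression and reports the last average; the helper name and usage are our own.

import re

# Hedged sketch: parse trtexec "Average over N runs" lines like the
# sample output above.
LINE_RE = re.compile(
    r"Average over (\d+) runs is ([0-9.]+) ms "
    r"\(host walltime is ([0-9.]+) ms, 99% percentile time is ([0-9.]+)\)"
)

def parse_averages(output: str) -> list:
    """Return every per-run average (ms) found in trtexec output."""
    return [float(m.group(2)) for m in LINE_RE.finditer(output)]

sample = ("Average over 100 runs is 1.48219 ms (host walltime is "
          "1.59398 ms, 99% percentile time is 1.5673).")
averages = parse_averages(sample)
print(f"Latency (msec): {averages[-1]}")  # -> Latency (msec): 1.48219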
In Figure 18 we observed that CheXNet inference optimized with the native TRT5 C++ API
performed ~2X faster than with the TF-TRT integration API. This advantage appeared only
at batch sizes 1 and 2; the margin of the TRT5 C++ API over the TF-TRT API gradually
decreased as the batch size increased. We are still working with the Nvidia developer
group to determine the expected performance of both API implementations.
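To make the speedup arithmetic concrete, here is a small worked example; the latency values are placeholders invented for illustration, not the measured data behind Figure 18.

# Illustrative only: placeholder latencies in ms, not Figure 18's data.
# Speedup is the ratio of TF-TRT latency to native TRT5 C++ API latency.
tf_trt_ms = {1: 3.0, 2: 3.2, 4: 4.8, 8: 8.5}  # hypothetical
trt5_ms = {1: 1.5, 2: 1.6, 4: 3.4, 8: 7.2}    # hypothetical

for batch in sorted(trt5_ms):
    speedup = tf_trt_ms[batch] / trt5_ms[batch]
    print(f"batch={batch}: TRT5 C++ API is {speedup:.2f}x faster")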
Further, Figure 19 shows the latency curves of the TRT5 C++ API versus the TF-TRT
API; lower latency is better, and the TRT5 C++ API achieved the lower latency.
5.7 CheXNet Inference Throughput with TensorRT™ at ~7ms Latency Target
The ~7ms latency target is critical, mainly for real-time applications. In this section we
selected all configurations that performed at that latency target; see Table 8 below for the
selected tests. We also included the TensorFlow-FP32-CPU-only inference as a reference,
since its latency was ~115ms.
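As a hedged sketch of the selection described above, the snippet below filters hypothetical (configuration, batch size, latency) entries to the ~7ms target and derives throughput as batch size divided by latency. The entries are placeholders, not the contents of Table 8.

# Hypothetical entries, not Table 8; the CPU row is kept as a reference.
configs = [
    ("TRT5 C++ API, INT8", 32, 6.8),
    ("TF-TRT, INT8", 16, 7.1),
    ("TensorFlow FP32, CPU only", 1, 115.0),
]

TARGET_MS = 7.0
for name, batch, latency_ms in configs:
    throughput = batch / (latency_ms / 1000.0)  # images per second
    tag = "meets ~7ms target" if latency_ms <= TARGET_MS * 1.1 else "reference"
    print(f"{name}: batch={batch}, latency={latency_ms} ms, "
          f"throughput={throughput:.0f} img/s ({tag})")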