CheXNet – Inference with Nvidia T4 on Dell EMC PowerEdge R7425
Table 5 consolidates the results of CheXNet inference in native TensorFlow FP32 mode versus TF-TRT 5.0 integration in INT8 mode, in terms of throughput and latency. We observed large differences when running the test in the different configurations. For the speedup factors, see the tables that follow.
Table 5. Throughput and Latency: Native TensorFlow FP32 versus TF-TRT 5.0 Integration INT8

| Batch Size | TF-TRT INT8 Throughput (img/sec) | TF-TRT INT8 Latency (ms) | Native TF FP32-GPU Throughput (img/sec) | Native TF FP32-GPU Latency (ms) | Native TF FP32-CPU Only Throughput (img/sec) | Native TF FP32-CPU Only Latency (ms) |
|---|---|---|---|---|---|---|
| 1 | 315 | 3 | 142 | 7 | 9 | 115 |
| 2 | 544 | 4 | 198 | 10 | 11 | 195 |
| 4 | 901 | 5 | 251 | 16 | 14 | 292 |
| 8 | 1281 | 7 | 284 | 28 | 19 | 431 |
| 16 | 1456 | 11 | 307 | 55 | 22 | 755 |
| 32 | 1549 | 21 | 329 | 98 | 25 | 1356 |
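As a sanity check on Table 5, the throughput and latency columns are roughly two views of the same measurement: latency is approximately the batch size divided by throughput. A minimal sketch in Python (the numbers are the TF-TRT INT8 rows from Table 5; the formula is our assumption about how the two columns relate, and it reproduces the reported latencies only approximately):

```python
# Implied per-batch latency from a throughput measurement:
# latency_ms ~= batch_size / throughput * 1000
def latency_ms(batch_size, throughput_img_per_sec):
    """Latency in milliseconds implied by a throughput in img/sec."""
    return batch_size / throughput_img_per_sec * 1000.0

# (batch size, TF-TRT INT8 throughput in img/sec) rows from Table 5
rows = [(1, 315), (2, 544), (4, 901), (8, 1281), (16, 1456), (32, 1549)]

for batch, throughput in rows:
    print(f"batch {batch:>2}: ~{latency_ms(batch, throughput):.1f} ms")
```

The implied values track the reported latency column (for example, batch 32 at 1549 img/sec implies about 21 ms, as reported).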
In Table 6 we calculated the speedup factor of TF-TRT 5.0 integration in INT8 mode versus native TensorFlow FP32 on GPU. The PowerEdge R7425-T4 server performed on average 4X faster than native TensorFlow on GPU when the workloads were accelerated with TF-TRT integration.
Table 6. PowerEdge R7425-T4 Speedup Factor with TF-TRT versus native TensorFlow-GPU

| Batch Size | TF-TRT INT8 Throughput (img/sec) | Native TensorFlow FP32-GPU Throughput (img/sec) | Speedup Factor |
|---|---|---|---|
| 1 | 315 | 142 | 2X |
| 2 | 544 | 198 | 3X |
| 4 | 901 | 251 | 4X |
| 8 | 1281 | 284 | 5X |
| 16 | 1456 | 307 | 5X |
| 32 | 1549 | 329 | 5X |
| Average | | | 4X |
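The speedup factors above appear to be the ratio of the two throughput columns, rounded to the nearest whole multiple. A minimal sketch (throughput figures taken from Table 6; the ratio-then-round computation is our assumption about how the factors were derived, but it reproduces the reported values):

```python
# batch size -> (TF-TRT INT8 img/sec, native TensorFlow FP32-GPU img/sec),
# throughput figures from Table 6
pairs = {
    1: (315, 142), 2: (544, 198), 4: (901, 251),
    8: (1281, 284), 16: (1456, 307), 32: (1549, 329),
}

# speedup factor = TF-TRT throughput / native GPU throughput
speedups = {b: trt / native for b, (trt, native) in pairs.items()}
average = sum(speedups.values()) / len(speedups)

for b, s in sorted(speedups.items()):
    print(f"batch {b:>2}: {round(s)}X")
print(f"average: {round(average)}X")  # ~4X, as reported in Table 6
```

The same ratio applied to the CPU-only column yields the factors in Table 7.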
In Table 7 we calculated the speedup factor of TF-TRT 5.0 integration in INT8 mode versus native TensorFlow FP32 on CPU only. The PowerEdge R7425-T4 server performed on average 58X faster than native TensorFlow on CPU only when the workloads were accelerated with TF-TRT integration.
Table 7. PowerEdge R7425-T4 Speedup Factor with TF-TRT versus native TensorFlow-CPU Only

| Batch Size | TF-TRT INT8 Throughput (img/sec) | Native TensorFlow FP32-CPU Only Throughput (img/sec) | Speedup Factor |
|---|---|---|---|
| 1 | 315 | 9 | 35X |
| 2 | 544 | 11 | 51X |
| 4 | 901 | 14 | 63X |
| 8 | 1281 | 19 | 67X |
| 16 | 1456 | 22 | 66X |
| 32 | 1549 | 25 | 63X |
| Average | | | 58X |