plotted as a “performance per watt” metric. The power consumption was measured by subtracting the
idle system power from the power drawn while running inference. Both the images/sec and
images/sec/watt numbers are relative to the numbers on one P40. Figure 3 shows the
performance with different batch sizes on 1 GPU; both metrics are relative to the numbers
on P40 with batch size 1. INT8 operations were used in all figures. The following conclusions can be
drawn:
Performance: With the same number of GPUs, the inference performance on P4 is around half of
that on P40. This is consistent with the theoretical INT8 performance of the two GPUs: 22
TIOPS on P4 vs. 47 TIOPS on P40 for a single GPU. Also, inference with larger batch sizes gives
higher overall throughput but consumes more memory. Because P4 has only 8 GB of memory compared
to 24 GB on P40, P4 could not complete the inference with batch size 2048 or larger.
Scalability: The performance scales linearly on both P40s and P4s when multiple GPUs are used,
because no communication takes place between the GPUs used in the test.
Efficiency (performance/watt): The performance/watt on P4 is ~1.5x that on P40. This is also
consistent with the theoretical efficiency difference: the theoretical performance of P4
is about 1/2 of P40 and its TDP is around 1/3 of P40 (75 W vs. 250 W), so its performance/watt is
~1.5x that of P40.
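The efficiency argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the script used for the measurements: the constants are the theoretical figures quoted in the text, while the helper names `perf_per_watt` and `relative` are hypothetical and only illustrate the idle-subtracted power metric and the normalization to a baseline.

```python
# Theoretical figures quoted in the text above.
P4_TIOPS, P40_TIOPS = 22.0, 47.0    # peak INT8 throughput per GPU
P4_TDP_W, P40_TDP_W = 75.0, 250.0   # thermal design power per GPU


def perf_per_watt(images_per_sec, load_power_w, idle_power_w):
    """Images/sec per watt, with power taken net of the idle baseline.

    Hypothetical helper: mirrors the "power under inference minus
    idle power" method described in the text.
    """
    net_power_w = load_power_w - idle_power_w
    if net_power_w <= 0:
        raise ValueError("load power must exceed idle power")
    return images_per_sec / net_power_w


def relative(value, baseline):
    """Normalize a metric to a baseline, e.g. the value on one P40."""
    return value / baseline


# Theoretical P4-to-P40 ratios.
perf_ratio = P4_TIOPS / P40_TIOPS          # about 0.47, i.e. roughly 1/2
tdp_ratio = P4_TDP_W / P40_TDP_W           # 0.30, i.e. roughly 1/3
efficiency_ratio = perf_ratio / tdp_ratio  # about 1.56, i.e. roughly 1.5x

print(f"perf ratio:       {perf_ratio:.2f}")
print(f"TDP ratio:        {tdp_ratio:.2f}")
print(f"efficiency ratio: {efficiency_ratio:.2f}")
```

Note that TDP is only a proxy for the power actually drawn during inference, so the theoretical ratio is expected to approximate the measured one rather than match it exactly.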
Figure 1: The inference performance with AlexNet on P40 and P4
[Figure 1 chart: "P40 vs P4 for AlexNet on R740 (batch_size=128)". X-axis: number of GPUs (1 to 4). Y-axes: Relative Images/sec and Relative Images/sec/Watt. Series: Perf - P40, Perf - P4, Perf/Watt - P40, Perf/Watt - P4.]

Data labels, relative to one P40:

Series            1 GPU   2 GPUs   3 GPUs   4 GPUs
Perf - P40        1.00    2.00     2.99     -
Perf - P4         0.52    1.05     1.56     2.09
Perf/Watt - P40   1.00    1.02     1.00     -
Perf/Watt - P4    1.42    1.59     1.55     1.52
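The linear-scaling and "around half" observations can be cross-checked against the data labels of Figure 1. The sketch below assumes the labels map to the series in legend order (three values for P40, four for P4); that mapping is an inference from the chart, and the function name `scaling_efficiency` is hypothetical.

```python
# Relative images/sec data labels from Figure 1, keyed by GPU count.
# Assumption: labels are assigned to series in legend order.
perf_p40 = {1: 1.00, 2: 2.00, 3: 2.99}
perf_p4 = {1: 0.52, 2: 1.05, 3: 1.56, 4: 2.09}


def scaling_efficiency(perf_by_gpus):
    """Throughput at n GPUs divided by n times the single-GPU throughput.

    A value of 1.0 means perfectly linear scaling.
    """
    single = perf_by_gpus[1]
    return {n: perf / (n * single) for n, perf in perf_by_gpus.items()}


p40_eff = scaling_efficiency(perf_p40)
p4_eff = scaling_efficiency(perf_p4)

# P4 throughput as a fraction of P40 throughput at equal GPU counts.
p4_vs_p40 = {n: perf_p4[n] / perf_p40[n] for n in perf_p40}

for n in sorted(p4_eff):
    print(n, round(p4_eff[n], 3))
```

Under this mapping, every scaling efficiency lies within about 1% of 1.0, and the P4-to-P40 throughput ratio stays close to 0.52 at each GPU count, which agrees with the conclusions stated in the text.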