plotted as a “performance per watt” metric. The power consumption was measured by subtracting the idle system power from the power drawn while running the inference. Both the images/sec and images/sec/watt numbers are relative to those of one P40. Figure 3 shows the performance with different batch sizes on 1 GPU, with both metrics relative to the P40 at batch size 1. INT8 operations were used in all figures. The following conclusions can be drawn:

• Performance: with the same number of GPUs, the inference performance on P4 is around half of that on P40. This is consistent with the theoretical INT8 performance of the two GPUs: 22 TIOPS on a single P4 vs 47 TIOPS on a single P40. Also, inference with larger batch sizes gives higher overall throughput but consumes more memory; because P4 has only 8 GB of memory compared to 24 GB on P40, P4 could not complete the inference with batch size 2048 or larger.

• Scalability: the performance scales linearly on both P40s and P4s when multiple GPUs are used, because no communication happens between the GPUs used in the test.

• Efficiency (performance/watt): the performance/watt on P4 is ~1.5x that on P40. This is also consistent with the theoretical efficiency difference: the theoretical performance of P4 is about 1/2 that of P40 and its TDP is around 1/3 of P40's (75W vs 250W), so its performance/watt is ~1.5x that of P40.
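The efficiency argument above can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the tooling used for the measurements: the function names are hypothetical, and the wattage and throughput values in the last example are made-up placeholders. Only the TIOPS and TDP figures come from the text.

```python
def net_power_watts(load_watts: float, idle_watts: float) -> float:
    """Power attributed to inference: power under load minus idle power."""
    net = load_watts - idle_watts
    if net <= 0:
        raise ValueError("load power must exceed idle power")
    return net


def perf_per_watt(images_per_sec: float, load_watts: float, idle_watts: float) -> float:
    """Images/sec divided by the net (load minus idle) power."""
    return images_per_sec / net_power_watts(load_watts, idle_watts)


def relative_to(value: float, baseline: float) -> float:
    """Express a metric relative to a baseline (for example, one P40)."""
    return value / baseline


# Theoretical figures quoted in the text.
P4_TIOPS, P40_TIOPS = 22.0, 47.0
P4_TDP_W, P40_TDP_W = 75.0, 250.0

perf_ratio = relative_to(P4_TIOPS, P40_TIOPS)    # roughly 1/2
power_ratio = relative_to(P4_TDP_W, P40_TDP_W)   # roughly 1/3
efficiency_ratio = perf_ratio / power_ratio      # roughly 1.5x

print(f"P4/P40 theoretical performance: {perf_ratio:.2f}")
print(f"P4/P40 TDP:                     {power_ratio:.2f}")
print(f"P4/P40 performance/watt:        {efficiency_ratio:.2f}")

# Hypothetical measurement (placeholder numbers, not from the paper):
# 1000 images/sec while drawing 350 W, with the system idling at 250 W.
example = perf_per_watt(1000.0, load_watts=350.0, idle_watts=250.0)
print(f"Example images/sec/watt:        {example:.1f}")
```

Dividing the exact ratios (22/47 and 75/250) gives a value slightly above 1.5, in line with the measured perf/watt advantage reported for P4.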

Figure 1: The inference performance with AlexNet on P40 and P4

[Chart: P40 vs P4 for AlexNet on R740 (batch_size=128). X-axis: 1 GPU, 2 GPUs, 3 GPUs, 4 GPUs. Y-axes: Relative Images/sec (0.00 to 3.50) and Relative Images/sec/Watt (0.00 to 1.80). Data labels by series: Perf - P40: 1.00, 2.00, 2.99; Perf - P4: 0.52, 1.05, 1.56, 2.09; Perf/Watt - P40: 1.00, 1.02, 1.00; Perf/Watt - P4: 1.42, 1.59, 1.55, 1.52.]
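The linear-scaling claim can be sanity-checked against the data labels of Figure 1. The sketch below assumes the labels are read in legend order (the first three values for P40 and the next four for P4), which is an interpretation of the extracted chart rather than something the text states. The helper name `scaling_efficiency` is hypothetical.

```python
# Relative images/sec by GPU count, taken from the Figure 1 data labels.
# Assumption: labels are listed in legend order (Perf - P40, then Perf - P4).
relative_perf = {
    "P40": {1: 1.00, 2: 2.00, 3: 2.99},
    "P4": {1: 0.52, 2: 1.05, 3: 1.56, 4: 2.09},
}


def scaling_efficiency(perf_by_gpus: dict) -> dict:
    """Return measured throughput divided by ideal linear throughput.

    A value of 1.0 means perfectly linear scaling from the 1-GPU result.
    """
    single = perf_by_gpus[1]
    return {n: perf / (n * single) for n, perf in perf_by_gpus.items()}


efficiencies = {gpu: scaling_efficiency(series) for gpu, series in relative_perf.items()}

for gpu, by_count in efficiencies.items():
    for n, eff in sorted(by_count.items()):
        print(f"{gpu} x{n}: scaling efficiency {eff:.3f}")
```

Under this reading, every configuration stays within about one percent of ideal linear scaling, which matches the observation that the GPUs do not communicate during the test.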