Figure 6  Multi-node PowerEdge C4140-M, several CNN models, TF 1.10 vs TF 1.14 (speedup factor)
Performance Gain with XLA
Since there was little performance gain with the basic configuration, we decided to explore
the limits of GPU performance using other parameters. We looked at XLA (Accelerated Linear
Algebra) [3], enabled by adding the flag xla=true at the script level. By default, the TensorFlow
graph executor runs each operation with an individual kernel: one kernel for the multiplication,
one for the addition, and one for the reduction. With XLA, these operations are "fused" into a
single kernel, which keeps the intermediate and final results on the GPU, reducing memory
operations and therefore improving performance.
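As a rough illustration of what the xla=true flag does under the hood, the sketch below
enables XLA's JIT compiler through the TF 1.x session configuration; the multiply-add-reduce
graph and tensor shapes are illustrative assumptions, not taken from the benchmark scripts.

    import numpy as np
    import tensorflow as tf

    # Minimal sketch (TF 1.x API, matching the TF 1.10/1.14 builds used here):
    # turn on XLA JIT compilation for the whole session.
    config = tf.ConfigProto()
    config.graph_options.optimizer_options.global_jit_level = \
        tf.OptimizerOptions.ON_1

    # Without XLA, each op below launches its own GPU kernel
    # (matmul, add, reduce). With XLA, the three ops can be fused into one
    # kernel, so intermediate results stay on the GPU between operations.
    x = tf.placeholder(tf.float32, shape=[None, 1024])
    w = tf.Variable(tf.random_normal([1024, 1024]))
    b = tf.Variable(tf.zeros([1024]))
    y = tf.reduce_sum(tf.matmul(x, w) + b)  # multiply -> add -> reduce

    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(y, feed_dict={x: np.ones((8, 1024), dtype=np.float32)}))

For reference, a typical benchmark invocation with XLA enabled might look like the following
(model and batch size here are illustrative, not the exact values used in our runs):

    python tf_cnn_benchmarks.py --model=resnet50 --num_gpus=4 \
        --batch_size=256 --xla=true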
See Figure 7 below for the results and Figure 8 for the speedup factors across the models. The
Inception-v4, Inception-v3, and ResNet-50 models showed much better performance with XLA,
with speedup factors from 1.35X up to 1.43X. Since ResNet-50 is the most widely used of these
models, we used it for the remainder of the tests.