Figure 6  Multi-node PowerEdge C4140-M, several CNN models, TF 1.10 vs TF 1.14 (speedup factor)
Performance Gain with XLA
Since there was little performance gain with the basic configuration, we decided to explore
the limits of GPU performance using other parameters. We looked at XLA (Accelerated Linear
Algebra) [3], enabled by adding the flag xla=true at the script level. By default, the TensorFlow
graph executor runs each operation with an individual kernel: one kernel for the multiplication,
one for the addition, and one for the reduction. With XLA, these operations are "fused" into a
single kernel, which keeps the intermediate and final results on the GPU, reducing memory
operations and therefore improving performance.
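As a rough illustration of what the xla=true flag does under the hood, the sketch below
enables XLA's JIT compiler through the TF 1.x session configuration; the multiply-add-reduce
graph and tensor shapes are illustrative assumptions, not taken from the benchmark scripts.

    import numpy as np
    import tensorflow as tf

    # Minimal sketch (TF 1.x API, matching the TF 1.10/1.14 builds used here):
    # turn on XLA JIT compilation for the whole session.
    config = tf.ConfigProto()
    config.graph_options.optimizer_options.global_jit_level = \
        tf.OptimizerOptions.ON_1

    # Without XLA, each op below launches its own GPU kernel
    # (matmul, add, reduce). With XLA, the three ops can be fused into one
    # kernel, so intermediate results stay on the GPU between operations.
    x = tf.placeholder(tf.float32, shape=[None, 1024])
    w = tf.Variable(tf.random_normal([1024, 1024]))
    b = tf.Variable(tf.zeros([1024]))
    y = tf.reduce_sum(tf.matmul(x, w) + b)  # multiply -> add -> reduce

    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(y, feed_dict={x: np.ones((8, 1024), dtype=np.float32)}))

For reference, a typical benchmark invocation with XLA enabled might look like the following
(model and batch size here are illustrative, not the exact values used in our runs):

    python tf_cnn_benchmarks.py --model=resnet50 --num_gpus=4 \
        --batch_size=256 --xla=true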
See Figure 7 below for the results and Figure 8 for the speedup factors across the models. The
Inception-v4, Inception-v3, and ResNet-50 models showed much better performance with XLA,
with speedup factors from 1.35X up to 1.43X. Since ResNet-50 is the most widely used of these
models, we used it for the remainder of the tests.