Optimizing Itanium-Based Applications (May 2011)
Currently, this optimization is limited to a very restricted set of scenarios. Use +Oinfo to
determine whether this optimization has been performed.
The compiler can also perform non-contiguous array fusion. For some multi-dimensional,
non-contiguous, pointer-based arrays, the compiler modifies the declaration, allocation, and uses of such
arrays to use a contiguous memory layout instead. This transformation allows more efficient
element access and improves cache utilization.
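As a hand-written illustration of the effect (not the compiler's actual output), the following C sketch contrasts a pointer-based two-dimensional array, where each row is a separate allocation, with the equivalent contiguous layout. All function names and the indexing scheme are hypothetical and chosen only for this example.

```c
#include <stdlib.h>

/* Before: one allocation per row. a[i][j] needs two dependent loads,
   and the rows may be scattered across the heap. */
double **alloc_rows(size_t n, size_t m)
{
    double **a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        a[i] = calloc(m, sizeof **a);
        if (a[i] == NULL) {          /* undo the partial allocation */
            while (i > 0)
                free(a[--i]);
            free(a);
            return NULL;
        }
    }
    return a;
}

void free_rows(double **a, size_t n)
{
    if (a == NULL)
        return;
    for (size_t i = 0; i < n; i++)
        free(a[i]);
    free(a);
}

double sum_rows(double **a, size_t n, size_t m)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            s += a[i][j];
    return s;
}

/* After: a single contiguous block; element (i, j) is a[i * m + j]. */
double *alloc_fused(size_t n, size_t m)
{
    return calloc(n * m, sizeof(double));
}

double sum_fused(const double *a, size_t n, size_t m)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            s += a[i * m + j];
    return s;
}
```

In the contiguous form, an element access is a single address computation followed by one load, and consecutive rows are adjacent in memory.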
Interprocedural constant promotion is performed.
The compiler inserts interprocedural data prefetches before call sites for data accessed in the call chain
rooted at the call site. The inserted prefetches attempt to fetch data accessed via dereferences of
pointer parameters of the call.
The interprocedural analysis phase is also able to expose and warn about additional source problems, for
example, variables that are declared with incompatible attributes in different source files.
The interprocedural optimization framework has been designed to scale to very large applications.
Nothing changes from a user’s perspective; in particular, existing build processes do not have
to be modified. Because IPO and code generation are performed at link time, link time may increase
significantly.
At the end of the IPO phase, the code generation and low-level optimization phase is started by invoking
multiple parallel processes of the binary ‘be’. The default number of parallel ‘be’ processes is set to the
number of processors on the machine. This number can be overridden by setting the environment variable
“PARALLEL”, for example:
export PARALLEL=4
loop optimizations at +O3 or +O4
The high level loop optimizer performs the following classical loop optimizations based on array access
patterns (the loop optimizer is fully enabled at +O3 or +O4, with a limited subset enabled at +O2). These
optimizations are designed to improve locality of array accesses, improving the utilization of the data
cache.
loop interchange
If the compiler finds a perfect loop nest (no statements before or after nested loops), it will analyze the
memory access patterns, which are implicitly defined by the iteration space, and determine legality and
profitability of interchanging an inner loop with an outer loop. For certain loops, this transformation can
significantly reduce data cache misses.
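The following C sketch shows the kind of source-level change that loop interchange corresponds to. It is a hand-written illustration, not compiler output; the function names and the array size N are hypothetical. Because C stores arrays in row-major order, the first version touches memory with a stride of N elements in its inner loop, while the interchanged version has unit stride.

```c
#define N 64

/* Before: the inner loop walks down a column, so consecutive accesses
   are N doubles apart in C's row-major layout. */
void scale_column_order(double a[N][N], double k)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] *= k;
}

/* After interchange: the inner loop walks along a row (unit stride).
   The swap is legal here because every iteration is independent. */
void scale_row_order(double a[N][N], double k)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] *= k;
}
```

Both functions compute the same result; only the order in which elements are visited differs.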
loop distribution
Loop distribution seeks to break a single loop into two or more loops. This transformation may remove
loop-carried dependencies, which may result in more efficient code. It is an enabler for loop interchange, as
more perfect loop nests may be generated, and it may also result in more modulo-scheduled loops in the
low-level optimizer. Finally, this transformation may alleviate register pressure in the low-level
optimizer.
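A minimal C sketch of loop distribution, written by hand for illustration (the function and parameter names are hypothetical, and the arrays are assumed not to overlap): the original loop mixes a recurrence with an independent statement, and distribution isolates the recurrence so that the second loop carries no dependence between iterations.

```c
/* Before: one loop containing a recurrence on a[] and an independent
   update of c[]. The arrays are assumed not to overlap. */
void combined(double *a, const double *b, double *c, const double *d, int n)
{
    for (int i = 1; i < n; i++) {
        a[i] = a[i - 1] + b[i];   /* loop-carried dependence */
        c[i] = d[i] * 2.0;        /* independent of other iterations */
    }
}

/* After distribution: the recurrence is isolated in its own loop, and
   the second loop has no loop-carried dependence. */
void distributed(double *a, const double *b, double *c, const double *d, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
    for (int i = 1; i < n; i++)
        c[i] = d[i] * 2.0;
}
```

Each of the smaller loops also keeps fewer values live at once, which is how distribution can reduce register pressure.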
loop fusion
Loop fusion is the opposite of loop distribution: two loops are merged into a single loop. This
transformation usually has positive effects on cache utilization when both loops access the same arrays in a
similar order.
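A hand-written C sketch of loop fusion (the function and parameter names are hypothetical, and the arrays are assumed not to overlap): both original loops read a[] in the same order, so after fusion each element of a[] is loaded once and reused while it is still in a register or in the cache.

```c
/* Before: two loops that each traverse a[] from start to end. */
void separate_loops(const double *a, double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        b[i] = a[i] + 1.0;
    for (int i = 0; i < n; i++)
        c[i] = a[i] * 2.0;
}

/* After fusion: a single traversal; a[i] is read once per iteration
   and used for both results. */
void fused_loop(const double *a, double *b, double *c, int n)
{
    for (int i = 0; i < n; i++) {
        b[i] = a[i] + 1.0;
        c[i] = a[i] * 2.0;
    }
}
```

For arrays larger than the data cache, the unfused version may have to fetch a[] from memory twice, while the fused version fetches it once.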
loop unswitching
Loop unswitching (also known as if-do promotion) seeks to hoist an if statement out of a loop. If a loop
contains an if statement with a test based on the loop induction variable and a loop-invariant value, it can be
beneficial to move the if before the loop and to duplicate the loop body into a first form for which the if test
is always true, and a second form for which the if test is always false. This transformation has the