Optimizing Itanium-Based Applications (May 2011)

Currently, this optimization is limited to a very restricted set of scenarios. Please use +Oinfo to

determine whether this optimization has been performed.

• The compiler can also perform non-contiguous array fusion. For some multi-dimensional, non-contiguous, pointer-based arrays, the compiler will modify the declaration, allocation, and uses of such arrays to instead use a contiguous memory layout. This transformation both allows for more efficient element access and results in better cache utilization.

• Interprocedural constant promotion is performed.

• The compiler inserts interprocedural data prefetches before call sites for data accessed in the call chain rooted at the call site. The inserted prefetches will attempt to fetch data accessed via dereferences of pointer parameters of the call.
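To picture the non-contiguous array fusion item in the list above, the following is a hand-written sketch of the kind of source change it corresponds to. It is an illustration only, not the compiler's actual output: the function names (alloc_rows, free_rows, alloc_flat, flat_at) are hypothetical, and a row-major layout is assumed.

```c
#include <assert.h>
#include <stdlib.h>

/* Before: a pointer-based 2-D array. Every row is its own heap block,
 * so an element access needs two dependent loads and rows may be
 * scattered across memory. */
double **alloc_rows(size_t rows, size_t cols)
{
    double **a = malloc(rows * sizeof *a);
    if (a == NULL)
        return NULL;
    for (size_t i = 0; i < rows; i++) {
        a[i] = calloc(cols, sizeof *a[i]);
        if (a[i] == NULL) {            /* undo the partial allocation */
            while (i > 0)
                free(a[--i]);
            free(a);
            return NULL;
        }
    }
    return a;
}

void free_rows(double **a, size_t rows)
{
    for (size_t i = 0; i < rows; i++)
        free(a[i]);
    free(a);
}

/* After (hand-written equivalent): one contiguous, zero-filled block. */
double *alloc_flat(size_t rows, size_t cols)
{
    return calloc(rows * cols, sizeof(double));
}

/* Element (i, j) lives at offset i * cols + j (assumed row-major),
 * so a single load reaches it. */
double *flat_at(double *a, size_t cols, size_t i, size_t j)
{
    assert(a != NULL && j < cols);
    return &a[i * cols + j];
}
```

In the first form, a[i][j] first loads the row pointer a[i] and then the element. In the second form, the address is computed directly from the base pointer, and neighboring rows are adjacent in memory.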

The interprocedural analysis phase is also able to expose and warn on additional source problems, for

example, for variables that are declared with incompatible attributes in different source files.

The interprocedural optimization framework has been designed to scale to very large applications. Fortunately, nothing changes from a user’s perspective; in particular, existing build processes do not have to be modified. Because IPO and code generation are performed at link time, the link time may increase significantly.

At the end of the IPO phase, the code generation and low-level optimization phase is started by invoking

multiple parallel processes of the binary ‘be’. The default number of parallel ‘be’ processes is set to the

number of processors on a machine. This number can be overridden by setting the environment variable

“PARALLEL”, for example:

export PARALLEL=4

loop optimizations at +O3 or +O4

The high level loop optimizer performs the following classical loop optimizations based on array access

patterns (the loop optimizer is fully enabled at +O3 or +O4, with a limited subset enabled at +O2). These

optimizations are designed to improve locality of array accesses, improving the utilization of the data

cache.

loop interchange

If the compiler finds a perfect loop nest (all statements are in the innermost loop, with none before or after the nested loops), it will analyze the

memory access patterns, which are implicitly defined by the iteration space, and determine legality and

profitability of interchanging an inner loop with an outer loop. For certain loops, this transformation can

significantly reduce data cache misses.
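As an illustration of what interchange does, the sketch below shows a perfect loop nest before and after the two loops are swapped by hand. The function names are hypothetical, and the array is assumed to be stored row-major in a flat buffer. Because every iteration touches a different element, the interchange is legal here.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the inner loop runs over i, so consecutive iterations touch
 * elements that are cols apart in memory (a large stride). */
void scale_column_order(double *a, size_t rows, size_t cols, double k)
{
    assert(a != NULL);
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            a[i * cols + j] *= k;
}

/* After interchange: the inner loop runs over j, so consecutive
 * iterations touch adjacent elements (unit stride). */
void scale_row_order(double *a, size_t rows, size_t cols, double k)
{
    assert(a != NULL);
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            a[i * cols + j] *= k;
}
```

Both functions compute the same result. Only the order in which memory is visited differs, which is what determines how many cache lines are fetched per useful element.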

loop distribution

Loop distribution seeks to break a single loop into two or more loops. This transformation may remove

loop-carried dependencies, which may result in more efficient code. It is an enabler for loop interchange, as

more perfect loop nests may be generated, and it may also result in more modulo-scheduled loops in the

low level optimizer. Finally, this transformation may alleviate the register pressure for the low level

optimizer.
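A minimal hand-written sketch of distribution follows. The function names are hypothetical, and the three arrays are assumed not to overlap. The original loop mixes a recurrence on a with an independent computation of c; after distribution, the second loop carries no dependence between iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Before: one loop containing a recurrence (a[i] depends on a[i - 1])
 * and an unrelated, fully independent statement. */
void prefix_and_scale(double *a, const double *b, double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 1; i < n; i++) {
        a[i] = a[i - 1] + b[i];   /* loop-carried dependence */
        c[i] = 2.0 * b[i];        /* independent of other iterations */
    }
}

/* After distribution: the recurrence is isolated in its own loop, and
 * the second loop is free of loop-carried dependences. */
void prefix_and_scale_distributed(double *a, const double *b, double *c,
                                  size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
    for (size_t i = 1; i < n; i++)
        c[i] = 2.0 * b[i];
}
```

Each of the smaller loops also keeps fewer values live at once than the combined loop does, which is the register pressure effect mentioned above.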

loop fusion

Loop fusion is the opposite of loop distribution: two loops are merged together into a single loop. This

transformation usually has positive effects on cache utilization when both loops access the same arrays in a

similar order.
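The following hand-written sketch shows two loops that read the same arrays in the same order, and their fused form. The function names are hypothetical, and the input and output arrays are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Before: two separate passes over a and b. For large n, the data
 * loaded by the first loop may be evicted before the second loop runs. */
void add_then_mul(const double *a, const double *b,
                  double *sum, double *prod, size_t n)
{
    assert(a != NULL && b != NULL && sum != NULL && prod != NULL);
    for (size_t i = 0; i < n; i++)
        sum[i] = a[i] + b[i];
    for (size_t i = 0; i < n; i++)
        prod[i] = a[i] * b[i];
}

/* After fusion: a[i] and b[i] are loaded once and used for both results. */
void add_and_mul_fused(const double *a, const double *b,
                       double *sum, double *prod, size_t n)
{
    assert(a != NULL && b != NULL && sum != NULL && prod != NULL);
    for (size_t i = 0; i < n; i++) {
        sum[i] = a[i] + b[i];
        prod[i] = a[i] * b[i];
    }
}
```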

loop unswitching

Loop unswitching (also known as if-do promotion) seeks to hoist an if statement out of a loop. If a loop

contains an if statement with a test based on the loop induction variable and a loop invariant value, it can be

beneficial to move the if before the loop and to duplicate the loop body into a first form for which the if test

is always true, and a second form for which the if test is always false. This transformation has the