Optimizing Itanium-Based Applications (May 2011)
effect that the if-statement is now executed only when the loop is reached, and no longer on every loop
iteration.
loop cloning
Loop cloning seeks to special-case loops with variable trip counts with the help of profile information. For
example, if a loop iterates from 0 to N, but the profile information indicates that the loop usually
executes with a constant trip count C, it can be beneficial to special-case the loop for C and to check for
this value at runtime to select the proper loop variant. A loop with a known trip count can be scheduled most
effectively by the low-level optimizer, which can result in dramatic runtime improvements.
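The effect of loop cloning can be pictured at the source level. The following is only a hand-written sketch, not compiler output: the function names and the constant 8 (standing in for the profiled trip count C) are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Original loop: the trip count n is known only at run time. */
void scale(double *a, size_t n, double k)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}

/* Sketch of the cloned form, assuming the profile showed that
   n is 8 on most calls (8 is an illustrative value). */
void scale_cloned(double *a, size_t n, double k)
{
    if (n == 8) {
        /* Clone with a constant trip count: easy to unroll and schedule. */
        for (size_t i = 0; i < 8; i++)
            a[i] *= k;
    } else {
        /* General variant for every other trip count. */
        for (size_t i = 0; i < n; i++)
            a[i] *= k;
    }
}
```

Both variants compute the same result; only the runtime check selects which one executes.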
loop unrolling
The high level optimizer performs full outer loop unrolling for loops with small trip counts.
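As a rough illustration of full outer loop unrolling, consider a nest whose outer loop runs only twice. The function names and array shape below are hypothetical, and the second function is a hand-written approximation of what the transformation produces.

```c
#define COLS 4

/* Original: the outer loop has a small constant trip count of 2. */
void add_rows(int out[COLS], int m[2][COLS])
{
    for (int j = 0; j < COLS; j++)
        out[j] = 0;
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < COLS; j++)
            out[j] += m[i][j];
}

/* Outer loop fully unrolled: the i loop disappears and its body
   is repeated once for each of its two iterations. */
void add_rows_unrolled(int out[COLS], int m[2][COLS])
{
    for (int j = 0; j < COLS; j++)
        out[j] = 0;
    for (int j = 0; j < COLS; j++)
        out[j] += m[0][j];
    for (int j = 0; j < COLS; j++)
        out[j] += m[1][j];
}
```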
loop unroll and jam
The loop unroll and jam transformation performs outer loop unrolling and fusion, which increases
opportunities for scalar replacement. This can reduce the number of memory operations, resulting in better
instruction scheduling.
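A small matrix-vector product shows why unroll and jam helps scalar replacement. This is a sketch under stated assumptions: the names are hypothetical, the unroll factor of 2 is arbitrary, and N is assumed to be even so that no remainder iteration is needed.

```c
#define N 4   /* illustrative size; assumed even for the unrolled version */

/* Original nest: y = A * x. */
void matvec(double y[N], double a[N][N], double x[N])
{
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += a[i][j] * x[j];
    }
}

/* Outer loop unrolled by 2, and the two copies of the inner loop
   fused ("jammed") into one. x[j] is now loaded once and used for
   two rows, and the row sums live in scalars s0 and s1. */
void matvec_uj(double y[N], double a[N][N], double x[N])
{
    for (int i = 0; i < N; i += 2) {
        double s0 = 0.0, s1 = 0.0;
        for (int j = 0; j < N; j++) {
            double xj = x[j];
            s0 += a[i][j] * xj;
            s1 += a[i + 1][j] * xj;
        }
        y[i] = s0;
        y[i + 1] = s1;
    }
}
```

In the jammed version, each inner iteration performs three loads instead of the four that two separate inner iterations would need, and the stores to y move out of the inner loop.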
recognition of memset/memcpy type loops
For loops that essentially copy blocks of data to another memory location, the compiler determines loop
properties, such as the direction of the copy, and then replaces the whole loop with a direct call to a highly
specialized and optimized copy routine.
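The loops this recognition targets look like the first and third functions below. The second and fourth functions show source-level equivalents using the standard memcpy and memset; the routine the compiler actually calls is not named in this document, so the standard library calls are stand-ins, and the copy version assumes that the two buffers do not overlap.

```c
#include <stddef.h>
#include <string.h>

/* A copy loop of the kind the optimizer can recognize. */
void copy_loop(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Equivalent single call; assumes dst and src do not overlap. */
void copy_call(int *dst, const int *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* A fill loop of the memset type. */
void zero_loop(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = 0;
}

/* Equivalent single call. */
void zero_call(int *a, size_t n)
{
    memset(a, 0, n * sizeof *a);
}
```

If the source and destination may overlap, the direction of the copy matters, which is why the compiler has to determine it before replacing the loop.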
loop rerolling
Some user code contains manually unrolled loops. These forms of manual unrolling usually come from
tuning efforts on a particular machine. However, on a different machine, this manually unrolled code may
perform poorly. The compiler tries to identify such unrolled loops and rerolls them by removing the
replicated statements and adjusting the loop bounds and increment. If such a rerolled loop is then passed
through the loop optimizer, better unrolling decisions can be made, depending on machine characteristics.
After loop rerolling, a loop merging pass is run to merge manually unrolled loops and their remainder loops.
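The following sketch shows the kind of source that rerolling targets and the loop it is meant to recover. The function names are hypothetical, and the second function is a hand-written picture of the result after rerolling and merging with the remainder loop.

```c
/* Manually unrolled by 4, followed by a remainder loop. */
long sum_unrolled(const int *a, int n)
{
    long s = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)      /* remainder iterations */
        s += a[i];
    return s;
}

/* Rerolled and merged: one simple loop that the loop optimizer
   can unroll again by whatever factor suits the target machine. */
long sum_rerolled(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```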
loop blocking
Loop blocking is a combination of strip mining and interchange that maximizes data localization. It is
provided primarily to deal with nested loops that manipulate arrays that are too large to fit into the cache.
Under certain circumstances, loop blocking allows reuse of these arrays by transforming the loops that
manipulate them so that they manipulate strips of the arrays that fit into the cache. Effectively, a
blocked loop accesses array elements in sections that are optimally sized to fit in the cache.
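A matrix transpose is a compact way to see strip mining plus interchange. The sizes below are deliberately tiny and purely illustrative; in practice the block size would be chosen from the cache size, and this sketch assumes N is a multiple of B so that no partial blocks occur.

```c
#define N 8   /* illustrative matrix dimension */
#define B 4   /* illustrative block size; N is assumed to be a multiple of B */

/* Original nest: one of the two arrays is walked column-wise,
   which has poor locality once the arrays exceed the cache. */
void transpose(double out[N][N], double in[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[j][i] = in[i][j];
}

/* Blocked form: both loops are strip-mined by B and the strip
   loops are interchanged outward, so each B x B tile of in and
   out is finished before the next tile is touched. */
void transpose_blocked(double out[N][N], double in[N][N])
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    out[j][i] = in[i][j];
}
```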
scalar replacement
The optimizer finds reuses of array locations in a loop and replaces them with uses of scalar temporaries.
These temporaries can be register promoted to reduce memory accesses.
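In the sketch below, the value read as a[i + 1] in one iteration is read again as a[i] in the next. The second function shows, by hand, what scalar replacement would do with that reuse. The names are hypothetical, and the rewrite assumes that b does not alias a.

```c
/* Original: every iteration loads both a[i] and a[i + 1]. */
void pair_sums(double *b, const double *a, int n)
{
    for (int i = 0; i < n - 1; i++)
        b[i] = a[i] + a[i + 1];
}

/* After scalar replacement (assuming b and a do not overlap):
   the reused element is carried in the scalar prev, so each
   iteration performs one load from a instead of two. */
void pair_sums_sr(double *b, const double *a, int n)
{
    if (n < 2)
        return;
    double prev = a[0];
    for (int i = 0; i < n - 1; i++) {
        double next = a[i + 1];
        b[i] = prev + next;
        prev = next;
    }
}
```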
loop multiversioning
The loop optimizer may find that some optimizations can be performed on a loop only if certain conditions
are met (for example, two array references do not overlap). However, some of these conditions may not be
known at compile time. The optimizer can clone the loop, introduce runtime checks for these conditions, and
optimize the cloned loop more aggressively.
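A source-level picture of multiversioning for the non-overlap condition might look like the following. This is a sketch only: the function names are hypothetical, and the overlap test compares addresses converted to uintptr_t, which works on common flat-address platforms but is an assumption rather than something the C standard guarantees.

```c
#include <stddef.h>
#include <stdint.h>

/* Safe version: correct even if dst and src overlap. */
static void add_generic(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Aggressive version: restrict promises there is no overlap,
   so iterations are independent and can be freely reordered. */
static void add_noalias(double *restrict dst,
                        const double *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* The runtime check selects the loop version. */
void add_arrays(double *dst, const double *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    size_t bytes = n * sizeof *dst;

    if (d + bytes <= s || s + bytes <= d)
        add_noalias(dst, src, n);   /* ranges are disjoint */
    else
        add_generic(dst, src, n);   /* possible overlap */
}
```

The check costs a few instructions per call, which is usually small compared with the gain from the more aggressively optimized loop body.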
malloc combining
The optimizer can combine several small block allocations in a loop into a single large block allocation.
This improves locality and reduces the cost of calling the allocation routine.
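The idea can be sketched by hand as follows. The node type and function names are hypothetical. The combined version is valid only if the program never frees or resizes the small blocks individually, which is the kind of condition the optimizer would have to establish before applying the transformation.

```c
#include <stdlib.h>

typedef struct { int id; double value; } node;   /* hypothetical type */

/* Original: one small allocation per loop iteration. */
int alloc_separately(node *out[], int n)
{
    for (int i = 0; i < n; i++) {
        out[i] = malloc(sizeof(node));
        if (out[i] == NULL) {
            while (i > 0)           /* undo earlier allocations */
                free(out[--i]);
            return -1;
        }
        out[i]->id = i;
        out[i]->value = 0.0;
    }
    return 0;
}

/* Combined: a single large allocation carved into n pieces.
   The pieces are adjacent in memory and malloc is called once.
   Returns the block so the caller can free everything at once. */
node *alloc_combined(node *out[], int n)
{
    node *block = malloc((size_t)n * sizeof(node));
    if (block == NULL)
        return NULL;
    for (int i = 0; i < n; i++) {
        out[i] = &block[i];
        out[i]->id = i;
        out[i]->value = 0.0;
    }
    return block;
}
```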
advanced optimization options and pragmas
The information in the following sections describes several options for enabling optimization.