Optimizing Itanium-Based Applications (May 2011)

effect that the if-statement is now executed only when the loop is reached, and no longer on every loop

iteration.

loop cloning

Loop cloning uses profile information to special-case loops with variable trip counts. For

example, if a loop iterates from 0 to N, but the profile information indicates that the loop usually

executes with a constant trip count C, it can be beneficial to special-case the loop for C and to check for

this value at runtime to select the proper loop variant. A loop with a known trip count can be scheduled most

effectively by the low level optimizer, which can result in dramatic runtime improvements.
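
As a rough illustration, the effect of loop cloning can be written by hand in C. This is only a sketch of what the transformed code might look like, not the compiler's actual output. The function names and the value of HOT_TRIP_COUNT are hypothetical and stand in for the trip count C that profile data would supply.

```c
#include <stddef.h>

/* Hypothetical hot trip count, standing in for the value C that
   profile information would report. */
#define HOT_TRIP_COUNT 8

/* Original loop: the trip count n is known only at run time. */
void scale_generic(double *a, size_t n, double k)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}

/* Hand-written equivalent of the cloned loop. */
void scale_cloned(double *a, size_t n, double k)
{
    if (n == HOT_TRIP_COUNT) {
        /* Clone: the trip count is a compile-time constant here,
           so the loop is easy to unroll and schedule. */
        for (size_t i = 0; i < HOT_TRIP_COUNT; i++)
            a[i] *= k;
    } else {
        /* Fallback: the original loop for every other trip count. */
        for (size_t i = 0; i < n; i++)
            a[i] *= k;
    }
}
```

Both functions compute the same result for every n; the clone only pays for one extra comparison before the loop.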

loop unrolling

The high level optimizer performs full outer loop unrolling for loops with small trip counts.
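
A minimal sketch of full outer loop unrolling, written by hand in C. The function names and the array dimensions ROWS and COLS are assumptions chosen for illustration; the source does not state what trip count the optimizer treats as small.

```c
#include <stddef.h>

#define ROWS 3   /* small, constant outer trip count (illustrative) */
#define COLS 64

/* Original nest: the outer loop runs exactly ROWS times. */
void sum_rows(double m[ROWS][COLS], double out[COLS])
{
    for (size_t j = 0; j < COLS; j++)
        out[j] = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            out[j] += m[i][j];
}

/* After full outer unrolling: the outer loop is gone and each of
   its three iterations appears as a separate inner loop. */
void sum_rows_unrolled(double m[ROWS][COLS], double out[COLS])
{
    for (size_t j = 0; j < COLS; j++)
        out[j] = 0.0;
    for (size_t j = 0; j < COLS; j++)
        out[j] += m[0][j];
    for (size_t j = 0; j < COLS; j++)
        out[j] += m[1][j];
    for (size_t j = 0; j < COLS; j++)
        out[j] += m[2][j];
}
```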

loop unroll and jam

The loop unroll and jam transformation performs outer loop unrolling and fusion, which increases

opportunities for scalar replacement. This can reduce the number of memory operations, resulting in better

instruction scheduling.
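
The following hand-written C sketch shows unroll and jam on a matrix-vector product. The function names, the row-major layout, and the unroll factor of 2 are assumptions made for the example, and the sketch assumes that the arrays a, x, and y do not overlap.

```c
#include <stddef.h>

/* Original nest: y[i] += sum over j of a[i][j] * x[j].
   The matrix a is stored row-major with m columns. */
void matvec(size_t n, size_t m, const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            y[i] += a[i * m + j] * x[j];
}

/* Outer loop unrolled by 2, with the two inner loops fused (jammed). */
void matvec_unroll_jam(size_t n, size_t m, const double *a,
                       const double *x, double *y)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        /* Scalar temporaries replace the repeated y[i] accesses. */
        double y0 = y[i];
        double y1 = y[i + 1];
        for (size_t j = 0; j < m; j++) {
            double xj = x[j];              /* loaded once, used twice */
            y0 += a[i * m + j] * xj;
            y1 += a[(i + 1) * m + j] * xj;
        }
        y[i] = y0;
        y[i + 1] = y1;
    }
    /* Remainder loop for an odd number of rows. */
    for (; i < n; i++)
        for (size_t j = 0; j < m; j++)
            y[i] += a[i * m + j] * x[j];
}
```

In the jammed inner loop, each x[j] is loaded once for two rows, which is the kind of reuse that scalar replacement can then exploit.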

recognition of memset/memcpy type loops

For loops that essentially copy blocks of data to another memory location, the compiler determines loop

properties, such as the direction of the copy, and then replaces the whole loop with a direct call to a highly

specialized and optimized copy routine.
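
As an illustration, the loops below are the kind of code this recognition targets, shown next to hand-written replacements. The standard memcpy and memset calls are stand-ins: the source does not name the routine the compiler actually calls. The copy example assumes the source and destination do not overlap.

```c
#include <stddef.h>
#include <string.h>

/* Copy loop: element-by-element forward copy. */
void copy_loop(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Equivalent call, valid only when dst and src do not overlap. */
void copy_call(int *dst, const int *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Fill loop: every element is set to zero. */
void clear_loop(int *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = 0;
}

/* Equivalent call: all-zero bytes represent the int value 0. */
void clear_call(int *dst, size_t n)
{
    memset(dst, 0, n * sizeof *dst);
}
```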

loop rerolling

Some user code contains manually unrolled loops. This form of manual unrolling usually comes from

tuning efforts on a particular machine. However, on a different machine, this manually unrolled code may

perform poorly. The compiler tries to identify such unrolled loops and rerolls them by removing the incremental

statements and adjusting the loop boundaries and increment. If such a rerolled loop is then passed through

the loop optimizer, better unrolling decisions can be made, depending on machine characteristics. After

loop rerolling, a loop merging pass is run to merge manually unrolled loops and their remainder loops.
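
A small C sketch of the idea, with hypothetical function names. The first function is a loop manually unrolled by 4 with its remainder loop; the second shows what rerolling followed by merging could reduce it to.

```c
#include <stddef.h>

/* Manually unrolled by 4, followed by a remainder loop. */
void add_unrolled(int *y, const int *x, size_t n)
{
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        y[i]     += x[i];
        y[i + 1] += x[i + 1];   /* incremental statements */
        y[i + 2] += x[i + 2];
        y[i + 3] += x[i + 3];
    }
    for (; i < n; i++)          /* remainder loop */
        y[i] += x[i];
}

/* After rerolling (increment back to 1, incremental statements
   removed) and merging with the remainder loop. */
void add_rerolled(int *y, const int *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += x[i];
}
```

The rerolled form leaves the choice of unroll factor to the loop optimizer for the target machine.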

loop blocking

Loop blocking is a combination of strip mining and interchange that maximizes data locality. It is

provided primarily to deal with nested loops that manipulate arrays that are too large to fit into the cache.

Under certain circumstances, loop blocking allows reuse of these arrays by transforming the loops that

manipulate them so that they manipulate strips of the arrays that fit into the cache. Effectively, a

blocked loop accesses array elements in sections that are optimally sized to fit in the cache.
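
The sketch below applies blocking by hand to a matrix transpose in C. The function names and the tile size BLOCK are hypothetical; a real tile size would be chosen from the cache parameters of the target machine, which the source does not specify.

```c
#include <stddef.h>

#define BLOCK 32   /* illustrative tile size, not a tuned value */

static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Original nest: out is written with stride n, so for large n
   each write touches a different cache line. */
void transpose_naive(size_t n, const double *in, double *out)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            out[j * n + i] = in[i * n + j];
}

/* Blocked nest: both loops are strip mined into tiles of BLOCK
   iterations, and the tile loops are interchanged outward so that
   one BLOCK x BLOCK tile is finished before moving to the next. */
void transpose_blocked(size_t n, const double *in, double *out)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            size_t i_end = min_size(ii + BLOCK, n);
            size_t j_end = min_size(jj + BLOCK, n);
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    out[j * n + i] = in[i * n + j];
        }
    }
}
```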

scalar replacement

The optimizer finds reuses of array locations in a loop and replaces them with uses of scalar temporaries.

These temporaries can be promoted to registers to reduce memory accesses.
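
For illustration, here is a hand-written C sketch of scalar replacement on a loop that reuses an array element across iterations. The function names are hypothetical, and the sketch assumes that in and out do not overlap.

```c
#include <stddef.h>

/* Original: in[i] is loaded in iteration i and loaded again
   as in[i - 1] in iteration i + 1. */
void pair_sums(size_t n, const double *in, double *out)
{
    for (size_t i = 1; i < n; i++)
        out[i] = in[i] + in[i - 1];
}

/* After scalar replacement: the reused element is carried in the
   scalar prev, so each element of in is loaded only once. */
void pair_sums_scalar(size_t n, const double *in, double *out)
{
    if (n == 0)
        return;
    double prev = in[0];
    for (size_t i = 1; i < n; i++) {
        double cur = in[i];
        out[i] = cur + prev;
        prev = cur;
    }
}
```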

loop multiversioning

The loop optimizer can find that some optimizations can be performed on a loop if certain conditions are

met (for example, two array references do not overlap). However, some of these conditions may not be known at

compile time. The optimizer can clone the loop, introduce runtime checks for these conditions, and optimize

the cloned loop more aggressively.
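
The overlap case can be sketched by hand in C as follows. The function names and the helper no_overlap are hypothetical, and the address comparison through uintptr_t is one common way to write such a check, not necessarily the test the compiler emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Original loop: dst and src may overlap, which limits reordering. */
void add_simple(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Runtime check: nonzero when the two n-element ranges are disjoint. */
static int no_overlap(const int *a, const int *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    uintptr_t bytes = (uintptr_t)(n * sizeof(int));
    return pa + bytes <= pb || pb + bytes <= pa;
}

/* Hand-written equivalent of the multiversioned loop. */
void add_multiversion(int *dst, const int *src, size_t n)
{
    if (no_overlap(dst, src, n)) {
        /* Fast version: restrict tells the compiler that the ranges
           are disjoint, so loads and stores can be reordered freely. */
        int *restrict d = dst;
        const int *restrict s = src;
        for (size_t i = 0; i < n; i++)
            d[i] += s[i];
    } else {
        /* Safe version: the original loop, executed in order. */
        for (size_t i = 0; i < n; i++)
            dst[i] += src[i];
    }
}
```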

malloc combining

The optimizer can combine several small block allocations in a loop into a single large block allocation.

This improves locality and reduces the cost of calling the allocation routine.
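
A hand-written C sketch of the effect, using a linked list whose nodes are allocated in a loop. The struct and function names are hypothetical. Note one assumption that matters: in the combined version all nodes share one block, so they must be released with a single free of the first node rather than node by node.

```c
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Original: one small allocation per loop iteration. */
struct node *build_list(size_t n)
{
    struct node *head = NULL;
    struct node **tail = &head;
    for (size_t i = 0; i < n; i++) {
        struct node *p = malloc(sizeof *p);
        if (p == NULL)
            break;              /* return the nodes built so far */
        p->value = (int)i;
        p->next = NULL;
        *tail = p;
        tail = &p->next;
    }
    return head;
}

/* Combined: a single large allocation is carved into n nodes.
   The nodes are contiguous, and malloc is called only once. */
struct node *build_list_combined(size_t n)
{
    if (n == 0)
        return NULL;
    struct node *pool = malloc(n * sizeof *pool);
    if (pool == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        pool[i].value = (int)i;
        pool[i].next = (i + 1 < n) ? &pool[i + 1] : NULL;
    }
    return pool;
}
```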

advanced optimization options and pragmas

The following sections describe several options for enabling optimization.