The compiler interprocedurally propagates information about modified and referenced data items
(mod/ref analysis), which benefits various other compiler analyses and transformations that need
to consider global side effects.
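As a simple illustration (all names are hypothetical), if mod/ref analysis proves that a function
defined in another module never writes a particular global, the caller does not have to reload
that global after the call:

    int limit;                 /* global that other code could potentially modify     */
    void log_value(int v);     /* defined in another module; mod/ref analysis proves
                                  it reads but never writes 'limit'                   */

    int clamp(int a, int b)
    {
        int r = (a > limit) ? limit : a;   /* first load of 'limit'                   */
        log_value(r);                      /* call into the other module              */
        return (b > limit) ? limit : b;    /* the compiler may reuse the value loaded
                                              above instead of reloading 'limit'      */
    }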
The compiler also interprocedurally propagates range information for certain entities.
Function inlining provides the traditional benefits of reduced call overhead, improved locality of
the executing code, and fewer branches. More importantly, inlining exposes additional optimization
opportunities because of the widened scope and enables better instruction scheduling.
The inliner framework has been designed to scale to very large applications, uses a novel and very fast
underlying algorithm, and employs an elaborate set of new heuristics for its inlining decisions.
Note: The inlining engine is also employed at +O2 for intra-module inlining. At this optimization level
the inliner uses tuned-down heuristics to guarantee fast compile times in addition to positive
performance effects.
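The following sketch (hypothetical code) shows how inlining exposes an optimization that is invisible
at the call site alone: once scale() is inlined into double_it(), the 'factor == 0' test folds away
and the call reduces to a single multiply.

    int scale(int x, int factor)      /* possibly defined in another source file */
    {
        if (factor == 0)
            return 0;
        return x * factor;
    }

    int double_it(int x)
    {
        return scale(x, 2);           /* after inlining: 'return x * 2;'         */
    }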
The whole call graph is constructed, enabling indirect call promotion, where an indirect call is
converted to a test and a direct call. Depending on the application characteristics, and in the presence
of PBO data, this can result in significant application speedups (we have observed improvements of
up to 20% for certain applications).
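Conceptually, the transformation looks like the following sketch (names are hypothetical); profile
data identifies fast_path() as the dominant target, so the common case becomes a direct, and
therefore inlinable, call:

    void fast_path(int arg);          /* dominant target according to profile data */
    extern void (*handler)(int);

    void dispatch(int arg)
    {
        /* original code:  handler(arg); */
        if (handler == fast_path)
            fast_path(arg);           /* direct call for the frequent target        */
        else
            handler(arg);             /* indirect call for the remaining targets    */
    }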
Dead variable removal allows the high level optimizer to reduce the total memory requirements of the
application by removing global and static variables that are never referenced.
Recognition of global, static and local variables that are assigned but never used allows the optimizer
to remove dead code (which may result in additional dead variables).
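A small hypothetical example of both cases: 'trace_buf' is never referenced and can be removed
outright, while 'last_error' is assigned but never read, so the store is dead code and the variable
subsequently becomes removable as well.

    static char trace_buf[4096];      /* never referenced: variable removed          */
    static int  last_error;           /* assigned below but never read               */

    int parse(const char *s)
    {
        if (s == 0) {
            last_error = -1;          /* dead store; removing it leaves 'last_error'
                                         as an additional dead variable              */
            return -1;
        }
        return 0;
    }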
Global variables that are referenced only within a single module can be converted by the high-level
optimizer into private symbols, guaranteeing that they can be accessed only from within that module.
This gives the low-level optimizer greater freedom in optimizing references to such variables.
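For example (hypothetical code), if whole-program analysis shows that no other module references
'counter', the optimizer can treat the variable as if it had been declared static, giving the
low-level optimizer full knowledge of every access:

    int counter;                          /* globally visible as written             */

    void bump(void)      { counter++; }
    int  get_count(void) { return counter; }

    /* effectively treated as:  static int counter;  once privatized                 */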
Dead function removal (functions that are never called) and redundant function removal (for example,
duplicate template instantiations) help to reduce compile time and improve the effectiveness of cross-
module inlining by reducing the working set. Additionally, as the application's total code size shrinks,
it incurs fewer cache and page misses (resulting in potentially higher performance).
Short data optimizations. Global and static data allocated in the short data area can be accessed with a
more efficient access sequence. In whole-program mode (-ipo) the compiler can perform precise
analysis to determine whether all global and static data fits into the short data area and, if so,
allocate it there. If the data does not fit, the compiler can determine the best safe short data size
threshold, enabling the maximum number of data items to be addressed efficiently.
Note: This is an IPO advantage. At other optimization levels the same optimization can be enabled
with the +Oshortdata option; with -ipo the compiler derives an optimal short data threshold.
For calls to external functions (functions that do not reside in the same binary) the linker introduces
a small import stub. If the compiler knows that a call targets an external function, it can inline the
import stub, resulting in better performance.
The HP compilers support a mechanism for annotating function prototypes with a pragma
(#pragma extern) that marks those functions as external, enabling import stub inlining.
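A minimal sketch of such an annotation follows; the function name is hypothetical and the exact
pragma syntax should be confirmed against the compiler documentation:

    extern int read_config(const char *path);   /* resides in another load module    */
    #pragma extern read_config

    int init(void)
    {
        return read_config("app.conf");          /* import stub can be inlined here  */
    }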
With -ipo in whole-program mode this annotation is no longer necessary: the compiler knows
which functions are defined by the application and which are external, and automatically marks
functions appropriately.
The compiler performs interprocedural data layout optimizations, in particular structure splitting, array
splitting, and dead field removal. If the compiler is able to determine that a given record type can be
modified safely, and if heuristics additionally indicate that such type modifications are beneficial, the
compiler may break a record type into a hot part and a cold part with the goal of reducing cache-miss
and TLB penalties.
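The following before/after sketch (field names are hypothetical) illustrates the idea: the frequently
accessed fields stay together so more of them fit per cache line, while the rarely used fields are
moved behind a pointer and touched only on the cold path.

    /* original layout */
    struct order {
        long   id;             /* hot  */
        double amount;         /* hot  */
        char   comment[256];   /* cold */
        char   audit[128];     /* cold */
    };

    /* layout after structure splitting */
    struct order_cold {
        char comment[256];
        char audit[128];
    };

    struct order_hot {
        long   id;
        double amount;
        struct order_cold *cold;   /* cold part allocated separately */
    };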