The compiler interprocedurally propagates information about modified and referenced data items
(mod/ref analysis), which benefits various other compiler analyses and transformations that need
to consider global side effects.
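As a simple illustration (all names are hypothetical), if mod/ref analysis proves that a function
defined in another module never writes a particular global, the caller does not have to reload
that global after the call:

    int limit;                 /* global that other code could potentially modify     */
    void log_value(int v);     /* defined in another module; mod/ref analysis proves
                                  it reads but never writes 'limit'                   */

    int clamp(int a, int b)
    {
        int r = (a > limit) ? limit : a;   /* first load of 'limit'                   */
        log_value(r);                      /* call into the other module              */
        return (b > limit) ? limit : b;    /* the compiler may reuse the value loaded
                                              above instead of reloading 'limit'      */
    }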
The compiler also interprocedurally propagates range information for certain entities.
Function inlining provides the traditional benefits of reduced call overhead, improved locality of
the executing code, and fewer branches. More importantly, inlining exposes additional optimization
opportunities because of the widened scope and enables better instruction scheduling.
The inliner framework has been designed to scale to very large applications, uses a novel and very fast
underlying algorithm, and employs an elaborate set of new heuristics for its inlining decisions.
Note: The inlining engine is also employed at +O2 for intra-module inlining. At this optimization level
the inliner uses tuned-down heuristics to guarantee fast compile times in addition to positive
performance effects.
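The following sketch (hypothetical code) shows how inlining exposes an optimization that is invisible
at the call site alone: once scale() is inlined into double_it(), the 'factor == 0' test folds away
and the call reduces to a single multiply.

    int scale(int x, int factor)      /* possibly defined in another source file */
    {
        if (factor == 0)
            return 0;
        return x * factor;
    }

    int double_it(int x)
    {
        return scale(x, 2);           /* after inlining: 'return x * 2;'         */
    }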
The whole call graph is constructed, enabling indirect call promotion, where an indirect call is
converted to a test and a direct call. Depending on the application characteristics, and in the presence
of PBO data, this can result in significant application speedups (we have observed improvements of
up to 20% for certain applications).
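Conceptually, the transformation looks like the following sketch (names are hypothetical); profile
data identifies fast_path() as the dominant target, so the common case becomes a direct, and
therefore inlinable, call:

    void fast_path(int arg);          /* dominant target according to profile data */
    extern void (*handler)(int);

    void dispatch(int arg)
    {
        /* original code:  handler(arg); */
        if (handler == fast_path)
            fast_path(arg);           /* direct call for the frequent target        */
        else
            handler(arg);             /* indirect call for the remaining targets    */
    }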
Dead variable removal allows the high level optimizer to reduce the total memory requirements of the
application by removing global and static variables that are never referenced.
Recognition of global, static and local variables that are assigned but never used allows the optimizer
to remove dead code (which may result in additional dead variables).
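A small hypothetical example of both cases: 'trace_buf' is never referenced and can be removed
outright, while 'last_error' is assigned but never read, so the store is dead code and the variable
subsequently becomes removable as well.

    static char trace_buf[4096];      /* never referenced: variable removed          */
    static int  last_error;           /* assigned below but never read               */

    int parse(const char *s)
    {
        if (s == 0) {
            last_error = -1;          /* dead store; removing it leaves 'last_error'
                                         as an additional dead variable              */
            return -1;
        }
        return 0;
    }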
Global variables that are referenced only within a single module can be converted by the high-level
optimizer into private symbols, guaranteeing that they can be accessed only from within that module.
This gives the low-level optimizer greater freedom in optimizing references to such variables.
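For example (hypothetical code), if whole-program analysis shows that no other module references
'counter', the optimizer can treat the variable as if it had been declared static, giving the
low-level optimizer full knowledge of every access:

    int counter;                          /* globally visible as written             */

    void bump(void)      { counter++; }
    int  get_count(void) { return counter; }

    /* effectively treated as:  static int counter;  once privatized                 */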
Dead function removal (functions that are never called) and redundant function removal (for example,
duplicate template instantiations) help to reduce compile time and improve the effectiveness of cross-
module inlining by reducing the working set. Additionally, as the application's total code size shrinks,
it incurs fewer cache and page misses (resulting in potentially higher performance).
Short data optimizations. Global and static data allocated in the short data area can be accessed with a
more efficient access sequence. In whole-program mode (-ipo) the compiler can perform precise
analysis to determine whether all global and static data fits into the short data area and, if so,
allocate it there. If the data does not fit, the compiler can determine the best safe short data size
threshold, enabling the maximum number of data items to be addressed efficiently.
Note: This is an IPO advantage. At other optimization levels the same optimization can be enabled
with the +Oshortdata option; with -ipo the compiler derives an optimal short data threshold.
For calls to external functions (functions that do not reside in the same binary) the linker introduces
a small import stub. If the compiler knows that a call targets an external function, it can inline the
import stub, resulting in better performance.
The HP compilers support a mechanism for annotating function prototypes with a pragma
(#pragma extern) that marks those functions as external, enabling import stub inlining.
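A minimal sketch of such an annotation follows; the function name is hypothetical and the exact
pragma syntax should be confirmed against the compiler documentation:

    extern int read_config(const char *path);   /* resides in another load module    */
    #pragma extern read_config

    int init(void)
    {
        return read_config("app.conf");          /* import stub can be inlined here  */
    }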
With -ipo in whole-program mode this annotation is no longer necessary: the compiler knows
which functions are defined by the application and which are external, and automatically marks
functions appropriately.
The compiler performs interprocedural data layout optimizations, in particular structure splitting, array
splitting, and dead field removal. If the compiler is able to determine that a given record type can be
modified safely, and if heuristics additionally indicate that such type modifications are beneficial, the
compiler may break a record type into a hot part and a cold part with the goal of reducing cache-miss
and TLB penalties.
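The following before/after sketch (field names are hypothetical) illustrates the idea: the frequently
accessed fields stay together so more of them fit per cache line, while the rarely used fields are
moved behind a pointer and touched only on the cold path.

    /* original layout */
    struct order {
        long   id;             /* hot  */
        double amount;         /* hot  */
        char   comment[256];   /* cold */
        char   audit[128];     /* cold */
    };

    /* layout after structure splitting */
    struct order_cold {
        char comment[256];
        char audit[128];
    };

    struct order_hot {
        long   id;
        double amount;
        struct order_cold *cold;   /* cold part allocated separately */
    };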