Optimizing Itanium-Based Applications (May 2011)


Version 1.11

May 16, 2011


PAGE 2
Table of Contents

introduction, 3
six levels of optimization, 3
level zero, 3
level one .
PAGE 3
introduction

The HP Itanium-based optimizer transforms code so that it runs more efficiently on Itanium-based HP-UX systems. The optimizer can dramatically improve application performance. In addition, compile time and memory resources increase with each higher level of optimization due to the increasingly complex analysis that is performed.
PAGE 4
● Debugging correctness of code is maintained. Breakpoints behave as expected and variables have expected values at breakpoints. See Section 14.27 (Debugging optimized code) in Debugging with GDB [2] for more information on this topic.

level two

+O2 or -O
description:
● Performs Level 1 optimizations, plus optimizations performed over entire functions.
● Performs intra-module inlining with tuned-down heuristics to guarantee fast compile times in addition to potential performance gains.
PAGE 5
● Performs Level 2 optimizations, plus optimizations across the entire application program.
● Performs interprocedural optimizations (IPO) at link time, including improved range propagation and alias analysis, cross module inlining, interprocedural data prefetching, dead variable and dead function removal, variable privatization, short data optimization, data layout optimization, constant propagation, and import stub inlining.
PAGE 6
● Better alias information and inlining improve and enable additional loop transformations.

interprocedural optimizations with -ipo

The HP high level optimizer contains an interprocedural optimizer, a high level loop optimizer, and a scalar optimizer. The interprocedural optimizer is enabled with the option -ipo at optimization levels two or higher (e.g. +O2 -ipo). Optimization level four (option +O4) implies -ipo.
PAGE 7
● The compiler interprocedurally propagates information about modified and referenced data items (mod/ref analysis), which can benefit various other compiler analyses and transformations which need to consider global side effects.
● The compiler also interprocedurally propagates range information for certain entities.
PAGE 8
Currently, this optimization is limited to a very restricted set of scenarios. Use +Oinfo to determine whether this optimization has been performed.
● The compiler can also perform non-contiguous array fusion. For some multi-dimensional, non-contiguous, pointer-based arrays, the compiler will modify the declaration, allocation, and uses of such arrays to instead use a contiguous memory layout.
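To make the two layouts concrete, the following is a minimal hand-written sketch of what non-contiguous array fusion changes. It is not the compiler's actual transformation; the function names (alloc_rows, alloc_fused, and the matching free functions) are hypothetical and exist only for illustration.

```c
#include <stdlib.h>

/* Non-contiguous layout: one allocation for the row-pointer table and a
   separate allocation for every row. Rows may land anywhere in the heap. */
double **alloc_rows(size_t rows, size_t cols)
{
    if (rows == 0 || cols == 0)
        return NULL;
    double **m = malloc(rows * sizeof *m);
    if (m == NULL)
        return NULL;
    for (size_t i = 0; i < rows; i++) {
        m[i] = calloc(cols, sizeof **m);
        if (m[i] == NULL) {
            /* Undo the rows allocated so far. */
            while (i > 0)
                free(m[--i]);
            free(m);
            return NULL;
        }
    }
    return m;
}

void free_rows(double **m, size_t rows)
{
    if (m == NULL)
        return;
    for (size_t i = 0; i < rows; i++)
        free(m[i]);
    free(m);
}

/* Fused layout: the same m[i][j] access syntax, but all elements live in
   one contiguous block, so row i starts exactly cols elements after row i-1. */
double **alloc_fused(size_t rows, size_t cols)
{
    if (rows == 0 || cols == 0)
        return NULL;
    double **m = malloc(rows * sizeof *m);
    double *data = calloc(rows * cols, sizeof *data);
    if (m == NULL || data == NULL) {
        free(m);
        free(data);
        return NULL;
    }
    for (size_t i = 0; i < rows; i++)
        m[i] = data + i * cols;
    return m;
}

void free_fused(double **m)
{
    if (m == NULL)
        return;
    free(m[0]);   /* m[0] points at the start of the single data block */
    free(m);
}
```

Callers index both variants as m[i][j]; only the fused variant guarantees that consecutive rows are adjacent in memory, which is the property the text describes.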
PAGE 9
effect that the if-statement is now executed only when the loop is reached, and no longer on every loop iteration.

loop cloning

Loop cloning seeks to special-case loops with variable trip counts with the help of profile information. For example, if a loop iterates from 0 to N, but the profile information indicates that the loop usually executes with a constant trip count C, it can be beneficial to special-case the loop for C and to check for this value at run time to select the proper loop variant.
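The loop cloning idea can be sketched by hand as follows. This is only an illustration of the shape of the transformed code, not the compiler's output; HOT_TRIP_COUNT stands in for the constant C from the text, and its value of 8 as well as the function names are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical trip count that profile data reports as dominant
   (the constant C in the text). */
#define HOT_TRIP_COUNT 8

/* Original loop: the trip count n is only known at run time. */
long sum_generic(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Cloned form: a run-time check selects between a clone whose trip count
   is a compile-time constant (and can therefore be fully unrolled or
   scheduled) and the general loop. */
long sum_cloned(const int *a, size_t n)
{
    long s = 0;
    if (n == HOT_TRIP_COUNT) {
        for (size_t i = 0; i < HOT_TRIP_COUNT; i++)
            s += a[i];
    } else {
        for (size_t i = 0; i < n; i++)
            s += a[i];
    }
    return s;
}
```

Both functions return the same result for every n; the cloned version only pays one extra comparison per call in exchange for a specialized fast path.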
PAGE 10
enabling aggressive optimizations

+Ofast or -fast (-fast is not supported by Fortran)
description: [Alias for +O2 +Onolimit +Ofltacc=relaxed +FPD +DSnative +Wl,+pi,1M +Wl,+pd,1M -Wl,+mergeseg]
Enables aggressive optimizations at +O2. This option is safe for the vast majority of applications, but can result in higher compile time or, for codes with strict FP accuracy needs, incorrect output.
PAGE 11
Users can remove optimization time restrictions at +O2 and above by using the +Onolimit or +Olimit=none option. This allows full optimization of large procedures, but can incur significant compile time increases for very large procedures, especially those with large sequences of straight-line code. If you are willing to tolerate longer compile times, +Onolimit can result in significant performance improvements.
PAGE 12
On Itanium, the benefit of forming these contractions can be significant. Contractions can be enabled and disabled in different blocks of code using the FP_CONTRACT pragma. FP_CONTRACT OFF overrides any prior pragma or +Ofltacc=strict option. FP_CONTRACT ON has no effect other than undoing a prior FP_CONTRACT OFF, and is overridden by +Ofltacc=strict. +Ofltacc=limited enables a small number of other value-changing optimizations in addition to the contractions.
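As a sketch of block-level control over contractions, the example below uses the C99 spelling of the pragma, #pragma STDC FP_CONTRACT. That this exact spelling is the one accepted by the HP compilers is an assumption here, and compilers that do not implement the pragma may ignore it with a warning. The function names are hypothetical.

```c
/* Contraction permitted (subject to the compiler's default or the
   +Ofltacc setting): a * x + y may be evaluated as one fused
   multiply-add, which rounds only once. */
double axpy_default(double a, double x, double y)
{
    return a * x + y;
}

/* Contraction disabled for this block: the product is rounded to double
   before the addition, so the result is rounded twice. The pragma must
   appear before any declarations or statements in the block. */
double axpy_strict(double a, double x, double y)
{
#pragma STDC FP_CONTRACT OFF
    return a * x + y;
}
```

For inputs whose product and sum are exactly representable, both functions return the same value; they can differ in the last bit only when the intermediate product needs rounding.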
PAGE 13
+O[no]libmerrno (default +Onolibmerrno, except with C’s -Aa, c89, or -AC89 the default is +Olibmerrno)
Enables support for errno in libm functions. Different, less optimal versions of libm functions are invoked under +Olibmerrno. Additionally, the optimizer is prohibited from performing optimizations of these calls (such as coalescing calls to the same libm function with identical inputs) because they are no longer side-effect-free. Under C’s -Aa, c89, or -AC89, the default becomes +Olibmerrno.
PAGE 14
You can use this option or pragma to obtain the most optimized access sequences for data and code symbols. Symbols with the given name(s) are specified as having protected export class. If no symbols are given, then all symbols, including those referenced but not defined in the translation unit, are specified as having protected export class. This means that these symbols are not preempted and can be optimized as such. For example, the compiler can bypass the linkage table for both code and data references.
PAGE 15
#pragma hidden symbol[,symbol]
#pragma binding hidden
The symbols with the given name or names are specified as having hidden export class. If no symbols are given with -Bhidden, all symbols, including those referenced but not defined in the translation unit, are specified as having hidden export class. The #pragma binding hidden applies to all globally-scoped symbols following the pragma, prior to the next #pragma binding.
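A minimal sketch of the per-symbol form follows. The pragma is written exactly as in the syntax line above; its placement after the declaration is an assumption, and the function names (scale_internal, scale_api) are hypothetical. Compilers other than HP's are expected to ignore the unknown pragma, possibly with a warning.

```c
/* Internal helper: not meant to be visible outside this load module. */
int scale_internal(int x);

/* HP-specific pragma, spelled as in the syntax line above. Placing it
   after the declaration is an assumption of this sketch. */
#pragma hidden scale_internal

int scale_internal(int x)
{
    return 3 * x;
}

/* Public entry point: keeps the default export class and is the only
   symbol other load modules are intended to call. */
int scale_api(int x)
{
    return scale_internal(x) + 1;
}
```

The behavior of the program is unchanged; only the export class of scale_internal differs, which is what lets the compiler use a more direct access sequence for calls to it.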
PAGE 16
-exec
Asserts that code is being compiled for an executable. Similar to -Bprotected_def, all locally defined symbols are marked as having protected export class. Additionally, accesses to symbols known to be defined in the executable can be materialized with absolute addressing, rather than linkage table accesses.

-minshared
Equivalent to -Bprotected -exec.
PAGE 17
With profile data, the compiler may also insert stride prefetches for linked-list traversals that have regular runtime address strides. Consider the following source code example:

    for (p = ptr; p != 0; p = p->next)
        x += p->data;

Normally, the compiler cannot insert prefetches for later iterations of the loop without dereferencing successive values of the next field. However, profile data may indicate that the values of the p pointer have a regular address stride in virtual memory.
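The effect of a stride prefetch on the loop above can be approximated by hand as in the sketch below. This is not the code the HP compiler generates: it uses the GCC/Clang builtin __builtin_prefetch as a stand-in for the Itanium prefetch instruction, and NODE_STRIDE and PREFETCH_AHEAD are hypothetical values that profile data would normally supply.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
    struct node *next;
    long data;
};

/* The loop from the text, unchanged. */
long sum_list(const struct node *ptr)
{
    long x = 0;
    for (const struct node *p = ptr; p != 0; p = p->next)
        x += p->data;
    return x;
}

/* Hypothetical profile results: nodes sit one struct apart in memory,
   and prefetching eight nodes ahead hides the memory latency. */
#define NODE_STRIDE    ((uintptr_t)sizeof(struct node))
#define PREFETCH_AHEAD 8

long sum_list_prefetch(const struct node *ptr)
{
    long x = 0;
    for (const struct node *p = ptr; p != 0; p = p->next) {
#if defined(__GNUC__)
        /* Guess the address of a later node from the stride instead of
           chasing next pointers. A prefetch is only a hint, so a wrong
           guess costs bandwidth but does not fault. */
        __builtin_prefetch(
            (const void *)((uintptr_t)p + NODE_STRIDE * PREFETCH_AHEAD));
#endif
        x += p->data;
    }
    return x;
}
```

The prefetching version computes the same sum; the guessed address is never dereferenced by the program itself, so correctness does not depend on the stride assumption holding.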
PAGE 18
+inline_level n
Fine tunes the aggressiveness of the inliner. The value of n can be in the range 0.0-9.0 in 0.1 increments. The following values/ranges have special meaning:
● 0.0: No inlining is done (same as +d).
● 1.0: Only functions marked with the inline keyword or implied by the language to be inline are considered for inlining.
● 1.0 < n < 2.0: Makes the inliner increasingly more aggressive, below the default level.
● 2.0: Default level of inlining for +O2, +O3, +O4.
● 2.0 < n < 9.
PAGE 19
With +Onoparmsoverlap, the optimizer assumes that subprogram arguments do not refer to overlapping memory locations. This allows more aggressive optimization and scheduling of pointer-intensive code.

+O[no]parminit (default +Onoparminit)
Not supported for Fortran. When enabled, the optimizer inserts instructions to initialize to zero any unspecified function parameters at call sites. This avoids NaT values in parameter registers.
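To illustrate the no-overlap assumption behind +Onoparmsoverlap, the sketch below uses the C99 restrict qualifier, which expresses a similar promise for individual parameters in the source rather than for a whole compilation. It is a different mechanism from the option, shown here only to make the assumption visible; the function names are hypothetical.

```c
#include <stddef.h>

/* Without extra information, the compiler must assume that dst and src
   may overlap, so a store to dst[i] could change a later src[j]. That
   limits how far loads can be hoisted above stores. */
void add_arrays(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* restrict promises, for this function only, that dst and src do not
   overlap. Loads from src can then be scheduled freely around the
   stores to dst. Calling this with overlapping arrays is undefined. */
void add_arrays_noalias(double *restrict dst, const double *restrict src,
                        size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}
```

For non-overlapping arguments both functions produce identical results. A call such as add_arrays(a + 1, a, n) is valid only for the first form, which is the kind of call site to look for before asserting that parameters never overlap.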
PAGE 20
When the +Oautopar option is used at optimization levels +O3 and above, the compiler will automatically parallelize those loops which are deemed safe and profitable by the loop transformer. This optimization allows the compiled program to take advantage of more than one processor (or core) when executing loops determined to be parallelizable.
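As a rough illustration of what "safe" can mean for automatic parallelization, the sketch below contrasts a loop with independent iterations against one with a loop-carried dependence. Whether the HP loop transformer actually parallelizes either loop depends on its own safety and profitability analysis, which this example does not attempt to reproduce; the function names are hypothetical.

```c
#include <stddef.h>

/* Candidate for parallelization: each iteration writes a different
   element of dst and reads nothing that another iteration writes, so
   iterations can run on different processors in any order. */
void scale(double *restrict dst, const double *restrict src, size_t n,
           double k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* Not a candidate as written: iteration i reads a[i - 1], which
   iteration i - 1 has just written, so the iterations must run in
   order. */
void prefix_sum(double *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```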
PAGE 21
cc -o sample.exe +Oprofile=use -O sample.o
Link for optimization.

The +Oprofile=use option is supported at optimization level 2 (-O or +O2) and above.

Note: Profile-based optimization has a greater impact on application performance at each higher level of optimization. Profile-based optimization should be enabled during the final stages of application development. To obtain the best performance, re-profile and re-optimize your application after making source code changes.
PAGE 22
% ./program.exe < A.input
% mv flow.data A.flow
% ./program.exe < B.input
% mv flow.data B.flow
% /opt/langtools/bin/fdm A.flow B.flow -o /tmp/program.flow

The two sequences above (implicit and explicit) will result in the same final profile, modulo sampling effects.

locking of profile database files

When an instrumented application completes execution and begins writing to the “flow.
PAGE 23
compiler-generated performance advice

The compiler will emit performance-related advice when +wperfadvice[=1|2|3|4] is specified (+wperfadvice is equivalent to +wperfadvice=2). Level 1 emits the fewest messages, and those are the easiest to act on. Higher levels emit more suggestions, and those emitted by levels 3 and 4 may require extensive or complicated source code changes to achieve performance benefits.
PAGE 24
Index

A
access sequences, optimized, 14
aggressive optimization
  safety of, 10
aggressive optimization, enabling, 10
aggressively schedule code, 10
archive library, 14

C
compilation time limits, removing, 10
controlling optimization, 3
cross-region addressing, enabling/disabling, 18
interprocedural optimizations, 6
ipo.
PAGE 25
References

[1] HP Compilers for HP Integrity Servers, http://h21007.www2.hp.com/portal/download/files/unprot/Itanium/CompilersTechOverview.pdf, 2011.
[2] R. Stallman, R. Pesch, S. Shebs, et al., Debugging with GDB, HP 18th Edition, http://h21007.www2.hp.com/portal/download/files/unprot/devresource/Tools/wdb/doc/gdb60.pdf, Sep 2008.
[3] David Gross, Library Providers’ Guide to Symbol Binding, http://h21007.www2.hp.com/portal/download/files/unprot/Itanium/Lib-prov-guide.