
-O3 (level-3) specifies aggressive global optimization. This level performs all level-1 and level-2 optimizations and enables more aggressive hoisting and scalar replacement optimizations that may or may not be profitable.
-O4 (level-4) performs all level-1, level-2, and level-3 optimizations and enables hoisting of guarded invariant floating-point expressions.
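As a minimal illustration of selecting an optimization level on the command line (the PGI drivers pgcc and pgf95 are shown; the source and output file names are only placeholders):

    pgcc  -O3 -c sort.c
    pgf95 -O4 -o myapp myapp.f90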
Loop Optimization using -Munroll, -Mvect, and -Mconcur. Loop performance may be
improved through vectorization or unrolling options, and, on systems with multiple processors, by
using parallelization options.
-Munroll unrolls loops. Executing multiple instances of the loop body during each iteration reduces branch overhead and improves execution speed by creating better opportunities for instruction scheduling. The -Munroll sub-options c:number and n:number, together with the -Mnounroll option, control whether and how loops are unrolled.
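A sketch of controlling unrolling explicitly (the unroll counts and file name are arbitrary placeholders; on PGI compilers of this vintage, c: roughly bounds complete unrolling of constant-count loops and n: sets the unroll factor for other loops):

    pgcc -O2 -Munroll=c:8,n:4 -c loops.c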
The -Mvect option triggers the vectorizer to scan code for loops that are candidates for high-level transformations such as loop distribution, loop interchange, cache tiling, and idiom recognition (replacement of a recognizable code sequence, such as a reduction loop, with optimized code sequences or function calls). The vectorizer transformations can be controlled by arguments to the -Mvect option. By default, -Mvect without sub-options is equivalent to -Mvect=assoc,cachesize:262144. Vectorization sub-options are assoc, cachesize:number, sse, and prefetch.
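For example, the defaults can be overridden with explicit sub-options; the particular sub-options chosen and the file name below are illustrative:

    pgf95 -O2 -Mvect=sse,prefetch -c solver.f90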
The -Mconcur option instructs the compiler to scan code for loops that are candidates for auto-parallelization. -Mconcur must be specified at both compile time and link time. The parallelizer performs various operations that are controlled by arguments to the -Mconcur option. By default, -Mconcur without sub-options is equivalent to -Mconcur=dist:block. Auto-parallelization sub-options are altcode:number, dist:block, dist:cycle, cncall, noassoc, and innermost.
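Because -Mconcur is required at both compile time and link time, a build might look like the following sketch (file names are placeholders; at run time, PGI releases of this era typically take the thread count from the NCPUS environment variable):

    pgcc -O2 -Mconcur=dist:block -c work.c
    pgcc -Mconcur -o work work.o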
Interprocedural Analysis and Optimization using -Mipa. Interprocedural analysis (IPA) can improve performance for many programs. To compile programs with IPA, use an aggregate sub-option such as -Mipa=fast. Refer to the PGI Compiler User's Guide for available sub-options.
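Like -Mconcur, -Mipa is typically applied at both compile time and link time; a minimal sketch with placeholder file names:

    pgcc -Mipa=fast -c a.c b.c
    pgcc -Mipa=fast -o app a.o b.o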
Function Inlining using -Minline. Inlining allows a call to a function or subroutine to be replaced by a copy of the body of that function or subroutine. Several -Minline sub-options determine the selection criteria for functions to be inlined. Available sub-options are except:func, name:func, size:number, levels:number, and lib:filename.ext. Note that in C++ releases prior to 6.2, function inlining does not occur unless the -Minline switch is used. Beginning with release 6.2, inlining occurs automatically for C++ functions specified by means of the inline keyword or for methods defined in the body of the class. Also, if C++ exceptions are not used, the --no_exceptions flag improves performance.
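As a sketch of restricting inlining to particular functions (the function name, size threshold, and file name are placeholders):

    pgcc -O2 -Minline=name:saxpy,size:50 -c kernels.c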
3.1.4 Linking with ACML
Due to the strategic importance of the AMD multi-core processor architecture, libraries are in place to
assist developers in porting software to AMD processors. AMD Core Math Library (ACML) is
designed to “squeeze” the greatest possible performance from AMD multi-core platforms and is
integrated in all PGI Toolkits. As the number of cores increases over time, future processor