
-O3 (level-3) specifies aggressive global optimization. This level performs all level-1 and level-2 optimizations and enables more aggressive hoisting and scalar replacement optimizations that may or may not be profitable.
-O4 (level-4) performs all level-1, level-2, and level-3 optimizations and enables hoisting of guarded invariant floating-point expressions.
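As a minimal illustration of selecting an optimization level on the command line (the PGI drivers pgcc and pgf95 are shown; the source and output file names are only placeholders):

    pgcc  -O3 -c sort.c
    pgf95 -O4 -o myapp myapp.f90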
Loop Optimization using -Munroll, -Mvect, and -Mconcur. Loop performance may be
improved through vectorization or unrolling options, and, on systems with multiple processors, by
using parallelization options.
-Munroll unrolls loops. Executing multiple instances of the loop body during each iteration reduces branch overhead and improves execution speed by creating better opportunities for instruction scheduling. The -Munroll sub-options c:number and n:number, together with the -Mnounroll option, control whether and how loops are unrolled.
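A sketch of controlling unrolling explicitly (the unroll counts and file name are arbitrary placeholders; on PGI compilers of this vintage, c: roughly bounds complete unrolling of constant-count loops and n: sets the unroll factor for other loops):

    pgcc -O2 -Munroll=c:8,n:4 -c loops.c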
The -Mvect option triggers the vectorizer to scan code for loops that are candidates for high-level transformations such as loop distribution, loop interchange, cache tiling, and idiom recognition (replacement of a recognizable code sequence, such as a reduction loop, with optimized code sequences or function calls). The vectorizer transformations can be controlled by arguments to the -Mvect option. By default, -Mvect without sub-options is equivalent to -Mvect=assoc,cachesize:262144. Vectorization sub-options are assoc, cachesize:number, sse, and prefetch.
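For example, the defaults can be overridden with explicit sub-options; the particular sub-options chosen and the file name below are illustrative:

    pgf95 -O2 -Mvect=sse,prefetch -c solver.f90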
The -Mconcur option instructs the compiler to scan code for loops that are candidates for auto-parallelization. -Mconcur must be specified at both compile time and link time. The parallelizer performs various operations that are controlled by arguments to the -Mconcur option. By default, -Mconcur without sub-options is equivalent to -Mconcur=dist:block. Auto-parallelization sub-options are altcode:number, dist:block, dist:cycle, cncall, noassoc, and innermost.
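Because -Mconcur is required at both compile time and link time, a build might look like the following sketch (file names are placeholders; at run time, PGI releases of this era typically take the thread count from the NCPUS environment variable):

    pgcc -O2 -Mconcur=dist:block -c work.c
    pgcc -Mconcur -o work work.o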
Interprocedural Analysis and Optimization using -Mipa. Interprocedural analysis (IPA) can improve performance for many programs. To compile programs with IPA, use an aggregate sub-option such as -Mipa=fast. Refer to the PGI Compiler User's Guide for available sub-options.
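Like -Mconcur, -Mipa is typically applied at both compile time and link time; a minimal sketch with placeholder file names:

    pgcc -Mipa=fast -c a.c b.c
    pgcc -Mipa=fast -o app a.o b.o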
Function Inlining using -Minline. Inlining allows a call to a function or subroutine to be replaced by a copy of the body of that function or subroutine. Several -Minline sub-options determine the selection criteria for functions to be inlined. Available sub-options are except:func, name:func, size:number, levels:number, and lib:filename.ext. Note that in C++ releases prior to 6.2, function inlining does not occur unless the -Minline switch is used. Beginning with release 6.2, inlining occurs automatically for C++ functions specified by means of the inline keyword or for methods defined in the body of the class. Also, if C++ exceptions are not used, the --no_exceptions flag improves performance.
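As a sketch of restricting inlining to particular functions (the function name, size threshold, and file name are placeholders):

    pgcc -O2 -Minline=name:saxpy,size:50 -c kernels.c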
3.1.4 Linking with ACML
Due to the strategic importance of the AMD multi-core processor architecture, libraries are in place to
assist developers in porting software to AMD processors. AMD Core Math Library (ACML) is
designed to “squeeze” the greatest possible performance from AMD multi-core platforms and is
integrated in all PGI Toolkits. As the number of cores increases over time, future processor