Optimizing Itanium-Based Applications (May 2011)

With profile data, the compiler may also insert stride prefetches for linked-list traversals that have regular runtime address strides. Consider the following source code example:

for (p = ptr; p != 0; p = p->next)
    x += p->data;

Normally, the compiler cannot insert prefetches for later iterations of the loop without dereferencing successive values of the next field. However, profile data may indicate that the values of the p pointer have a regular address stride in virtual memory. For example, if the values of p on successive iterations are {8, 16, 24, 32, …}, then p has a regular stride of 8 bytes. The compiler can then use this stride to prefetch the data for later iterations, where PF is the prefetch distance in iterations:

for (p = ptr; p != 0; p = p->next) {
    x += p->data;
    lfetch p + PF*8;
}
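The transformation above can be approximated in portable C as a rough sketch. It assumes a GCC- or Clang-style __builtin_prefetch as a stand-in for the lfetch instruction. The node layout, the function name, and the PF value of 4 are hypothetical choices for illustration, not what the compiler actually emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node layout; only the next and data fields come from the text. */
struct node {
    struct node *next;
    long data;
};

enum { PF = 4 };  /* assumed prefetch distance, in iterations */

/* Sum a list whose nodes are assumed to lie stride bytes apart in memory. */
long sum_with_stride_prefetch(const struct node *ptr, size_t stride)
{
    long x = 0;
    for (const struct node *p = ptr; p != 0; p = p->next) {
        x += p->data;
        /* Integer arithmetic avoids forming an out-of-bounds pointer in C.
           The prefetch is only a hint and does not fault on a bad address. */
        __builtin_prefetch((const void *)((uintptr_t)p + PF * stride));
    }
    return x;
}
```

If the assumed stride is wrong for some nodes, the result is still correct; only the benefit of the prefetch is lost.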

In some cases, profile data may indicate that there are multiple dominant strides across the program’s execution. In that case, the compiler may insert a prefetch that computes the stride at run time, so that the stride used in the current iteration’s prefetch is the difference between the pointer values of the last two successive iterations.
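A minimal sketch of the runtime-stride idea in C follows. It again assumes __builtin_prefetch (GCC or Clang) in place of lfetch. The struct, the function name, and the PF_DIST value are hypothetical, and the code the compiler actually generates may differ.

```c
#include <stdint.h>

/* Hypothetical node layout; only the next and data fields come from the text. */
struct lnode {
    struct lnode *next;
    long data;
};

enum { PF_DIST = 4 };  /* assumed prefetch distance, in iterations */

long sum_with_runtime_stride(const struct lnode *ptr)
{
    long x = 0;
    const struct lnode *prev = 0;
    for (const struct lnode *p = ptr; p != 0; prev = p, p = p->next) {
        x += p->data;
        if (prev != 0) {
            /* Stride between the last two pointer values; it may be negative. */
            intptr_t stride = (intptr_t)((uintptr_t)p - (uintptr_t)prev);
            /* Project the most recent stride PF_DIST iterations ahead. */
            __builtin_prefetch(
                (const void *)((uintptr_t)p + (uintptr_t)(PF_DIST * stride)));
        }
    }
    return x;
}
```

Because the stride is recomputed on every iteration, the prefetch address follows the list when it moves from one regularly spaced region to another.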

Without profile data indicating a regular stride for a linked-list traversal, the compiler will insert a prefetch that uses the pointer in the next field. For the above example, it would insert the following prefetch:

for (p = ptr; p != 0; p = p->next) {
    lfetch p->next->next;
    x += p->data;
}

If the loop body is reasonably large, this can help hide some of the latency of the subsequent iteration’s dereference of p.

+Oprefetch_latency=n

Indicates that data prefetches in loops should hide n cycles of memory latency. By default, the compiler attempts to issue prefetches far enough ahead to just fill the L2 cache outstanding request queue or cover the expected memory latency. Using this option overrides that heuristic and causes prefetches to be inserted enough iterations ahead of the corresponding load to cover the n cycles.
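To make the relationship between n and the prefetch distance concrete, here is a simplified model in C. It assumes that the distance is the latency divided by the estimated cycles per loop iteration, rounded up. The function name and the formula are illustrative assumptions; the compiler's actual calculation is not described here.

```c
/* Assumed model: iterations ahead = ceil(latency / cycles per iteration). */
unsigned prefetch_distance(unsigned latency_cycles,
                           unsigned cycles_per_iteration)
{
    if (cycles_per_iteration == 0)
        return 0;  /* no estimate available; avoid dividing by zero */
    return (latency_cycles + cycles_per_iteration - 1) / cycles_per_iteration;
}
```

Under this model, +Oprefetch_latency=200 applied to a loop estimated at 8 cycles per iteration would place each prefetch 25 iterations ahead of its load.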

+O[no]inline:filename

+O[no]inline=symlist

#pragma no_inline

#pragma inline

#pragma [no]inline_call

Enable or disable inlining for specific functions. The functions can be listed either in a separate file filename or on the command line in symlist. By default, the compiler uses heuristics to determine the profitability of inlining candidates, but this option overrides those heuristics. This option can be used when the user knows that inlining a certain function is always profitable or never profitable. The no_inline pragma can also be used to list those functions that should never be inlined, and the inline pragma to list those that should always be inlined. Place the appropriate pragma in the source file that contains the definition of the function that should or should not be inlined.

The [no]inline_call pragma enables or disables inlining of a particular call site. It takes no arguments and affects the outermost, leftmost call in the next statement. However, the [no]inline_call pragma is not implemented in the first release.
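As an illustration of the inline and no_inline pragmas, the following sketch shows how a source file might use them. The function names are hypothetical. The argument syntax (a function name after the pragma) and the placement ahead of the definitions are assumptions based on the description above; consult the compiler reference for the exact form. Compilers that do not recognize these pragmas normally ignore them.

```c
/* Assumed syntax: each pragma names a function defined in this file. */
#pragma inline small_helper   /* request that small_helper always be inlined */
#pragma no_inline cold_path   /* request that cold_path never be inlined */

/* Hypothetical functions, for illustration only. */
int small_helper(int a, int b)
{
    return a + b;
}

int cold_path(int v)
{
    return v * v - 1;
}

int combine(int v)
{
    /* The pragmas affect only inlining decisions, not the result. */
    return small_helper(v, 1) + cold_path(v);
}
```

The same selection could instead be made without touching the source by naming the functions with +Oinline=symlist and +Onoinline=symlist.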