Optimizing Itanium-Based Applications (May 2011)
With profile data, the compiler may also insert stride prefetches for linked-list traversals that have
regular runtime address strides. Consider the following source code example:
    for (p = ptr; p != 0; p = p->next)
        x += p->data;
Normally, the compiler cannot insert prefetches for later iterations of the loop without dereferencing
successive values of the next field. However, profile data may indicate that the values of the p pointer
have a regular address stride in virtual memory. For example, if the values of p on successive iterations
are {8, 16, 24, 32, …}, then it has a regular stride of 8 bytes. The compiler can then insert a prefetch
using this stride to prefetch later iterations, where PF is the number of iterations to prefetch ahead:
    for (p = ptr; p != 0; p = p->next) {
        x += p->data;
        lfetch p + PF*8;
    }
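The transformation can be approximated by hand in portable C. The following is a minimal sketch, not the compiler's actual output: it assumes a GCC-compatible compiler that provides __builtin_prefetch, and the names struct node, PF, STRIDE, and sum_fixed_stride are hypothetical. The stride is assumed to equal the node size, as it would if the nodes were allocated back to back.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
    struct node *next;
    long data;
};

/* Hypothetical tuning values; neither is taken from the compiler. */
enum { PF = 8 };                      /* iterations to prefetch ahead */
#define STRIDE sizeof(struct node)    /* assumed profiled stride, in bytes */

/* Sum a list while prefetching PF strides beyond the current node. */
long sum_fixed_stride(const struct node *ptr)
{
    long x = 0;
    for (const struct node *p = ptr; p != 0; p = p->next) {
        x += p->data;
        /* The address is only a guess. It is computed as an integer so
           that no out-of-bounds pointer arithmetic occurs, and the
           prefetch itself is a hint that does not fault. */
        __builtin_prefetch((const void *)((uintptr_t)p + PF * STRIDE));
    }
    return x;
}
```

If the guessed stride is wrong, the result of the loop is unchanged; only the benefit of the prefetch is lost.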
In some cases, profile data may indicate that there are multiple dominant strides across the program’s
execution. In that case, the compiler may insert a prefetch using a runtime computation of the stride,
such that the stride used in the current iteration’s prefetch is the stride between the values of the pointer
in the last two successive iterations.
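A hand-written equivalent of the runtime stride computation might look like the following. This is a sketch under the same assumptions as before (a GCC-compatible compiler with __builtin_prefetch); the names struct node, PF, and sum_dynamic_stride are hypothetical and the code is not the compiler's generated sequence.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
    struct node *next;
    long data;
};

enum { PF = 8 };   /* hypothetical prefetch distance, in iterations */

/* Sum a list, prefetching with the stride observed between the last
   two values of the pointer. */
long sum_dynamic_stride(const struct node *ptr)
{
    long x = 0;
    uintptr_t prev = (uintptr_t)ptr;    /* address from the previous iteration */
    for (const struct node *p = ptr; p != 0; p = p->next) {
        uintptr_t cur = (uintptr_t)p;
        /* Unsigned subtraction wraps, so a backward stride still yields
           the right address below. The stride is 0 on the first iteration. */
        uintptr_t stride = cur - prev;
        x += p->data;
        __builtin_prefetch((const void *)(cur + PF * stride));
        prev = cur;
    }
    return x;
}
```

Because the stride is recomputed on every iteration, the prefetch adapts when the traversal moves from one regularly spaced region of memory to another.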
Without profile data indicating a regular stride for a linked-list traversal, the compiler will insert a
prefetch of the next field’s pointer. For the above example, it would insert the following prefetch:
    for (p = ptr; p != 0; p = p->next) {
        lfetch p->next;
        x += p->data;
    }
If the loop is reasonably large, this can help hide some of the latency from the subsequent iteration’s
dereference of p.
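The same idea can be written by hand as follows. This is a minimal sketch assuming a GCC-compatible compiler with __builtin_prefetch; struct node and sum_prefetch_next are hypothetical names.

```c
#include <stddef.h>

struct node {
    struct node *next;
    long data;
};

/* Sum a list, prefetching the node that the next iteration will use. */
long sum_prefetch_next(const struct node *ptr)
{
    long x = 0;
    for (const struct node *p = ptr; p != 0; p = p->next) {
        /* p->next is already needed by the loop, so no extra dereference
           is introduced. Prefetching a null pointer at the end of the
           list is harmless because the prefetch is only a hint. */
        __builtin_prefetch(p->next);
        x += p->data;
    }
    return x;
}
```

The prefetch runs only one iteration ahead, so it hides at most the time taken by one trip through the loop body.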
+Oprefetch_latency=n
Indicates that data prefetches in loops should hide n cycles of memory latency. By default, the compiler
attempts to issue prefetches far enough ahead to just fill the L2 cache outstanding request queue or
cover the expected memory latency. Using this option will override that heuristic, and cause prefetches
to be inserted enough iterations ahead of the corresponding load to cover the n cycles.
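As a rough illustration of the arithmetic, the number of iterations to prefetch ahead can be taken as n divided by the estimated cycles per loop iteration, rounded up. The following sketch makes that assumption explicit; the function prefetch_distance and its parameters are hypothetical, and the compiler's actual calculation may differ.

```c
/* Iterations ahead that a prefetch must be issued to cover
   latency_cycles, given an estimate of the cycles taken by one loop
   iteration. Assumed model: ceiling of latency / cycles per iteration. */
unsigned prefetch_distance(unsigned latency_cycles, unsigned cycles_per_iter)
{
    if (cycles_per_iter == 0)
        return 0;   /* no estimate available */
    return latency_cycles / cycles_per_iter
         + (latency_cycles % cycles_per_iter != 0);
}
```

Under this model, a loop body of 10 cycles with n = 200 would place each prefetch 20 iterations ahead of its load.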
+O[no]inline:filename
+O[no]inline=symlist
#pragma no_inline
#pragma inline
#pragma [no]inline_call
Enable or disable inlining for specific functions. The functions can be listed either in a separate file,
filename, or on the command line in symlist. By default, the compiler uses heuristics to determine
the profitability of inlining candidates, but these heuristics are overridden by this option. This option
can be used when the user knows that inlining of a certain function is always profitable, or never
profitable. The no_inline pragma can also be used to list those functions that should never be
inlined, and the inline pragma to list those that should always be inlined. Place the appropriate
pragma in the source file that contains the definition of the function that should or should not be inlined.
The [no]inline_call pragma is used to enable or disable inlining of a particular call site. It takes
no arguments and affects the outermost, leftmost call in the next statement. However, the
[no]inline_call pragma is not implemented in the first release.
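The following sketch shows how the inline and no_inline pragmas might appear in a source file. The pragma names are taken from this section, but the exact argument syntax (a function name after the pragma) and the placement ahead of the definitions are assumptions; check the compiler reference before relying on them. The functions clamp and report_error are hypothetical. The pragmas are guarded by the HP compilers' predefined macros so that other compilers do not see them.

```c
#include <stdio.h>

#if defined(__HP_cc) || defined(__HP_aCC)
/* Assumed syntax: pragma name followed by the function it applies to. */
#pragma inline clamp            /* small and hot: always inline */
#pragma no_inline report_error  /* cold error path: never inline */
#endif

/* Small helper that is assumed to be profitable to inline everywhere. */
int clamp(int v, int lo, int hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

/* Rarely executed function that would only add code size if inlined. */
int report_error(const char *msg)
{
    fprintf(stderr, "error: %s\n", msg);
    return -1;
}
```

Both pragmas sit in the file that defines the functions, as the text above requires.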