-exec
Asserts that the code being compiled is for an executable. As with -Bprotected_def, all locally
defined symbols are marked as having protected export class. Additionally, accesses to symbols known
to be defined in the executable can be materialized with absolute addressing rather than through
linkage-table accesses.
-minshared
Equivalent to -Bprotected -exec. When building an executable that makes minimal use of shared
libraries, use this option to obtain the fastest access sequences to code and data that are not in shared
libraries.
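As a rough illustration only, a build of such an executable might look like the following; the driver
name, optimization level, and source file names are placeholders rather than anything prescribed by
this manual:

cc +O2 -minshared -o myapp main.c util.c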
Controlling other optimization features
+Odata_prefetch=[none|direct|indirect] (default: +Odata_prefetch=indirect)
+O[no]data_prefetch (+Odata_prefetch is equivalent to +Odata_prefetch=indirect)
Enables data prefetch insertion. Currently, data prefetches are inserted for loops containing inductive
accesses, or certain linked-list traversals. With +Odata_prefetch=direct, prefetches are inserted
for loads and stores that have inductive addresses and are on heavily-executed paths through the loop.
The prefetches are inserted to cover the longest latency possible given the size of the outstanding
request queues in the cache hierarchy and the expected memory latency, and are given the appropriate
cache hint for the data type being accessed. The compiler attempts to minimize the overhead of
prefetching using a number of techniques, which might involve unrolling the loop or utilizing rotating
registers to share a single static prefetch among multiple arrays. By default, with +Odata_prefetch
or +Odata_prefetch=indirect, in addition to the prefetches inserted by
+Odata_prefetch=direct, the compiler inserts prefetches for data that is accessed with an
address that is indirectly dependent on an induction expression in the loop. In other words, the induction
expression is fed through some other intermediate computation to build the data address. Currently, the
types of intermediate computation supported are loads and bit extracts. For example, in the following
code, array A is accessed indirectly using the index loaded from array B:
for (i = 0; i < n; i++)
    sum += A[B[i]];    /* A is read through an index loaded from B */
The direct prefetching algorithm would insert prefetches for array B, which has an inductive address.
With indirect prefetching, the compiler detects that array A is accessed indirectly with B[i], and
inserts prefetches appropriately. In order to compute the prefetch address for A, array B is speculatively
loaded. If the prefetch distance is PF, then indirect prefetching inserts the following code into the above
loop:
lfetch B[i+PF*2]       // prefetch B at twice the prefetch distance
index = ld.s B[i+PF]   // speculative (non-faulting) load of the index
(p) lfetch A[index]    // predicated indirect prefetch of A
Notice that array B is now prefetched at twice the normal prefetch distance, because we need to
speculatively load it at the prefetch distance in order to prefetch array A at the prefetch distance. A
speculative load is used because we can run past the end of array B, and we do not want the load of A’s
prefetch address to raise any exceptions. Also notice that we may predicate the indirect prefetch, to
avoid executing it in the last PF-1 iterations of the loop. Because the speculative load may access an
address that is off the end of the B array, the index used in the indirect prefetch may be junk, potentially
resulting in DTLB misses on the indirect prefetch if we executed it. The accesses to B in the last PF-1
iterations are not likely to result in DTLB misses, since they lie just after the B array in the address
space, particularly when utilizing large pages.
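The following is only a rough C-level sketch of the pattern described above, not the compiler's output:
it uses GCC's __builtin_prefetch in place of the lfetch and ld.s instructions, and an explicit bounds
check in place of the predicate, since plain C has no non-faulting speculative load. The prefetch
distance PF, the function name, and the array types are hypothetical.

#define PF 64   /* hypothetical prefetch distance, in elements */

void indirect_prefetch_sketch(const int *B, const double *A, double *out, long n)
{
    long i;
    for (i = 0; i < n; i++) {
        __builtin_prefetch(&B[i + 2 * PF]);   /* B prefetched at twice the distance */
        if (i + PF < n) {                     /* plays the role of the predicate (p) */
            int index = B[i + PF];            /* the "speculative" index load */
            __builtin_prefetch(&A[index]);    /* indirect prefetch of A */
        }
        out[i] = A[B[i]];                     /* the loop's real work */
    }
}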
In certain cases, for small arrays, the compiler may decide to insert a number of straight-line prefetches
before the loop to prefetch the entire array, rather than inserting inductive or indirect prefetches into the
loop body.
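As a hedged illustration of that straight-line alternative (the 64-byte line size, the function name, and
the use of GCC's __builtin_prefetch are assumptions, not documented HP compiler behavior), the entire
small array can be touched ahead of the loop, one cache line at a time; a loop is used here where the
compiler would emit the individual prefetches as straight-line code:

#include <stddef.h>

#define LINE 64   /* assumed cache-line size in bytes */

static void prefetch_whole_array(const void *p, size_t bytes)
{
    size_t off;
    for (off = 0; off < bytes; off += LINE)
        __builtin_prefetch((const char *)p + off);   /* one prefetch per cache line */
}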