-exec
Asserts that the code being compiled is for an executable. As with -Bprotected_def, all locally
defined symbols are marked as having protected export class. Additionally, accesses to symbols known
to be defined in the executable can be materialized with absolute addressing rather than through
linkage-table accesses.
-minshared
Equivalent to -Bprotected -exec. When building an executable that makes minimal use of shared
libraries, use this option to obtain the fastest access sequences to code and data that are not in shared
libraries.
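As a rough illustration only, a build of such an executable might look like the following; the driver
name, optimization level, and source file names are placeholders rather than anything prescribed by
this manual:

cc +O2 -minshared -o myapp main.c util.c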
Controlling other optimization features
+Odata_prefetch=[none|direct|indirect] (default: +Odata_prefetch=indirect)
+O[no]data_prefetch (+Odata_prefetch is equivalent to +Odata_prefetch=indirect)
Enables data prefetch insertion. Currently, data prefetches are inserted for loops containing inductive
accesses, or certain linked-list traversals. With +Odata_prefetch=direct, prefetches are inserted
for loads and stores that have inductive addresses and are on heavily-executed paths through the loop.
The prefetches are inserted to cover the longest latency possible given the size of the outstanding
request queues in the cache hierarchy and the expected memory latency, and are given the appropriate
cache hint for the data type being accessed. The compiler attempts to minimize the overhead of
prefetching using a number of techniques, which might involve unrolling the loop or utilizing rotating
registers to share a single static prefetch among multiple arrays. By default, with +Odata_prefetch
or +Odata_prefetch=indirect, in addition to the prefetches inserted by
+Odata_prefetch=direct, the compiler inserts prefetches for data that is accessed with an
address that is indirectly dependent on an induction expression in the loop. In other words, the induction
expression is fed through some other intermediate computation to build the data address. Currently, the
types of intermediate computation supported are loads and bit extracts. For example, in the following
code, array A is accessed indirectly using the index loaded from array B:
for (i = 0; i < n; i++)
    sum += A[B[i]];    /* A is read through an index loaded from B */
The direct prefetching algorithm would insert prefetches for array B, which has an inductive address.
With indirect prefetching, the compiler detects that array A is accessed indirectly with B[i], and
inserts prefetches appropriately. In order to compute the prefetch address for A, array B is speculatively
loaded. If the prefetch distance is PF, then indirect prefetching inserts the following code into the above
loop:
lfetch B[i+PF*2]       // prefetch B at twice the prefetch distance
index = ld.s B[i+PF]   // speculative (non-faulting) load of the index
(p) lfetch A[index]    // predicated indirect prefetch of A
Notice that array B is now prefetched at twice the normal prefetch distance, because we need to
speculatively load it at the prefetch distance in order to prefetch array A at the prefetch distance. A
speculative load is used because we can run past the end of array B, and we do not want the load of A’s
prefetch address to raise any exceptions. Also notice that we may predicate the indirect prefetch, to
avoid executing it in the last PF-1 iterations of the loop. Because the speculative load may access an
address that is off the end of the B array, the index used in the indirect prefetch may be junk, potentially
resulting in DTLB misses on the indirect prefetch if we executed it. The accesses to B in the last PF-1
iterations are not likely to result in DTLB misses, since they lie just after the B array in the address
space, particularly when utilizing large pages.
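The following is only a rough C-level sketch of the pattern described above, not the compiler's output:
it uses GCC's __builtin_prefetch in place of the lfetch and ld.s instructions, and an explicit bounds
check in place of the predicate, since plain C has no non-faulting speculative load. The prefetch
distance PF, the function name, and the array types are hypothetical.

#define PF 64   /* hypothetical prefetch distance, in elements */

void indirect_prefetch_sketch(const int *B, const double *A, double *out, long n)
{
    long i;
    for (i = 0; i < n; i++) {
        __builtin_prefetch(&B[i + 2 * PF]);   /* B prefetched at twice the distance */
        if (i + PF < n) {                     /* plays the role of the predicate (p) */
            int index = B[i + PF];            /* the "speculative" index load */
            __builtin_prefetch(&A[index]);    /* indirect prefetch of A */
        }
        out[i] = A[B[i]];                     /* the loop's real work */
    }
}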
In certain cases, for small arrays, the compiler may decide to insert a number of straight-line prefetches
before the loop to prefetch the entire array, rather than inserting inductive or indirect prefetches into the
loop body.
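As a hedged illustration of that straight-line alternative (the 64-byte line size, the function name, and
the use of GCC's __builtin_prefetch are assumptions, not documented HP compiler behavior), the entire
small array can be touched ahead of the loop, one cache line at a time; a loop is used here where the
compiler would emit the individual prefetches as straight-line code:

#include <stddef.h>

#define LINE 64   /* assumed cache-line size in bytes */

static void prefetch_whole_array(const void *p, size_t bytes)
{
    size_t off;
    for (off = 0; off < bytes; off += LINE)
        __builtin_prefetch((const char *)p + off);   /* one prefetch per cache line */
}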