HP-UX HB v13.00 Ch-21 - Itanium

HP-UX Handbook – Rev 13.00 Page 5 (of 35)

Chapter 21 Itanium Architecture (IA)

October 29, 2013

functional unit may be idle even though there are instructions in the instruction stream

destined for that functional unit.

The concept behind explicit parallelism is that instructions arrive at the processor explicitly

ordered by the compiler. The compiler organizes the code for an entire program and makes

the ordering explicit so the processor can execute instructions in the most efficient manner.

Simpler, smaller chip control structures are possible when parallelism is exposed by the

compiler instead of the hardware. Space saved on the chip can be used for additional

functional units, large numbers of registers, and large caches, further increasing parallelism

and overall performance.
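
As a rough illustration of what "explicitly ordered by the compiler" can mean, the C sketch below splits a straight-line sequence of register operations into groups whose members do not depend on each other, so that each group could be issued in parallel. The insn type, the assign_groups function, and the simple in-order grouping rule are assumptions made for this example only; they are not taken from any actual Itanium compiler.

```c
/* Hypothetical three-operand instruction: dst = src1 op src2.
   Register numbers are assumed to be in the range 0..31. */
struct insn {
    int dst;
    int src1;
    int src2;
};

/* Walks the code in program order and stores a group number for each
   instruction in group_of[]. A new group is started whenever an
   instruction reads or writes a register that was already written in
   the current group. Returns the number of groups. */
int assign_groups(const struct insn *code, int n, int *group_of)
{
    unsigned long written = 0;   /* registers written in the current group */
    int group = 0;

    for (int i = 0; i < n; i++) {
        unsigned long reads  = (1UL << code[i].src1) | (1UL << code[i].src2);
        unsigned long writes = 1UL << code[i].dst;

        if (written & (reads | writes)) {
            group++;             /* dependency found: close the group */
            written = 0;
        }
        written |= writes;
        group_of[i] = group;
    }
    return n > 0 ? group + 1 : 0;
}
```

For the sequence r1=r2+r3, r4=r5+r6, r7=r1+r4, r8=r7+r2, the first two operations share a group because they are independent, while each of the last two starts a new group, giving three groups in total.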

Predication

Another major performance limiter for traditional architectures is branching. A branch is a

decision between two sets of instructions. Today’s architectures use a method called branch

prediction to predict which set of instructions to load. When branches are mispredicted, the

whole path suffers a time delay. While current architectures may mispredict only 5-10% of

the time, the penalties may slow down the processor by as much as 30-40%. Branches also

constrain compiler efficiency and underutilize the capabilities of the microprocessor.
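
To see how a small misprediction rate can turn into a large slowdown, the following C sketch works through the arithmetic. All figures used with it (branch frequency, penalty in cycles, one cycle per instruction as the baseline) are illustrative assumptions and do not come from this handbook or from measurements of any specific processor.

```c
/* Illustrative workload description; every field is an assumed value. */
struct workload {
    long instructions;     /* instructions executed */
    long base_cycles;      /* cycles needed with no mispredictions */
    long branch_pct;       /* percent of instructions that are branches */
    long mispredict_pct;   /* percent of branches that are mispredicted */
    long penalty_cycles;   /* cycles lost per mispredicted branch */
};

/* Total cycles lost to mispredicted branches. */
long stall_cycles(const struct workload *w)
{
    long branches     = w->instructions * w->branch_pct / 100;
    long mispredicted = branches * w->mispredict_pct / 100;
    return mispredicted * w->penalty_cycles;
}

/* Slowdown in percent relative to the misprediction-free run. */
long slowdown_pct(const struct workload *w)
{
    return 100 * stall_cycles(w) / w->base_cycles;
}
```

With 1000 instructions, 1000 base cycles, 20% branches, a 10% misprediction rate, and an assumed 15-cycle penalty, 20 branches are mispredicted and 300 cycles are lost, a 30% slowdown. Halving the misprediction rate to 5% under the same assumptions halves the slowdown to 15%.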

The new 64-bit ISA uses a concept called predication. Predication effectively executes both

paths of a branch, rather than trying to predict the correct one. When the correct path is known,

the results of the other path are discarded.
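
The idea can be sketched in C. The first function below uses an ordinary branch; the second computes both results and then keeps only the one selected by the condition, so no branch is needed. The function names and the mask-based selection are assumptions chosen for this illustration: on an Itanium-based processor the selection is done by the hardware, not by bit masks, and a compiler is free to translate either function differently.

```c
/* Branching form: only one of the two subtractions is executed, and
   the processor has to guess which one before the compare completes. */
long abs_diff_branch(long a, long b)
{
    long r;
    if (a > b)
        r = a - b;
    else
        r = b - a;
    return r;
}

/* Predicated form: both subtractions are executed, and the condition
   (the predicate) decides which result is kept. */
long abs_diff_predicated(long a, long b)
{
    long p        = (a > b);    /* predicate: 1 if true, 0 if false */
    long then_val = a - b;      /* result of the "then" path */
    long else_val = b - a;      /* result of the "else" path */
    long mask     = -p;         /* all ones if p is 1, zero if p is 0 */

    /* Keep the result of the correct path, discard the other. */
    return (then_val & mask) | (else_val & ~mask);
}
```

Both functions return the same value for the same inputs; the second trades one extra subtraction for the absence of a branch that could be mispredicted.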

Predication can remove many branches from the code and reduce mispredictions significantly.

A study presented at ISCA 1995 by Scott Mahlke and others demonstrated that predication removed

over 50% of the branches and 40% of the mispredicted branches from several popular

benchmark programs. Thus, predication enables increased performance resulting from greater

parallelism and better utilization of an Itanium-based processor’s performance capabilities.

Speculation

Memory latency (the time to retrieve data from memory) is yet another performance

limitation for traditional architectures. Memory latency stalls the processor, leaving it idle

until the data arrives from memory. Because memory speed has not kept up with increasing

processor speeds, loads (the retrieval of data from memory) need to be initiated earlier to

ensure that data arrives when it is needed.

The new 64-bit ISA uses speculation, a method of allowing the compiler to initiate a load

from memory earlier, even before it is known to be needed, thus ensuring data is available for

use if needed. As a result, the compiler can schedule loads so that data has more time to arrive,

without stalling the processor or slowing its performance.
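
The following C sketch models the idea in software. The load is moved ahead of the test that decides whether its value is needed; a load that cannot be completed does not fail immediately but only records that fact, and the record is checked at the point of use. The spec_long type, the spec_load helper, and the use of a null pointer to stand for an invalid address are assumptions made for this illustration; on an Itanium-based processor the deferral and the check are provided by the hardware.

```c
#include <stddef.h>

/* Result of a speculative load: the value plus a "deferred" flag. */
struct spec_long {
    long value;
    int  deferred;   /* nonzero if the load could not be completed */
};

/* Speculative load: never fails. An invalid address (modeled here as
   NULL) sets the deferred flag instead of raising an error. */
struct spec_long spec_load(const long *addr)
{
    struct spec_long r = { 0, 0 };
    if (addr == NULL)
        r.deferred = 1;
    else
        r.value = *addr;
    return r;
}

/* Original order: the load is issued only once it is known to be needed. */
long add_if_original(int needed, const long *addr, long acc)
{
    if (needed)
        acc += *addr;
    return acc;
}

/* Speculative order: the load is issued before the test. Returns 0 and
   stores the result in *out on success, or -1 if the value was needed
   but the load had been deferred. */
int add_if_speculative(int needed, const long *addr, long acc, long *out)
{
    struct spec_long s = spec_load(addr);   /* hoisted above the test */

    if (needed) {
        if (s.deferred)
            return -1;    /* recovery code would repeat the load here */
        acc += s.value;
    }
    *out = acc;
    return 0;
}
```

When the value is not needed, a deferred load has no effect at all, which is what makes it safe to start the load early.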

Because the Itanium ISA allows the compiler to expose maximum parallelism in the code and

explicitly describe it to the hardware, simpler and smaller chip control structures are possible.

Space saved on the chip can then be used for additional resources, such as larger caches and

many more registers and functional units. These, in turn, supply the processor with a steady

stream of instructions and data to make full use of its capabilities, greatly increasing parallel

execution and overall performance.