HP-UX Handbook – Rev 13.00 Page 5 (of 35)
Chapter 21 Itanium Architecture (IA)
October 29, 2013
functional unit may be idle even though there are instructions in the instruction stream
destined for that functional unit.
The concept behind explicit parallelism is that instructions arrive at the processor explicitly
ordered by the compiler. The compiler organizes the code for an entire program and makes
the ordering explicit so the processor can execute instructions in the most efficient manner.
Simpler, smaller chip control structures are possible when parallelism is exposed by the
compiler instead of the hardware. Space saved on the chip can be used for additional
functional units, large numbers of registers, and large caches, further increasing parallelism
and overall performance.
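
The idea of a compiler marking which instructions may run together can be sketched in C. This
is a toy model only, not the Itanium encoding: the toy_insn type, the NUM_REGS limit and the
assign_groups function are hypothetical names introduced for illustration, and the only rule
modeled is that an instruction starts a new group when it reads or writes a register already
written in the current group.

```c
#include <stddef.h>
#include <stdbool.h>

#define NUM_REGS 32  /* hypothetical register count for this toy model */

/* Hypothetical three-operand instruction: dest = src1 op src2.
 * All register numbers must be in the range 0..NUM_REGS-1. */
typedef struct {
    int dest;
    int src1;
    int src2;
} toy_insn;

/*
 * Walk the instructions in program order and assign each one a group number.
 * Instructions in the same group have no register dependency on each other
 * (as modeled here), so hardware could issue them together without checking.
 * Returns the number of groups formed.
 */
size_t assign_groups(const toy_insn *code, size_t n, size_t *group_of)
{
    bool written[NUM_REGS] = { false };  /* registers written in current group */
    size_t group = 0;

    for (size_t i = 0; i < n; i++) {
        const toy_insn *in = &code[i];

        /* Dependency on an earlier instruction in the same group? */
        bool depends = written[in->src1] || written[in->src2] || written[in->dest];

        if (depends) {
            /* Close the current group and start a new one. */
            group++;
            for (int r = 0; r < NUM_REGS; r++)
                written[r] = false;
        }

        written[in->dest] = true;
        group_of[i] = group;
    }

    return n ? group + 1 : 0;
}
```

For the sequence r1=r2+r3, r4=r5+r6, r7=r1+r4, r8=r2+r5, the first two instructions share a
group, and the third starts a new group because it reads r1 and r4. The point of the sketch is
that this analysis happens once, at compile time, rather than in hardware on every execution.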
Predication
Another major performance limiter for traditional architectures is branching. A branch is a
decision between two sets of instructions. Today's architectures use a method called branch
prediction to guess which set of instructions to load. When a branch is mispredicted, the work
done on the wrong path must be discarded and the processor suffers a time delay. While current
architectures may mispredict only 5-10% of the time, the penalties may slow down the processor
by as much as 30-40%. Branches also constrain compiler efficiency and leave the capabilities of
the microprocessor underutilized.
The new 64-bit ISA uses a concept called predication. Predication effectively executes both
paths of a branch, rather than trying to predict the correct one. When the correct path is
known, the results of the other path are discarded.
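
A minimal sketch of this idea in C, assuming a simple absolute-difference computation chosen
only for illustration. The function names absdiff_branch and absdiff_predicated are
hypothetical. The second function mimics predication in software: it computes both paths,
then keeps the result whose predicate is 1. On real hardware the compiler and processor do
this with predicate registers, which C cannot express directly.

```c
#include <stdint.h>

/* Branching form: the processor has to predict which arm will run. */
uint32_t absdiff_branch(uint32_t a, uint32_t b)
{
    if (a > b)
        return a - b;
    else
        return b - a;
}

/* Predicated form (software imitation): no branch, both arms are computed. */
uint32_t absdiff_predicated(uint32_t a, uint32_t b)
{
    uint32_t p_then = (a > b);   /* predicate for the "then" path: 0 or 1 */
    uint32_t p_else = !p_then;   /* complementary predicate for the "else" path */

    /* Both results are computed. Unsigned arithmetic wraps around, so the
     * unused result is well defined even when it is meaningless. */
    uint32_t then_result = a - b;
    uint32_t else_result = b - a;

    /* Keep the result whose predicate is 1 and discard the other. */
    return p_then * then_result + p_else * else_result;
}
```

Both functions return the same value for every input. The predicated form does some wasted
work on each call, which is the trade-off predication makes to avoid misprediction penalties.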
Predication can remove many branches from the code and reduce mispredictions significantly.
A study presented at ISCA 1995 by Scott Mahlke and others demonstrated that predication removed
over 50% of the branches and 40% of the mispredicted branches from several popular
benchmark programs. Thus, predication enables increased performance resulting from greater
parallelism and better utilization of an Itanium-based processor's performance capabilities.
Speculation
Memory latency (the time to retrieve data from memory) is yet another performance
limitation for traditional architectures. Memory latency stalls the processor, leaving it idle
until the data arrives from memory. Because improvements in memory latency have not kept up
with increasing processor speeds, loads (the retrieval of data from memory) need to be
initiated earlier to ensure that data arrives when it is needed.
The new 64-bit ISA uses speculation, a method that allows the compiler to initiate a load
from memory earlier, even before it is known to be needed, so that the data is available if
it is needed. As a result, the compiler can schedule loads to allow more time for data to
arrive without stalling the processor or slowing its performance.
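
The following C sketch imitates a speculative load in software. It is an illustration under
stated assumptions, not the hardware mechanism: the spec_int type, load_speculative and
read_if_needed are hypothetical names, and an out-of-range array index stands in for a load
that would fault. The load is started before the program knows whether the value is needed,
any failure is recorded in a flag instead of being raised, and the flag is checked only when
the value is actually used.

```c
#include <stddef.h>
#include <stdbool.h>

/* A loaded value plus a "deferred fault" flag (hypothetical type). */
typedef struct {
    int  value;
    bool deferred;  /* true if the load could not be completed */
} spec_int;

/* Speculative load: never fails outright; it records the failure instead. */
static spec_int load_speculative(const int *arr, size_t len, size_t i)
{
    spec_int r = { 0, true };
    if (i < len) {
        r.value = arr[i];
        r.deferred = false;
    }
    return r;
}

/*
 * Returns true and stores a result in *out on success. Returns false when the
 * value was needed but the load had failed, which models the fault the
 * original, non-speculative program would have raised at that point.
 */
bool read_if_needed(const int *arr, size_t len, size_t i,
                    bool needed, int fallback, int *out)
{
    /* The load is hoisted above the test of 'needed'. */
    spec_int v = load_speculative(arr, len, i);

    if (!needed) {
        /* Value not needed: discard it, including any deferred failure. */
        *out = fallback;
        return true;
    }

    if (v.deferred)
        return false;  /* check step: the failure is reported only now */

    *out = v.value;
    return true;
}
```

With an array {10, 20, 30}, reading index 1 when the value is needed yields 20. Reading index 7
when the value is not needed succeeds with the fallback, because the failed speculative load is
simply thrown away. Reading index 7 when the value is needed reports the failure.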
Because the Itanium ISA allows the compiler to expose maximum parallelism in the code and
explicitly describe it to the hardware, simpler and smaller chip control structures are possible.
Space saved on the chip can then be used for additional resources, such as larger caches and
many more registers and functional units. These, in turn, supply the processor with a steady
stream of instructions and data to make full use of its capabilities, greatly increasing parallel
execution and overall performance.