Applications paper

Improve Uptime for Your Most Critical Applications
In addition to providing major performance gains, this new
processor family delivers substantial improvements in reliabil-
ity, availability, and serviceability (RAS) to support even higher
levels of data integrity and system uptime. Hardware-based error
prevention, detection, and correction are enhanced and extended
throughout the platform. Improvements in firmware add
to these advantages, providing expanded coverage of potential
error events, along with improved logging for higher availability,
faster recovery, and better support for predictive failure analysis. These
capabilities work in conjunction with Intel Itanium processors’
complete machine check architecture, which coordinates error
handling across hardware, rmware, and operating systems to
enable extremely high availability and data integrity.
Advanced Error Correction throughout the Platform
All silicon-based computer chips are vulnerable to ordinary
background radiation. An alpha particle can change the value of
data in a register or array. Electrical noise and variations in power
supplies can have similar impacts (although they rarely do). The
longer the data is held, the greater the chance that it will be
modied by one of these transient events, resulting in a “soft
error.” There are many possible design strategies for dealing with
soft errors. The best hardware designs automatically detect and
correct for common classes of soft errors to improve data integrity
and system availability without requiring firmware, operating
system (OS), or application intervention.
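To make "automatically detect and correct" concrete, the sketch below uses a textbook Hamming(7,4) code, which can locate and repair any single flipped bit in a 7-bit word. It is an illustration only: the paper does not say which codes the processor uses, and the function names `hamming74_encode` and `hamming74_correct` are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword positions are numbered 1..7; check bits sit at positions 1, 2, 4.
    Position p is stored in integer bit (p - 1).
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(bit << i for i, bit in enumerate(bits))


def hamming74_correct(word):
    """Return (data_nibble, syndrome) after repairing at most one flipped bit.

    The syndrome is the XOR of the positions of all set bits. It is 0 for a
    valid codeword and equals the position of the flipped bit otherwise.
    """
    bits = [(word >> i) & 1 for i in range(7)]
    syndrome = 0
    for position in range(1, 8):
        if bits[position - 1]:
            syndrome ^= position
    if syndrome:
        bits[syndrome - 1] ^= 1  # flip the faulty bit back
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome


stored = hamming74_encode(0b1011)
corrupted = stored ^ (1 << 4)        # simulated soft error at position 5
print(hamming74_correct(corrupted))  # (11, 5)
```

The caller receives the original data without doing anything itself, which is the property the paragraph above describes: the error is handled below the level of firmware, OS, and application.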
The Intel Itanium processor family incorporates extensive fea-
tures for automatically detecting and correcting soft errors at
the hardware level. For example:
• Errors in large caches and arrays are automatically detected and
  corrected using error correcting code (ECC).
• Errors in smaller caches and various buffers and arrays are
  detected using parity bits. These transient errors can then be
  corrected using various forms of “trying again,” which simply
  means returning to a state prior to the error event and then
  proceeding as if the error had not occurred.
• Errors in pipelines are detected using residues, which are
  calculated during mathematical operations, or using parity bits,
  which move along with data and instructions in the pipeline.
  When transient errors are detected in a pipeline, they can
  also be corrected by trying again. The mechanisms are similar
  to those used for correcting errors in smaller caches, buffers,
  and arrays.
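The two detect-then-retry mechanisms described above can be sketched in software. This is a simplified model under stated assumptions, not the processor's actual logic: the helper names (`parity`, `read_with_retry`, `make_flaky_reader`, `checked_add`) are hypothetical, the retry is modeled as simply repeating a read, and the residue check uses modulo 3 as one common choice.

```python
def parity(value):
    """Return 1 if value has an odd number of 1 bits, else 0."""
    return bin(value).count("1") & 1


def read_with_retry(read_fn, max_attempts=3):
    """Repeat read_fn() until the value matches its stored parity bit.

    read_fn returns (value, stored_parity). Returns (value, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        value, stored_parity = read_fn()
        if parity(value) == stored_parity:
            return value, attempt
    raise RuntimeError("persistent parity error; escalate to a higher-level handler")


def make_flaky_reader(value, flip_bit):
    """Build a reader whose first read suffers a one-bit transient error."""
    stored_parity = parity(value)
    state = {"calls": 0}

    def read():
        state["calls"] += 1
        if state["calls"] == 1:
            return value ^ (1 << flip_bit), stored_parity  # corrupted read
        return value, stored_parity                        # clean read

    return read


def checked_add(a, b, faulty_result=None):
    """Add non-negative integers and verify the sum with a mod-3 residue.

    The residue of the sum is predicted from the operands alone, so it does
    not depend on the adder output being checked. A single flipped bit changes
    the result by a power of two, which is never a multiple of 3, so this
    check catches any single-bit error in the sum. faulty_result lets the
    caller inject a corrupted sum for demonstration.
    """
    result = a + b if faulty_result is None else faulty_result
    expected_residue = (a % 3 + b % 3) % 3
    return result, result % 3 == expected_residue
```

In this model, `read_with_retry(make_flaky_reader(0b10110010, 5))` detects the corrupted first read and succeeds on the second attempt, while `checked_add(1234, 5678, faulty_result=6912 ^ (1 << 4))` reports a residue mismatch that would trigger the same kind of retry.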
Next-Generation RAS with Intel® Instruction Replay Technology
The next-generation Intel Itanium processor family provides
enhanced support for soft error detection and correction
throughout the platform. One of the most important new RAS
features is Intel Instruction Replay Technology. This technology
provides exceptionally fast recovery from soft errors in one
of the most performance-critical areas of the processor: the
instruction pipeline. In order to understand how Intel Instruction
Replay functions, it is first necessary to understand how
the pipeline itself works.
Understanding normal (error-free) pipeline execution
Intel Itanium processors have a memory hierarchy of caches,
buffers, and registers that hold the data waiting to be pro-
cessed and the program instructions waiting to be executed.
Software programs held in main memory are executed by
bringing the needed portions of the program into the proces-
sor’s caches. From there, the instructions are moved into
buffers and sent down pipelines to be executed. Data moves in
a similar fashion, from main memory, to caches, to buffers, and
nally to registers, at which point specic instructions act on
specic data.