Intel® QuickPath Interconnect RAS Features
The Intel QuickPath Interconnect offers extensive, multi-level protections
against hard and soft errors to support very high levels of data
reliability and system availability for processor-to-processor and
processor-to-I/O Hub communications.
Error Detection and Correction: Error protection in the interconnect
subsystem works very much like the progressive memory channel
protections described above: 1) CRC is used to detect errors; 2)
transactions can be retried multiple times; 3) the channel can be
physically reset; and 4) bad lanes can be mapped out. Although
mapping out lanes may reduce performance by narrowing a full-width
link to half-width or a half-width link to quarter-width, it
keeps the link operating without interruption and protects
against most multi-bit hard errors.
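The four-step escalation described above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the hardware's actual logic: the Link class, the channel callback, and the retry limit of 3 are hypothetical, and zlib.crc32 merely stands in for whatever CRC the interconnect uses. Only the order of the steps and the full, half, and quarter widths come from the text.

```python
import zlib

MAX_RETRIES = 3                       # assumed; the text only says "multiple times"
WIDTHS = ["full", "half", "quarter"]  # link widths named in the text


class Link:
    """Hypothetical model of one interconnect link (illustration only)."""

    def __init__(self):
        self.width_index = 0  # start at full width
        self.resets = 0       # number of physical resets performed

    @property
    def width(self):
        return WIDTHS[self.width_index]

    def transfer(self, payload, channel):
        """Send payload through channel, escalating when the CRC check fails.

        channel is a hypothetical callable(payload, link) -> (data, crc)
        that stands in for the physical lanes.
        """
        # Steps 1 and 2: detect errors with CRC and retry a bounded number of times.
        if self._try(payload, channel):
            return "ok"
        # Step 3: reset the channel, then try again.
        self.resets += 1
        if self._try(payload, channel):
            return "ok-after-reset"
        # Step 4: map out bad lanes, narrowing the link one step at a time.
        while self.width_index < len(WIDTHS) - 1:
            self.width_index += 1
            if self._try(payload, channel):
                return "ok-at-" + self.width + "-width"
        raise RuntimeError("link failed at every width")

    def _try(self, payload, channel):
        """Return True as soon as one attempt passes the CRC check."""
        for _ in range(MAX_RETRIES):
            data, crc = channel(payload, self)
            if zlib.crc32(data) == crc:
                return True
        return False
```

In this sketch a channel that corrupts data only at full width would fail the retries and the reset, then succeed once the link narrows to half width, which mirrors the trade of bandwidth for continued operation described above.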
Clock Failover: In the event of a clock failure, clocks can be redirected
to one of two failover clock lanes to enable uninterrupted operation.
Hot Pluggable Interconnect Links: Interconnect links can be put
in an electrically idle state without bringing down the system, which
enables a component on the other side of the link, such as an I/O Hub
or another processor, to be physically replaced. This capability can
also be used to create a hard partition between connected components
and to dynamically reconfigure partition boundaries at
runtime, which prevents downtime and uses available resources
more efficiently.
Intelligent Error Management: Intel QuickPath Interconnect
Technology goes beyond error detection and correction. It also
provides information on the type, scope and source of an error.
For example, a single dropped or lost packet often leads to a cascade
of lost packets resulting in a large number of transaction timeouts.
By sorting and analyzing the dropped packets, Intel QuickPath
Interconnect Technology helps to identify the initial source of
the error. The firmware and OS can then use this information to
recover in the least disruptive manner. Intel QuickPath Interconnect
Technology also provides mechanisms to ensure that errors do not
corrupt non-volatile storage, such as a hard drive.
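The idea of sorting dropped packets to find the initial source can be sketched in a few lines. This is a simplified heuristic for illustration, not the interconnect's actual analysis: the DropRecord fields and the rule "earliest record is the probable origin" are assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DropRecord:
    """Hypothetical log entry for one dropped packet (all fields are assumptions)."""
    timestamp_ns: int  # when the drop or timeout was observed
    link: str          # link on which it was observed
    kind: str          # for example "crc_error" or "timeout"


def summarize_cascade(records):
    """Return (probable_origin, per_link_counts) for a burst of dropped packets.

    Heuristic used here: after sorting by time, the earliest record is
    treated as the probable initial error and later ones as fallout.
    """
    if not records:
        return None, Counter()
    ordered = sorted(records, key=lambda r: r.timestamp_ns)
    counts = Counter(r.link for r in ordered)
    return ordered[0], counts
```

Firmware or an OS handler could use such a summary to act on the originating link alone rather than on every link that reported a timeout.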
Advanced Machine Check Architecture
Many of the RAS mechanisms discussed above are supported entirely
in hardware. Others require support from the firmware or the OS. The
Intel® Itanium® microarchitecture implements an Advanced Machine
Check Architecture that coordinates error handling across all these
levels, using well-defined interfaces that enable server vendors to
integrate and extend RAS capabilities in their system designs and
management applications. This sophisticated error management
greatly reduces the likelihood of data corruption. It also improves
the reliability of the system, since it enables hardware to work in
conjunction with system firmware and the OS to recover from
otherwise uncorrectable errors (Figure 4).
The Intel Itanium processor 9300 series includes a number of
enhancements to the Machine Check Architecture. For example,
Corrected Machine Check Interrupt (CMCI) reporting capabilities
have been added to the memory and interconnect subsystems.
CMCI enables error reporting through the local processor interface,
which helps to localize errors faster and provides a foundation for
building predictive failure solutions.
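One way a predictive failure solution might build on corrected-error reports is to count them per component over a sliding time window and flag components that cross a threshold. The sketch below is hypothetical throughout: the class name, the threshold, the window length, and the component labels are assumptions, and it does not model the CMCI interface itself.

```python
from collections import defaultdict, deque


class CorrectedErrorMonitor:
    """Hypothetical consumer of corrected-error notifications."""

    def __init__(self, threshold=10, window_s=3600.0):
        self.threshold = threshold        # assumed: reports in window that signal risk
        self.window_s = window_s          # assumed: window length in seconds
        self._events = defaultdict(deque)  # component -> timestamps of recent reports

    def report(self, component, now_s):
        """Record one corrected error; return True if the component looks at risk."""
        events = self._events[component]
        events.append(now_s)
        # Discard reports that have aged out of the window.
        while events and now_s - events[0] > self.window_s:
            events.popleft()
        return len(events) >= self.threshold
```

A management application could call report() each time a corrected error is signaled and schedule replacement of a component once the call returns True, before the errors become uncorrectable.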
[Figure 4 panels]
Normal status with error prevention: Extensive SE-resilient circuits reduce the risk of encountering most errors.
System recovery: System works in conjunction with FW and OS to recover or restart processes to continue normal operation.
System reconfiguration: Isolate/replace defective hardware or add new resources without bringing down a system to enable a true fault-resilient system.
Error detected: Minimizes or eliminates error escapes with robust data checking features on internal paths and on interconnect links.
Error correction: Detected errors that can be fixed are corrected to quickly resume normal operation.
Error containment: On more severe errors, errant datum and its propagated errors are quarantined and logged for analysis.
Figure 4. The extensive error detection and correction mechanisms in the Intel® Itanium® processor 9300 series, combined with its Advanced Machine
Check Architecture, enable comprehensive error management to optimize both system uptime and data integrity.
White Paper: The Intel® Itanium® Processor 9300 Series