Specifications
Reliability, availability, and serviceability technologies
Processor reliability, availability, and serviceability (RAS) improvements
The Itanium processor 9500 series extends the mainframe-class RAS features of previous Itanium processors. It
incorporates extensive capabilities for detecting, correcting, and reporting processor soft
and hard errors.
Major core structure improvements include:
• Soft errors: High-energy particles striking a processor may cause a logic gate to switch state, resulting in a “soft”
error. Itanium processor 9500 series circuit topologies are designed to make latches more resistant to soft errors
than regular latches. Registers are also less susceptible to soft errors than standard registers.
• ECC or parity: All major structures on the Itanium processor 9500 series are protected by ECC or parity.
End-to-end parity protection with recovery support is featured on all critical internal buses and data paths.
• Intel Cache Safe technology: Heuristics are used to monitor the number of errors per cache index and map out bad
cache lines. Cache data is also automatically scrubbed to correct single-bit errors. The Itanium processor 9500 series
protects the second- and third-level cache arrays. Previous Itanium processors protected only the third-level cache.
• Advanced Machine Check Architecture (AMCA): AMCA enables coordinated error handling across the hardware,
firmware, and OS. The coordinated handling greatly reduces the likelihood of data corruption. It also improves the
reliability of the system because firmware and the OS participate in system recovery from otherwise uncorrectable errors.
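The per-index error monitoring described for Intel Cache Safe technology can be pictured with a small model. The following is only an illustrative sketch, not the processor's actual heuristic: the class name `CacheLineMonitor`, the method names, and the threshold of three errors are all hypothetical.

```python
from collections import Counter


class CacheLineMonitor:
    """Toy model: count corrected errors per cache index and retire bad lines."""

    def __init__(self, threshold=3):
        # Hypothetical threshold; the real heuristic is not documented here.
        self.threshold = threshold
        self.error_counts = Counter()
        self.mapped_out = set()

    def record_error(self, index):
        """Record one corrected error; return True if the line is mapped out."""
        if index in self.mapped_out:
            return True
        self.error_counts[index] += 1
        if self.error_counts[index] >= self.threshold:
            # Repeated errors at one index suggest a hard fault, so stop using it.
            self.mapped_out.add(index)
        return index in self.mapped_out

    def is_usable(self, index):
        """A line stays usable until it has been mapped out."""
        return index not in self.mapped_out
```

In this model an isolated soft error leaves the line in service, while repeated errors at the same index retire it.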
Memory RAS features
Extensive RAS features are integrated to detect and correct errors on the memory subsystem.
• DRAM ECC: By using memory DIMMs whose base DRAM is x4 bits wide, the subsystem supports single device data
correction (SDDC) and double device data correction (DDDC). This means that the memory subsystem can map out two
failed DRAM devices and continue correcting single-bit errors. There is no performance penalty for mapping out the devices.
• Memory scrubbing: Accumulated memory DIMM errors can result in multibit errors that cannot be corrected and can
result in data corruption. Memory scrubbing finds memory errors before they accumulate. Corrected data is written
back to the appropriate memory location.
• SMI Memory Channel Protection: Cyclic Redundancy Check (CRC) is used to detect errors in the SMI channels. When an
error is detected, the transaction is retried several times. If required, the channel can be reinitialized on demand.
If the problem persists, the affected memory channel is mapped out.
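The escalation described for SMI channel protection (detect by CRC, retry, reinitialize, then map out) can be sketched as follows. This is a simplified illustration under stated assumptions: `zlib.crc32` stands in for the channel's actual CRC, which is not specified here, and the retry count, function names, and return values are hypothetical.

```python
import zlib

MAX_RETRIES = 3  # hypothetical; the text only says "several times"


def crc_ok(payload, received_crc):
    """Return True when the received CRC matches the payload (CRC-32 stand-in)."""
    return zlib.crc32(payload) == received_crc


def handle_transfer(send, payload, reinitialize, max_retries=MAX_RETRIES):
    """Run one transfer through the retry, reinitialize, map-out escalation.

    send(payload) returns (data, crc) as seen by the receiver.
    Returns "ok", "ok_after_reinit", or "mapped_out".
    """
    # First attempt plus the allowed number of retries.
    for _ in range(1 + max_retries):
        data, crc = send(payload)
        if crc_ok(data, crc):
            return "ok"
    # Retries exhausted: reinitialize the channel and try once more.
    reinitialize()
    data, crc = send(payload)
    if crc_ok(data, crc):
        return "ok_after_reinit"
    # The problem persists, so the channel is taken out of service.
    return "mapped_out"
```

Each step is more disruptive than the previous one, so a transient error is absorbed by a retry and only a persistent fault leads to the channel being mapped out.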
Intel QPI RAS features
Extensive RAS features are integrated to detect and correct errors on the Intel QPI links.
• Error detection and correction: CRC is used to detect errors. Transactions can be retried multiple times, the channel
can be physically reset on the fly by the link layer, and bad lanes can be failed over.
• Clock failover: In the event of a clock failure, clocks can be redirected to one of two failover clock lanes to enable
uninterrupted operation.
• Lane failover: During operation, failed lanes cause CRC errors that trigger an “on-the-fly” link retraining in which
the bad lanes are mapped out. Operations are resumed with a reduced-width link. Although mapping out lanes may affect
performance by reducing a full-width link to half width, or a half-width link to quarter width, it does enable
uninterrupted operation and protection against most multibit hard errors.
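The width reduction in lane failover can be illustrated with a short sketch. It assumes a link of 20 lanes grouped into four quadrants of five lanes each, and a simplified rule that any healthy quadrants may be combined; neither the grouping nor the selection rule is stated in this document, and the function name `negotiate_width` is hypothetical.

```python
LANES_PER_QUADRANT = 5  # assumption: 20 lanes in four groups of five
QUADRANTS = 4


def negotiate_width(failed_lanes):
    """Return (width, quadrants_used) for the widest link that avoids failed lanes."""
    healthy = [
        q for q in range(QUADRANTS)
        if not any(
            q * LANES_PER_QUADRANT <= lane < (q + 1) * LANES_PER_QUADRANT
            for lane in failed_lanes
        )
    ]
    if len(healthy) == QUADRANTS:
        return "full", healthy
    if len(healthy) >= 2:
        return "half", healthy[:2]   # two healthy quadrants carry the link
    if len(healthy) == 1:
        return "quarter", healthy    # last healthy quadrant keeps the link up
    return "down", []                # no healthy quadrant remains
```

A single bad lane costs the link half its width in this model, but traffic keeps flowing, which matches the trade-off between performance and uninterrupted operation described above.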