Specifications

Reliability, availability, and serviceability technologies

Processor reliability, availability, and serviceability (RAS) improvements

Itanium processor 9500 series extends the mainframe-class RAS features of previous Itanium processors. The Itanium processor 9500 series incorporates extensive capabilities for detecting, correcting, and reporting processor soft and hard errors.

Major core structure improvements include:

• Soft errors: High-energy particles striking a processor may cause a logic gate to switch state, resulting in a “soft” error. Itanium processor 9500 series circuit topologies were designed to make latches more resistant to soft errors than regular latches. Registers are also less susceptible to soft errors than standard registers.

• ECC or parity: All major structures on the Itanium processor 9500 series are protected through ECC or parity error protection. End-to-end parity protection with recovery support is featured on all critical internal buses and data paths.

• Intel Cache Safe technology: Heuristics are used to monitor the number of errors per cache index and map out bad cache lines. Cache data is also automatically scrubbed to correct single-bit errors. Itanium processor 9500 series protects the second- and third-level cache arrays. Previous Itanium processors only protected the third-level cache.

• Advanced Machine Check Architecture (AMCA): This enables coordinated error handling across the hardware, firmware, and OSs. The coordinated handling greatly reduces the likelihood of data corruption. It also improves the reliability of the system, because firmware and OS participate in system recovery from otherwise uncorrectable errors.
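The per-index error monitoring described under Intel Cache Safe technology can be illustrated with a minimal sketch. It is not the processor's actual heuristic, which the text does not specify: the class name `CacheLineTracker`, the method names, and the threshold of three errors are all hypothetical and chosen only for illustration.

```python
from collections import Counter


class CacheLineTracker:
    """Toy model: count corrected errors per cache index and map out repeat offenders."""

    def __init__(self, threshold=3):
        # Hypothetical threshold; the real heuristic is not given in the text.
        self.threshold = threshold
        self.error_counts = Counter()
        self.mapped_out = set()

    def record_error(self, index):
        """Record one corrected error; return True if the line is now mapped out."""
        if index in self.mapped_out:
            return True
        self.error_counts[index] += 1
        if self.error_counts[index] >= self.threshold:
            self.mapped_out.add(index)
        return index in self.mapped_out

    def is_usable(self, index):
        """A line stays usable until it has been mapped out."""
        return index not in self.mapped_out


# Index 7 reports three errors and is mapped out; index 12 reports one and stays in use.
tracker = CacheLineTracker(threshold=3)
for idx in [7, 7, 12, 7]:
    tracker.record_error(idx)
```

In this sketch an isolated error leaves a line in service, while a line that keeps failing is retired, which is the distinction between a transient soft error and a likely hard fault.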

Memory RAS features

Extensive RAS features are integrated to detect and correct errors in the memory subsystem.

• DRAM ECC: By using DIMMs built from x4 (4-bit-wide) DRAM devices, the memory subsystem supports single device data correction (SDDC) and double device data correction (DDDC). This means that the memory subsystem can map out two failed devices and continue correcting single-bit errors. There is no performance penalty for mapping out the devices.

• Memory scrubbing: Accumulated memory DIMM errors can result in multibit errors that cannot be corrected and can result in data corruption. Memory scrubbing finds memory errors before they accumulate. Corrected data is written back to the appropriate memory location.

• SMI Memory Channel Protection: Cyclic Redundancy Check (CRC) is used to detect errors in the SMI channels. When an error is detected, the transaction is retried several times. If required, the channel can be reinitialized on demand. If the problem persists, the affected memory channel is mapped out.
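The idea behind memory scrubbing, finding and rewriting single-bit errors before a second error lands in the same word, can be sketched as follows. The sketch assumes a Hamming(7,4) single-error-correcting code purely for illustration; the actual DIMM ECC that provides SDDC and DDDC is considerably stronger and is not described in the text. The function names `encode`, `syndrome`, `scrub`, and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (list of bits, positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 is unused so positions are 1-based
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def syndrome(word):
    """Return 0 for a clean word, otherwise the 1-based position of a single flipped bit."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def scrub(memory):
    """Walk every word, correct single-bit errors in place, and return the number fixed."""
    corrected = 0
    for word in memory:
        s = syndrome(word)
        if s:
            word[s - 1] ^= 1            # write the corrected bit back
            corrected += 1
    return corrected


def decode(word):
    """Extract the 4 data bits from a codeword."""
    data_bits = [word[2], word[4], word[5], word[6]]
    return sum(bit << i for i, bit in enumerate(data_bits))


memory = [encode(n) for n in (0x3, 0xA, 0xF)]
memory[1][4] ^= 1                       # inject a single-bit soft error
fixed = scrub(memory)
values = [decode(w) for w in memory]
```

Because this toy code corrects only one flipped bit per word, a second flip in the same word before the next scrub pass would be miscorrected, which is why scrubbing periodically matters.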

Intel QPI RAS features

Extensive RAS features are integrated to detect and correct errors on the Intel QPI links.

• Error detection and correction: CRC is used to detect errors. Transactions can be retried multiple times, the channel can be physically reset on the fly by the link layer, and bad lanes can be failed over.

• Clock failover: In the event of a clock failure, clocks can be redirected to one of two failover clock lanes to enable uninterrupted operation.

• Lane failover: During operation, failed lanes cause CRC errors that trigger an “on the fly” link retraining in which the bad lanes are mapped out. Operation resumes with a reduced-width link. Although mapping out lanes may reduce performance by narrowing a full-width link to half width, or a half-width link to quarter width, it enables uninterrupted operation and protects against most multibit hard errors.
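The escalation from retry to reduced-width operation can be sketched as a small state machine. This is a simplified illustration, not the Intel QPI link-layer protocol: it omits the physical reset step, and the `Link` class, the `transfer_ok` callback, and the retry count of three are hypothetical.

```python
WIDTHS = ["full", "half", "quarter"]


class Link:
    """Toy link that retries on CRC failure, then retrains at a narrower width."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries  # hypothetical; the text says only "multiple times"
        self.width_index = 0

    @property
    def width(self):
        return WIDTHS[self.width_index]

    def send(self, transfer_ok):
        """transfer_ok(width) returns True when the CRC check passes at that width."""
        while True:
            for _ in range(self.max_retries):
                if transfer_ok(self.width):
                    return self.width
            if self.width_index == len(WIDTHS) - 1:
                raise RuntimeError("link failed at minimum width")
            self.width_index += 1       # map out the bad lanes and retrain narrower


# A hard fault that only affects the full-width configuration.
link = Link()
result = link.send(lambda width: width != "full")
```

Here the transfer completes at half width after the retries at full width are exhausted, mirroring the trade of bandwidth for uninterrupted operation described above.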