Family paper

Processor Hot Plug⁶: An entire processor or processor board
can be physically removed, replaced or added without bringing
down the system, either to replace a failed or failing component
or to add resources as workloads grow. Processors or cores can
also be logically added or removed from running partitions, either
to reallocate resources among partitions or to functionally
integrate a spare processor into a running partition.
Advanced Machine Check Architecture: The Advanced Machine
Check Architecture enables coordinated error handling across the
hardware, firmware and operating system (see page 9 and Figure 4
for more information).
Memory RAS Features
Extensive RAS features are integrated to detect and correct errors
throughout the memory subsystem.⁷
DRAM Protection: Error Correcting Code (ECC) mechanisms are
implemented to detect and fix errors in attached memory components.
Both Single Device Data Correction (SDDC) and Double Device
Data Correction (DDDC) are supported. SDDC is strong enough to
correct multi-bit errors⁸ in a single DRAM device, to map out a failed
device, and to continue correcting single-bit errors after a device is
mapped out. DDDC is even stronger. It can correct multi-bit errors
in two DRAM devices, map out two failed devices, and continue
correcting single-bit errors after the devices are mapped out.
These mechanisms can improve system uptime and reduce
DIMM replacement rates. There is no performance penalty
for mapping out the devices.
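The detect-and-correct idea behind ECC can be illustrated with a minimal sketch. This is a textbook Hamming(7,4) code, chosen only for brevity; it is not the code used for SDDC or DDDC, which this paper does not specify and which must handle multi-bit errors confined to a device. The function names are hypothetical.

```python
# Illustrative only: a textbook Hamming(7,4) single-error-correcting code.
# This is NOT the SDDC/DDDC algorithm; names here are hypothetical.

def hamming74_encode(nibble):
    """Encode 4 data bits (int 0..15) as a 7-bit codeword (list of 0/1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused; positions 1..7
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    return code[1:]

def hamming74_decode(bits):
    """Return (nibble, syndrome). A nonzero syndrome is the 1-based
    position of the single flipped bit, which is corrected before decoding."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1             # flip the faulty bit back
    d = [code[3], code[5], code[6], code[7]]
    return sum(b << i for i, b in enumerate(d)), syndrome

if __name__ == "__main__":
    word = hamming74_encode(0b1011)
    word[4] ^= 1                        # simulate a single-bit error at position 5
    print(hamming74_decode(word))
```

In this sketch the syndrome both detects the error and identifies where it occurred, which is the same property that lets a memory controller repair data in place and log the failing location.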
Figure 2. The Intel® Itanium® processor 9300 series provides comprehensive support for error avoidance, detection, correction, containment and
reporting across all major structures.
Reducing Soft Errors in Latches by Up to 100x
One of the most common non-human causes of server errors
is naturally occurring high-energy particles striking
nuclei in processors, chipsets and memory components. In
some cases, the energy from these events can cause a logic
gate to switch states, resulting in a “soft” error that can
corrupt data and even bring down an entire server.
As discussed in this paper, the Intel® Itanium® processor 9300
series has extensive mechanisms for detecting, correcting
and containing these errors. It also includes new circuit
topologies that dramatically reduce the frequency of soft errors.
Estimates show that the new soft error (SE) hardened latches
are up to 100 times less susceptible to soft errors than
standard latches, and the new SE hardened registers are up to
80 times less susceptible than standard registers.
Figure 3. A random particle strike can change a logic state,
causing a “soft error” that must be detected and corrected.
[Figure 2 diagram labels: Scalable Buffered Memory; Intel® Itanium® processor 9300 series; Intel® 7500 chipset; ICH10; PCI Express* Gen 2]
1. Memory RAS
• DRAM Protection (ECC, SDDC, DDDC)
• Memory Scrub (Patrol and Demand)
• Memory Thermal Protection (CLTT and OLTT)
• Memory Channel Protection (Retry, Reset, Lane Failover)
• Memory Migration
• DIMM Sparing
• Memory Mirroring
• Memory Channel Hot Plug
2. Processor RAS
• ECC and Parity protection
• SE Hardened Latches and Registers
• Intel® Cache Safe technology (L2I, L2D, L3, directory cache)
• Processor Hot Plug
• Advanced Machine Check Architecture with new CMCI support
3. Intel® QuickPath Interconnect RAS
• Detection/Correction (CRC, Retry, Reset, Lane Failover)
• Hot Plug Links (supports Hot Plug for I/O Hub and PCIe cards)
• Domain Partitioning
• Intelligent Error Management
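The CRC and Retry entries listed for the memory channel and the Intel® QuickPath Interconnect can be illustrated with a minimal sketch: the sender appends a checksum, the receiver verifies it, and a corrupted transfer is resent. The CRC-32 from Python's zlib module is a stand-in here, since this paper does not specify the link-level CRC, and every function name and the retry limit are assumptions made for illustration.

```python
# Illustrative only: CRC-based detection with retry over an unreliable link.
# zlib.crc32 is a stand-in for the (unspecified) link CRC; names are hypothetical.
import zlib

def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_frame(frame: bytes):
    """Return the payload if the CRC matches, otherwise None."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None

def send_with_retry(payload: bytes, channel, max_retries: int = 3):
    """Send once, then retry up to max_retries times on CRC failure.
    Returns (payload, attempts_used); raises OSError if every attempt fails."""
    for attempt in range(1, max_retries + 2):
        result = check_frame(channel(make_frame(payload)))
        if result is not None:
            return result, attempt
    raise OSError("link failed after retries; escalate to reset or lane failover")

def flaky_channel(failures: int):
    """Simulated link that flips one bit in each of the first `failures` frames."""
    remaining = {"count": failures}
    def channel(frame: bytes) -> bytes:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            corrupted = bytearray(frame)
            corrupted[0] ^= 0x01        # transient single-bit error
            return bytes(corrupted)
        return frame
    return channel

if __name__ == "__main__":
    print(send_with_retry(b"cache line", flaky_channel(2)))
```

In this sketch a transient error costs only a resend, while a persistent failure exhausts the retries and is escalated, which mirrors the ordering of Retry, Reset and Lane Failover in the lists above.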
White Paper: The Intel® Itanium® Processor 9300 Series