
Memory Scrubbing (Patrol and Demand): The protections
described above are activated only when a memory location is
read. However, errors can also occur in data or instructions held in
memory locations that are not accessed. If these errors accumulate,
they can become multi-bit errors that cannot be corrected, leading to
data corruption and even system failure. Memory scrubbing in the
Intel Itanium processor 9300 series employs an integrated hardware
engine to find and correct memory errors before they accumulate.
Scrubbing is performed periodically and automatically on all populated
memory locations (Patrol Scrubbing). In addition, when errors
are discovered for data in transit, the corrected data is rewritten
back to the appropriate memory location (Demand Scrubbing).
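
The reason scrubbing helps can be illustrated with a small simulation. The sketch below is a toy model, not the processor's scrubbing engine: it assumes an ECC that corrects one flipped bit per word and cannot correct two, and the names Memory, patrol_scrub, and run are hypothetical.

```python
# Toy model: each memory word tracks how many of its bits are currently flipped.
# Assumption (not from the paper): ECC corrects 1 flipped bit per word, not 2.
CORRECTABLE_BITS = 1


class Memory:
    def __init__(self, size):
        self.flipped = [0] * size  # flipped-bit count per word
        self.uncorrectable = 0     # number of failed correction attempts

    def inject_error(self, addr):
        """A soft error flips one more bit in the word at addr."""
        self.flipped[addr] += 1

    def _correct(self, addr):
        """Shared ECC path: repair the word in place if the ECC can."""
        if self.flipped[addr] == 0:
            return True
        if self.flipped[addr] <= CORRECTABLE_BITS:
            self.flipped[addr] = 0  # write the corrected data back
            return True
        self.uncorrectable += 1
        return False

    def read(self, addr):
        """Demand scrubbing: a normal read that writes back corrected data."""
        return self._correct(addr)

    def patrol_scrub(self):
        """Patrol scrubbing: visit every word, whether or not it is in use."""
        for addr in range(len(self.flipped)):
            self._correct(addr)


def run(scrub_between_errors):
    """Hit the same unread word twice, optionally scrubbing in between."""
    mem = Memory(4)
    mem.inject_error(2)
    if scrub_between_errors:
        mem.patrol_scrub()
    mem.inject_error(2)
    return mem.read(2)  # True if the data is still recoverable
```

In this model, run(True) leaves the word recoverable because the first error is repaired before the second arrives, while run(False) lets the two errors accumulate into an uncorrectable multi-bit error.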

Memory Thermal Protection: Overheating of memory components
can cause or accelerate component failure. The Intel Itanium processor
9300 series supports two mechanisms for throttling commands
issued to the memory channels to protect against overheating. Closed-loop
thermal throttling (CLTT) is triggered by a thermal sensor in the
DIMM that sends a signal to the memory controller. Open-loop thermal
throttling (OLTT) is triggered when the rate of memory commands
per DIMM exceeds a configurable limit for a configurable time window.
Alternatively, the firmware can be configured to increase system fan
speed in response to these same triggers.
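
The two triggers can be sketched as a simple per-DIMM gate. This is one possible interpretation, assuming the open-loop check is a sliding-window command count; the class DimmThrottle, its parameters, and the time units are hypothetical and do not describe the actual memory controller logic.

```python
from collections import deque


class DimmThrottle:
    """Decides whether a memory command to one DIMM may be issued."""

    def __init__(self, max_commands, window):
        self.max_commands = max_commands  # configurable limit (OLTT)
        self.window = window              # configurable time window (OLTT)
        self.times = deque()              # issue times of recent commands
        self.sensor_hot = False           # last signal from the DIMM sensor (CLTT)

    def set_sensor(self, hot):
        """Closed loop: the DIMM's thermal sensor reports its state."""
        self.sensor_hot = hot

    def allow(self, now):
        """Return True and record the command if it may be issued at time now."""
        # Forget commands that have fallen out of the time window.
        while self.times and now - self.times[0] >= self.window:
            self.times.popleft()
        if self.sensor_hot:
            return False  # CLTT: throttle on measured temperature
        if len(self.times) >= self.max_commands:
            return False  # OLTT: throttle on estimated activity
        self.times.append(now)
        return True
```

The closed-loop path reacts to a measured temperature, while the open-loop path needs no sensor and infers thermal load from command traffic alone.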

Memory Channel Protection: A Cyclic Redundancy Check (CRC)
mechanism is used to detect errors in the memory channels. When
an error is detected, a series of progressively stronger corrective
actions is triggered: 1) the transaction is retried, several times
if necessary, which corrects most soft errors; 2) the memory
channel is physically reset (reinitialized), which corrects most
persistent errors; 3) if the problem persists, the affected lane
on the memory channel is mapped out. This corrects hard
errors without degrading performance.
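
The three-stage escalation described above can be sketched as follows. This is an illustrative simulation only: it uses Python's zlib.crc32 as a stand-in for the channel's CRC, and SimChannel, MAX_RETRIES, and send_with_recovery are hypothetical names with an assumed retry budget.

```python
import zlib

MAX_RETRIES = 3  # assumed retry budget per stage; not specified in the paper


class SimChannel:
    """Simulated link. fault is 'none', 'soft', 'persistent', or 'hard'."""

    def __init__(self, fault="none"):
        self.fault = fault
        self.soft_hits = 1 if fault == "soft" else 0  # one transient glitch

    def transfer(self, payload):
        """Send a non-empty payload; return (received bytes, sender's CRC)."""
        crc = zlib.crc32(payload)
        corrupt = False
        if self.soft_hits > 0:
            self.soft_hits -= 1
            corrupt = True
        elif self.fault in ("persistent", "hard"):
            corrupt = True
        if corrupt:
            payload = bytes([payload[0] ^ 0x01]) + payload[1:]  # flip one bit
        return payload, crc

    def reset(self):
        """Reinitializing the channel clears a persistent (non-hard) fault."""
        if self.fault == "persistent":
            self.fault = "none"

    def map_out_lane(self):
        """Mapping out the failing lane removes a hard fault."""
        if self.fault == "hard":
            self.fault = "none"


def attempt(channel, payload):
    """Stage helper: retry until the CRC matches or the budget runs out."""
    for _ in range(MAX_RETRIES):
        data, crc = channel.transfer(payload)
        if zlib.crc32(data) == crc:
            return data
    return None


def send_with_recovery(channel, payload):
    """Return (data, stage), where stage is the strongest action needed."""
    data = attempt(channel, payload)      # 1) retry
    if data is not None:
        return data, 1
    channel.reset()                       # 2) reset the channel
    data = attempt(channel, payload)
    if data is not None:
        return data, 2
    channel.map_out_lane()                # 3) map out the affected lane
    data = attempt(channel, payload)
    if data is not None:
        return data, 3
    raise RuntimeError("channel unrecoverable")
```

In this model a soft fault is cleared at stage 1, a persistent fault at stage 2, and a hard fault at stage 3, mirroring the order of actions in the text.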

Memory Migration and DIMM Sparing: An algorithm in system
firmware continuously monitors memory errors. If it determines
that a memory component is failing, a hardware engine can copy
the contents of the failing component to another location. This
process can be completely transparent to the OS. Two mechanisms
are supported. With DIMM sparing, the contents of the failing DIMM
are copied to a spare DIMM on the same memory channel. With
memory migration, the contents are copied to the memory of any
other memory controller on the system. DIMM sparing requires less
memory overhead (just a single spare DIMM per memory channel).
However, memory migration enables hot-swap capabilities for an
entire memory card and can help to support hot-swap functionality
for processors.
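
A minimal sketch of DIMM sparing, under stated assumptions: the paper does not say how firmware decides that a component is failing, so the corrected-error threshold below is a hypothetical policy, and SparingChannel and its methods are illustrative names.

```python
ERROR_THRESHOLD = 3  # hypothetical: corrected errors tolerated before failover


class SparingChannel:
    """One memory channel with an active DIMM and a spare DIMM."""

    def __init__(self, contents):
        self.dimms = {
            "dimm0": list(contents),
            "spare": [None] * len(contents),
        }
        self.in_use = "dimm0"
        self.corrected_errors = 0

    def record_corrected_error(self):
        """Monitoring step: count errors and fail over once the limit is hit."""
        self.corrected_errors += 1
        if self.in_use == "dimm0" and self.corrected_errors >= ERROR_THRESHOLD:
            # Copy step: move the contents to the spare, then redirect accesses.
            self.dimms["spare"][:] = self.dimms["dimm0"]
            self.in_use = "spare"

    def read(self, addr):
        """The caller uses the same address before and after the failover."""
        return self.dimms[self.in_use][addr]
```

Because read takes the same address before and after the switch, the change of physical DIMM is invisible to the caller, which is the sense in which the process can be transparent to the OS. Memory migration follows the same pattern, with the copy target on another memory controller instead of a spare DIMM on the same channel.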

Memory Mirroring: The Intel Itanium processor 9300 series can be
configured to automatically maintain a backup copy of main memory.
If a failure is detected, the correct data can be accessed from the
backup. Since the probability of simultaneous errors in corresponding
memory locations on two different DIMMs is extremely small, this
provides exceptionally strong protection against memory errors.
However, it does require that the system be configured with twice
the memory capacity to support the backup. If memory capacity is
an issue, memory mirroring can be configured only for selected
memory controllers.
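
The mirroring trade-off can be sketched in a few lines. This is a simplified model with hypothetical names (MirroredMemory, mark_primary_failed); real hardware detects failures through ECC rather than through an explicit flag.

```python
class MirroredMemory:
    """Every write goes to two copies; reads fall back to the mirror."""

    def __init__(self, usable_words):
        self.primary = [0] * usable_words
        self.mirror = [0] * usable_words
        self.failed = set()  # primary addresses with uncorrectable errors
        # Capacity cost: installed memory is twice the usable memory.
        self.installed_words = 2 * usable_words

    def write(self, addr, value):
        self.primary[addr] = value
        self.mirror[addr] = value

    def mark_primary_failed(self, addr):
        """Simulate an uncorrectable error in the primary copy."""
        self.primary[addr] = None
        self.failed.add(addr)

    def read(self, addr):
        if addr in self.failed:
            return self.mirror[addr]  # serve the data from the backup copy
        return self.primary[addr]
```

Data is lost in this model only if both copies of the same address fail, which is the low-probability event the text refers to.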

Memory Channel Hot Plug: Memory channels can be put in an
electrically idle state. This enables IT personnel to logically reallocate
memory resources among running partitions. It also enables them
to physically add or remove memory riser cards without bringing
down the system. With these capabilities, memory upgrades, faulty
memory card replacements, and resource management can all be
performed without downtime.
White Paper: The Intel® Itanium® Processor 9300 Series