Chapter 4. Continuous availability and manageability 167
Sometimes an uncorrectable error is temporary in nature and occurs in data that can be
recovered from another repository. For example:
- Data in the instruction L1 cache is never modified within the cache itself. Therefore, an
  uncorrectable error discovered in the cache is treated like an ordinary cache miss, and
  correct data is loaded from the L2 cache.
- The L2 and L3 caches of the POWER7 processor-based systems can hold an unmodified
  copy of data in a portion of main memory. In this case, an uncorrectable error simply
  triggers a reload of a cache line from main memory.
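The two recovery cases above can be illustrated with a minimal software model. This is only a
sketch under stated assumptions: the names Cache, CacheLine, and UncorrectableError are
hypothetical, and the corrupt flag stands in for a hardware-detected uncorrectable error. It
does not represent the POWER7 cache controller logic.

```python
from dataclasses import dataclass, field


class UncorrectableError(Exception):
    """Raised when the only valid copy of the data is lost (hypothetical name)."""


@dataclass
class CacheLine:
    data: bytes
    dirty: bool = False    # True if the line was modified relative to the backing store
    corrupt: bool = False  # Stand-in for a hardware-detected uncorrectable error


@dataclass
class Cache:
    backing: dict                              # Next level: address -> bytes
    lines: dict = field(default_factory=dict)  # Cached lines: address -> CacheLine
    reloads: int = 0                           # Reloads triggered by an error

    def read(self, addr):
        line = self.lines.get(addr)
        if line is not None and line.corrupt:
            if line.dirty:
                # No clean copy exists elsewhere, so a simple reload cannot help.
                raise UncorrectableError(f"modified line at {addr:#x} is lost")
            # Unmodified line: discard it and take the ordinary miss path.
            del self.lines[addr]
            line = None
            self.reloads += 1
        if line is None:
            line = CacheLine(self.backing[addr])
            self.lines[addr] = line
        return line.data
```

In this model, a read of a corrupt but unmodified line returns the correct data from the
backing store, and only a corrupt modified line raises an error, which is the case that the
following paragraphs address.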
In cases where the data cannot be recovered from another source, a technique called Special
Uncorrectable Error (SUE) handling is used to prevent an uncorrectable error in memory or
cache from immediately causing the system to terminate. Instead, the system tags the data
and determines whether it can ever be used again:
- If the data is never used again, the error is irrelevant and does not force a checkstop.
- If the data is used, termination can be limited to the program, kernel, or hypervisor owning
  the data, or to a freeze of the I/O adapters that are controlled by an I/O hub controller if
  the data is to be transferred to an I/O device.
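The containment decision described above can be summarized as a small lookup. This is a
hedged sketch only: the Consumer enumeration and the containment_action function are
hypothetical names introduced for illustration, and the actual decision is made by hardware,
firmware, and the operating system rather than by a single function.

```python
from enum import Enum


class Consumer(Enum):
    """Who attempts to use the tagged data (hypothetical classification)."""
    PROGRAM = "program"
    KERNEL = "kernel"
    HYPERVISOR = "hypervisor"
    IO_DEVICE = "I/O device"


# Narrowest scope of impact for each consumer, following the text above.
_ACTIONS = {
    Consumer.PROGRAM: "terminate the owning program",
    Consumer.KERNEL: "terminate the owning kernel",
    Consumer.HYPERVISOR: "terminate the hypervisor",
    Consumer.IO_DEVICE: "freeze the I/O adapters under the I/O hub controller",
}


def containment_action(consumer=None):
    """Return the containment action for tagged data.

    A consumer of None means the tagged data was never used again.
    """
    if consumer is None:
        return "no action: the error is irrelevant and forces no checkstop"
    return _ACTIONS[consumer]
```

For example, containment_action(Consumer.PROGRAM) selects termination of the owning program
only, while containment_action() with no consumer selects no action at all.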
When an uncorrectable error is detected, the system modifies the associated ECC word,
thereby signaling to the rest of the system that the standard ECC is no longer valid. The
service processor is then notified and takes appropriate actions. On systems running AIX V5.2
(or later) or Linux, when a process attempts to use the data, the operating system is informed
of the error. Depending on the operating system and firmware level, and on whether the data
was associated with a kernel or non-kernel process, the operating system might terminate
entirely or terminate only the specific process associated with the corrupt data.
A reboot of the entire system is needed only when the corrupt data is being used by the
POWER Hypervisor; rebooting in that case preserves overall system integrity. If Active
Memory Mirroring is enabled, the entire system is protected and continues to run.
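The deferred nature of SUE handling (tag at detection, act only at use) can be sketched as
follows. The TaggedMemory class, its method names, and the notify callback are hypothetical
and stand in for the modified ECC word and the service processor notification. Clearing the
tag when the location is fully overwritten is an assumption of this sketch, not a statement
from the text.

```python
class SueConsumedError(Exception):
    """Raised when tagged data is actually used (hypothetical name)."""

    def __init__(self, addr):
        super().__init__(f"attempt to use SUE-tagged data at {addr:#x}")
        self.addr = addr


class TaggedMemory:
    def __init__(self, notify):
        self._words = {}       # address -> value
        self._tagged = set()   # addresses whose data is marked as invalid
        self._notify = notify  # stand-in for notifying the service processor

    def store(self, addr, value):
        self._words[addr] = value
        # Assumption: a full overwrite replaces the bad data, so the tag is dropped.
        self._tagged.discard(addr)

    def detect_uncorrectable(self, addr):
        # Detection only tags the data and reports it; nothing is terminated here.
        self._tagged.add(addr)
        self._notify(addr)

    def load(self, addr):
        if addr in self._tagged:
            # The error becomes relevant only now, when the data is consumed.
            raise SueConsumedError(addr)
        return self._words[addr]
```

A caller that catches SueConsumedError can then decide how much to terminate, based on who
owns the data at the failing address. Untagged locations remain fully usable throughout.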
Depending on the system configuration and the source of the data, errors encountered during
I/O operations might not result in a machine check. Instead, the incorrect data is handled by
the PCI host bridge (PHB) chip. When the PHB chip detects a problem, it rejects the data,
preventing data from being written to the I/O device. The PHB then enters a freeze mode,
halting normal operations. Depending on the model and type of I/O being used, the freeze
can include the entire PHB chip, or simply a single bridge, resulting in the loss of all I/O
operations that use the frozen hardware until a power-on reset of the PHB. The impact to
partitions depends on how the I/O is configured for redundancy. In a server that is configured
for fail-over availability, redundant adapters spanning multiple PHB chips can enable the
system to recover transparently, without partition loss.
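The freeze behavior and the benefit of redundant adapters can be modeled with a short sketch.
The Phb class, PhbFrozenError, and write_with_failover are hypothetical names, and the valid
flag stands in for the hardware error check. The sketch treats each PHB as a single freeze
domain and does not model the case where only one bridge is frozen.

```python
class PhbFrozenError(Exception):
    """Raised when a write is rejected by a PHB (hypothetical name)."""


class Phb:
    def __init__(self, name):
        self.name = name
        self.frozen = False
        self.written = []  # Data that reached the I/O device

    def write(self, data, valid=True):
        if self.frozen:
            raise PhbFrozenError(f"{self.name} is frozen")
        if not valid:
            # Reject the bad data so it never reaches the device, then freeze.
            self.frozen = True
            raise PhbFrozenError(f"{self.name} rejected bad data and froze")
        self.written.append(data)

    def power_on_reset(self):
        # In this model, only a power-on reset ends the freeze.
        self.frozen = False


def write_with_failover(paths, data):
    """Write through the first path that is not frozen; return its name."""
    for phb in paths:
        try:
            phb.write(data)
            return phb.name
        except PhbFrozenError:
            continue
    raise PhbFrozenError("all redundant paths are frozen")
```

With two adapters on different PHBs, a freeze of the first PHB leaves the second path usable,
so the write still completes. With both adapters behind one PHB, the same freeze would stop
all I/O until the reset.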
4.2.7 PCI enhanced error handling
IBM estimates that PCI adapters can account for a significant portion of the hardware-based
errors on a large server. Although servers that rely on boot-time diagnostics can identify
failing components to be replaced by hot-swap and reconfiguration, runtime errors pose a
more significant problem.
PCI adapters are generally complex designs involving extensive on-board instruction
processing, often on embedded microcontrollers. They tend to use industry-standard-grade
components, with an emphasis on product cost relative to high reliability. In certain
cases, they might be more likely to encounter internal microcode errors or many of the
hardware errors described for the rest of the server.