Chapter 4. Continuous availability and manageability 167

Sometimes an uncorrectable error is temporary in nature and occurs in data that can be

recovered from another repository. For example:

- Data in the instruction L1 cache is never modified within the cache itself. Therefore, an

uncorrectable error discovered in the cache is treated like an ordinary cache-miss, and

correct data is loaded from the L2 cache.

- The L2 and L3 caches of POWER7 processor-based systems can hold an unmodified

copy of data in a portion of main memory. In this case, an uncorrectable error simply

triggers a reload of a cache line from main memory.
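The two recovery cases above can be sketched as a small model. This is only an illustration under stated assumptions, not the POWER7 hardware logic: `CacheLine`, `read_line`, and `UncorrectableError` are hypothetical names, and the `backing` dictionary stands in for the next cache level or main memory.

```python
from dataclasses import dataclass


class UncorrectableError(Exception):
    """Signals corrupt data that has no clean copy elsewhere (illustrative)."""


@dataclass
class CacheLine:
    data: bytes
    modified: bool = False       # True if the line differs from the backing store
    uncorrectable: bool = False  # True if ECC reported an uncorrectable error


def read_line(cache: dict, backing: dict, addr: int) -> bytes:
    """Return the data at addr, reloading a clean but corrupt line like a miss."""
    line = cache.get(addr)
    if line is None or (line.uncorrectable and not line.modified):
        # Ordinary miss, or an unmodified line with bad ECC: the backing store
        # (the next cache level or main memory) still holds a correct copy.
        cache[addr] = CacheLine(backing[addr])
        return cache[addr].data
    if line.uncorrectable:
        # A modified line is the only copy, so a reload cannot recover it.
        raise UncorrectableError(hex(addr))
    return line.data
```

In this model, only a line that is both modified and uncorrectable cannot be recovered by a reload, which is the situation the SUE handling described next addresses.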

In cases where the data cannot be recovered from another source, a technique called Special

Uncorrectable Error (SUE) handling is used to prevent an uncorrectable error in memory or

cache from immediately causing the system to terminate. Instead, the system tags the data

and determines whether it will ever be used again.

- If the error is irrelevant, it does not force a checkstop.

- If the data is used, termination can be limited to the program, kernel, or hypervisor owning

the data, or a freezing of the I/O adapters that are controlled by an I/O hub controller if

data is to be transferred to an I/O device.

When an uncorrectable error is detected, the system modifies the associated ECC word,

thereby signaling to the rest of the system that the standard ECC is no longer valid. The

service processor is then notified and takes appropriate actions. When AIX V5.2 (or later) or

Linux is running and a process attempts to use the data, the operating system is informed of

the error. Depending on the operating system and firmware level, and on whether the data was

associated with a kernel or non-kernel process, the operating system might terminate entirely

or might terminate only the specific process associated with the corrupt data.

The entire system must be rebooted only when the corrupt data is used by the POWER

Hypervisor; the reboot preserves overall system integrity. If Active Memory Mirroring is

enabled, the entire system is protected and continues to run.
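The containment choices described for SUE handling can be summarized as a decision table. The sketch below is a simplification under stated assumptions: `Consumer`, `Action`, and `sue_action` are hypothetical names, and the fixed mapping for kernel and non-kernel data ignores the dependence on operating system and firmware level noted in the text.

```python
from enum import Enum, auto


class Consumer(Enum):
    NONE = auto()          # the tagged data is never referenced again
    USER_PROCESS = auto()  # non-kernel process
    KERNEL = auto()
    HYPERVISOR = auto()
    IO_TRANSFER = auto()   # data on its way to an I/O device


class Action(Enum):
    NO_ACTION = auto()
    TERMINATE_PROCESS = auto()
    TERMINATE_OS = auto()
    REBOOT_SYSTEM = auto()
    CONTINUE_RUNNING = auto()
    FREEZE_IO_ADAPTERS = auto()


def sue_action(consumer: Consumer, memory_mirroring: bool = False) -> Action:
    """Map the consumer of SUE-tagged data to a containment action (assumed mapping)."""
    if consumer is Consumer.HYPERVISOR:
        # Per the text, mirroring lets the system keep running; otherwise reboot.
        return Action.CONTINUE_RUNNING if memory_mirroring else Action.REBOOT_SYSTEM
    return {
        Consumer.NONE: Action.NO_ACTION,                 # irrelevant error, no checkstop
        Consumer.USER_PROCESS: Action.TERMINATE_PROCESS,
        Consumer.KERNEL: Action.TERMINATE_OS,
        Consumer.IO_TRANSFER: Action.FREEZE_IO_ADAPTERS,
    }[consumer]
```

For example, `sue_action(Consumer.HYPERVISOR, memory_mirroring=True)` returns `Action.CONTINUE_RUNNING`, whereas the same call without mirroring returns `Action.REBOOT_SYSTEM`.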

Depending on the system configuration and the source of the data, errors encountered during

I/O operations might not result in a machine check. Instead, the incorrect data is handled by

the PCI host bridge (PHB) chip. When the PHB chip detects a problem, it rejects the data,

preventing data from being written to the I/O device. The PHB then enters a freeze mode,

halting normal operations. Depending on the model and type of I/O being used, the freeze

can include the entire PHB chip, or simply a single bridge, resulting in the loss of all I/O

operations that use the frozen hardware until a power-on reset of the PHB. The impact to

partitions depends on how the I/O is configured for redundancy. In a server that is configured

for fail-over availability, redundant adapters spanning multiple PHB chips can enable the

system to recover transparently, without partition loss.
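The freeze behavior and the benefit of redundant paths can be illustrated with a toy model. It is a sketch only: the `PHB` class, its methods, and `write_with_failover` are hypothetical and do not correspond to any firmware or operating system interface; the `corrupted_on` argument simply simulates bad data arriving at a given bridge.

```python
class PHB:
    """Toy PCI host bridge: rejects bad data, then stays frozen until reset."""

    def __init__(self, name: str):
        self.name = name
        self.frozen = False
        self.written = []  # data that actually reached the I/O device

    def write(self, data: bytes, valid: bool = True) -> bool:
        if self.frozen:
            return False        # all I/O through frozen hardware is lost
        if not valid:
            self.frozen = True  # reject the data and enter freeze mode
            return False
        self.written.append(data)
        return True

    def power_on_reset(self) -> None:
        self.frozen = False


def write_with_failover(paths, data: bytes, corrupted_on=()):
    """Try each redundant path in order; return the PHB name that succeeded."""
    for phb in paths:
        if phb.write(data, valid=phb.name not in corrupted_on):
            return phb.name
    return None  # no usable path left: the partition loses this I/O
```

With two bridges configured, a corrupt transfer on the first freezes it and the write completes on the second, so the corrupt data never reaches the device and the caller sees no failure. With a single bridge the same event returns `None` until `power_on_reset` is called.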

4.2.7 PCI enhanced error handling

IBM estimates that PCI adapters can account for a significant portion of the hardware-based

errors on a large server. Although servers that rely on boot-time diagnostics can identify

failing components to be replaced by hot-swap and reconfiguration, runtime errors pose a