Specifications
158 IBM Power 770 and 780 Technical Overview and Introduction
Persistent deallocation
To enhance system availability, a component that is identified for deallocation or
deconfiguration on a POWER processor-based system is flagged for persistent deallocation.
Component removal can occur either dynamically (while the system is running) or at boot
time (IPL), depending both on the type of fault and when the fault is detected.
In addition, hardware that causes a runtime unrecoverable fault can be deconfigured from the
system after the first occurrence. The system can be rebooted immediately after the failure and
resume operation on the remaining stable hardware. This prevents the same faulty hardware
from affecting system operation again, and the repair action can be deferred to a more
convenient, less critical time.
Persistent deallocation functions include:
Processor
L2/L3 cache lines (cache lines are dynamically deleted)
Memory
I/O adapters (failing adapters are deconfigured or bypassed)
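
The persistence described above can be illustrated with a small model. This is a minimal sketch, not IBM firmware behavior: the `System` class, its `nvram` set, and the component names are hypothetical and stand in for whatever nonvolatile record the platform actually keeps. It shows only the idea that a flagged component is removed dynamically and then stays deconfigured at the next IPL.

```python
class System:
    """Toy model of persistent deallocation (all names are hypothetical)."""

    def __init__(self, installed, nvram):
        self.installed = list(installed)
        # `nvram` models a record that survives reboots.
        self.nvram = nvram
        # At boot time (IPL), flagged components are left out of the configuration.
        self.active = [c for c in self.installed if c not in self.nvram]

    def deallocate(self, component):
        # Flag the component persistently...
        self.nvram.add(component)
        # ...and remove it dynamically if it is currently in use.
        if component in self.active:
            self.active.remove(component)

    def reboot(self):
        # A reboot rebuilds the configuration from the same persistent record.
        return System(self.installed, self.nvram)


system = System(["core0", "core1", "dimm0"], nvram=set())
system.deallocate("core1")
rebooted = system.reboot()
print(rebooted.active)
```

In this model the faulty component cannot return to service until something clears its entry from the persistent record, which corresponds to the deferred repair action.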
Processor instruction retry
As in POWER6, the POWER7 processor supports processor instruction retry and
alternate processor recovery for a number of core-related faults. This ability significantly
reduces exposure to both permanent and intermittent errors in the processor core.
Intermittent errors, often caused by cosmic rays or other sources of radiation, are generally
not repeatable.
With this function, when an error is encountered in the core, in caches, or in certain logic
functions, the POWER7 processor first automatically retries the instruction. If the source of
the error was truly transient, the instruction succeeds and the system continues as before.
On IBM systems prior to POWER6, this error caused a checkstop.
Alternate processor retry
Hard failures are more difficult to handle because they are permanent errors that recur each
time the instruction is repeated. Retrying the instruction on the same core does not help in this
situation because the instruction continues to fail.
As in POWER6, POWER7 processors have the ability to extract the failing instruction from
the faulty core and retry it elsewhere in the system for a number of faults, after which the
failing core is dynamically deconfigured and scheduled for replacement.
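
The two recovery stages, retry on the same core and then retry on another core, can be sketched as follows. This is an illustrative model only: the real mechanism is implemented in hardware and firmware, and the `CoreFault` exception, the `run` callback, and the single same-core retry are assumptions made for the example.

```python
class CoreFault(Exception):
    """Raised by the hypothetical `run` callback when an instruction fails."""


def execute_with_recovery(instruction, cores, run):
    """Return (result, core_used, core_to_deconfigure).

    `run(core, instruction)` is a hypothetical callback that returns a
    result or raises CoreFault.
    """
    primary, alternates = cores[0], cores[1:]

    # First attempt plus one automatic retry on the same core
    # (the retry count of one is an assumption).
    for _ in range(2):
        try:
            return run(primary, instruction), primary, None
        except CoreFault:
            continue

    # The fault repeated, so treat it as a hard failure: move the
    # instruction to another core and flag the primary for deconfiguration.
    for alt in alternates:
        try:
            return run(alt, instruction), alt, primary
        except CoreFault:
            continue

    # No core could complete the instruction.
    raise CoreFault(f"unrecoverable fault executing {instruction!r}")
```

A transient fault is absorbed by the same-core retry and reports no core to deconfigure, whereas a hard fault returns the identity of the failing core so that the caller can deconfigure it and schedule its replacement.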
Dynamic processor deallocation
Dynamic processor deallocation enables automatic deconfiguration of processor cores when
patterns of recoverable core-related faults are detected. Dynamic processor deallocation
prevents a recoverable error from escalating to an unrecoverable system error, which might
otherwise result in an unscheduled server outage. Dynamic processor deallocation relies on
the service processor’s ability to use FFDC-generated recoverable error information to notify
the POWER Hypervisor when a processor core reaches its predefined error limit. The
POWER Hypervisor then dynamically deconfigures the failing core, which is called out for
replacement. The entire process is transparent to the partition that owns the failing instruction.
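
The threshold logic described above can be sketched as a per-core counter of recoverable errors. This is a minimal illustration under stated assumptions: the `DeallocationMonitor` class and the limit of 3 are hypothetical, and the source does not give the actual predefined error limit or how it is tracked.

```python
from collections import Counter


class DeallocationMonitor:
    """Counts recoverable errors per core (hypothetical model)."""

    def __init__(self, error_limit):
        self.error_limit = error_limit  # assumed value; the real limit is not stated
        self.counts = Counter()
        self.deconfigured = set()

    def record_recoverable_error(self, core_id):
        """Record one error; return True if this error triggers deconfiguration."""
        if core_id in self.deconfigured:
            return False  # the core is already out of service
        self.counts[core_id] += 1
        if self.counts[core_id] >= self.error_limit:
            # Models the hypervisor deconfiguring the core once notified.
            self.deconfigured.add(core_id)
            return True
        return False


monitor = DeallocationMonitor(error_limit=3)
for _ in range(3):
    triggered = monitor.record_recoverable_error("core2")
print(triggered, monitor.deconfigured)
```

Acting on a count of recoverable errors, rather than waiting for an unrecoverable one, is what allows the core to be removed before it causes an outage.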
If inactive processor cores or CoD processor cores are available, the system
effectively puts one of them into operation after an activated processor core is determined to
no longer be operational. In this way, the server retains its full processor capacity.
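
The substitution of a spare core can be sketched in a few lines. The function name `replace_failed_core` and the list-based representation of active and spare cores are assumptions for illustration; the sketch does not reflect how the hypervisor or CoD licensing actually manages cores.

```python
def replace_failed_core(active, spares, failed):
    """Return new (active, spares) lists after `failed` is deconfigured.

    If a spare (inactive or CoD) core exists, it takes the failed core's
    place so that the number of active cores is unchanged.
    """
    remaining = [core for core in active if core != failed]
    if spares:
        # Bring the first available spare into operation.
        return remaining + [spares[0]], spares[1:]
    # With no spare, the system continues with one core fewer.
    return remaining, list(spares)


active, spares = replace_failed_core(["c0", "c1", "c2"], ["cod0", "cod1"], "c1")
print(active, spares)
```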