Technical information

IBM Europe, Middle East, and Africa Hardware
Announcement ZG14-0098
IBM is a registered trademark of International Business Machines Corporation
errors reported to the hypervisor, which supports operating system deallocation of a
page associated with a hard single-cell fault.
Mutual surveillance
The service processor monitors the operation of the firmware during the boot
process and also monitors the hypervisor for termination. The hypervisor monitors
the service processor and reports a service reference code when it detects surveillance
loss. In the PowerVM environment, the hypervisor performs a reset/reload if it detects
the loss of the service processor.
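The hypervisor side of this surveillance can be illustrated with a minimal sketch. It assumes a heartbeat-with-timeout scheme, which this announcement does not describe; the function names, action strings, and timeout value are all hypothetical and are not IBM firmware interfaces.

```python
def surveillance_lost(last_heartbeat: float, now: float, timeout: float) -> bool:
    """Assumed rule: surveillance is lost when no heartbeat arrived within `timeout` seconds."""
    return (now - last_heartbeat) > timeout


def hypervisor_check(sp_last_heartbeat: float, now: float, timeout: float = 5.0) -> list:
    """Return the (hypothetical) actions the hypervisor takes for the service processor."""
    if surveillance_lost(sp_last_heartbeat, now, timeout):
        # Per the text: report a service reference code, then (in the PowerVM
        # environment) reset/reload the service processor.
        return ["report_service_reference_code", "reset_reload_service_processor"]
    return []  # service processor is still responding; nothing to do
```

For example, hypervisor_check(sp_last_heartbeat=100.0, now=102.0) returns an empty list, while the same call with now=110.0 returns both actions.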
Environmental monitoring functions
The Power Systems family performs ambient and over-temperature monitoring and
reporting.
Availability enhancement functions
The Power Systems family continues to offer and introduce significant enhancements
designed to increase system availability.
POWER8 processor functions
As in POWER6®, POWER7, and POWER7+™, the POWER8 processor has the ability
to do processor instruction retry for some transient errors and alternate processor
recovery for a number of core-related faults. This significantly reduces exposure to
both hard (logic) and soft (transient) errors in the processor core. Soft failures in the
processor core are transient (intermittent) errors, often due to cosmic rays or other
sources of radiation, and generally are not repeatable. When an error is encountered
in the core, the POWER8 processor will first automatically retry the instruction. If the
source of the error was truly transient, the instruction will succeed and the system
will continue as before. On IBM systems prior to POWER6, this error would have
caused a checkstop.
Hard failures are more difficult, being true logical errors that will be replicated
each time the instruction is repeated. Retrying the instruction will not help in this
situation. As in POWER6, POWER7, and POWER7+ technology, processors have the
ability to extract the failing instruction from the faulty core and retry it elsewhere
in the system for a number of faults, after which the failing core is dynamically
deconfigured and called out for replacement in the PowerVM environment. These
features are designed to avoid a full system outage.
As in POWER6 and POWER7+, the POWER8 processor includes single processor
check stopping for certain faults that cannot be handled by the availability
enhancements described in the preceding section. This significantly reduces the
probability of any one processor affecting total system availability.
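The escalation described in this section (instruction retry, then alternate processor recovery, then single processor checkstop) can be sketched as a small decision function. This is an illustrative model only: the fault categories, outcome names, and the rule that a hard fault with no spare core available ends in a checkstop of that processor are assumptions made for the sketch, not a description of the actual POWER8 firmware logic.

```python
from enum import Enum, auto


class FaultKind(Enum):
    SOFT = auto()           # transient; not repeated when the instruction is retried
    HARD = auto()           # logic fault; repeats on the same core
    UNRECOVERABLE = auto()  # not handled by retry or alternate processor recovery


class Outcome(Enum):
    RETRIED_OK = auto()      # instruction retry succeeded on the same core
    MOVED_TO_SPARE = auto()  # instruction retried elsewhere, failing core deconfigured
    CORE_CHECKSTOP = auto()  # only the affected processor is checkstopped


def handle_core_error(fault: FaultKind, spare_core_available: bool) -> Outcome:
    """Return the hypothetical recovery outcome for one core error."""
    # Step 1: the instruction is always retried first; a truly transient
    # error does not recur, so execution continues as before.
    if fault is FaultKind.SOFT:
        return Outcome.RETRIED_OK
    # Step 2: a hard fault repeats on retry, so the failing instruction is
    # extracted and retried on another core (assumes one can be obtained).
    if fault is FaultKind.HARD and spare_core_available:
        return Outcome.MOVED_TO_SPARE
    # Step 3: remaining faults checkstop only the affected processor,
    # rather than the whole system.
    return Outcome.CORE_CHECKSTOP
```

For example, handle_core_error(FaultKind.HARD, spare_core_available=True) yields Outcome.MOVED_TO_SPARE in this model.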
Partition availability priority
Availability priorities can also be assigned to partitions. In the PowerVM
environment, if an alternate processor recovery event requires spare processor
resources to protect a workload and no other means of obtaining the spare
resources is available, the system determines which partition has the lowest
priority and attempts to claim the needed resource from it. On a properly
configured POWER8 processor-based server, this allows that capacity to be obtained
first from, for example, a test partition instead of a financial accounting system.
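The selection rule can be illustrated with a short sketch. The Partition record, the numeric priority scale (higher means more important), and the pick_donor helper are hypothetical names introduced here; the announcement does not specify how priorities are encoded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    name: str
    priority: int  # assumed scale: a higher number means a more important partition
    cores: int     # processor cores currently assigned


def pick_donor(partitions: List[Partition], protected_name: str) -> Optional[Partition]:
    """Pick the lowest-priority partition that can give up a core.

    The partition being protected is never chosen as its own donor.
    Returns None when no other partition has a core to give.
    """
    candidates = [p for p in partitions if p.name != protected_name and p.cores > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.priority)
```

With a test partition at priority 10 and a financial accounting partition at priority 200, a recovery event that must protect a third workload would take the core from the test partition.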
Cache availability
The L2 and L3 caches in the POWER8 processor and the L4 cache in the memory buffer
chip are protected with double-bit detect, single-bit correct error correction code
(ECC). In addition, a threshold of correctable errors detected on cache lines can
result in the data in the cache lines being purged and the cache lines removed from
further operation without requiring a reboot in the PowerVM environment. The L3
cache can also dynamically substitute a spare bit-line for a faulty bit-line,
allowing an entire faulty "column" of cache, impacting multiple cache lines, to