Technical information

IBM Europe, Middle East, and Africa Hardware

Announcement ZG14-0098

IBM is a registered trademark of International Business Machines Corporation

be repaired. An ECC uncorrectable error detected in these caches can also trigger a

purge and delete of cache lines. This results in no loss of operation if the cache lines

contained data unmodified from what was stored in system memory.

Modified data would be handled through Special Uncorrectable Error handling. L1

data and instruction caches also have a retry capability for intermittent errors and a

cache set delete mechanism for handling solid failures.

Special Uncorrectable Error handling

Special Uncorrectable Error (SUE) handling prevents an uncorrectable error in

memory or cache from immediately causing the system to terminate. Rather, the

system tags the data and determines whether it will ever be used again. If the data

is never used, the error is irrelevant and does not force a check stop. If the data is

used, termination may be limited to the program, kernel, or hypervisor that owns the

data. If the data is instead transferred to an I/O device, the I/O adapters controlled

by the I/O hub controller are frozen.
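
The SUE decision flow described above can be sketched as a toy model. This is an illustration only, not IBM firmware: the names DataLine, sue_tagged, on_uncorrectable_error, on_discard, and on_use are hypothetical, and the returned strings merely label the outcomes named in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Owner(Enum):
    PROGRAM_OR_KERNEL = "program/kernel"
    HYPERVISOR = "hypervisor"


@dataclass
class DataLine:
    owner: Owner
    sue_tagged: bool = False  # hypothetical flag marking the data as unusable


def on_uncorrectable_error(line: DataLine) -> None:
    # Tag the data instead of stopping the whole system.
    line.sue_tagged = True


def on_discard(line: DataLine) -> str:
    # Tagged data that is never used again does not force a check stop.
    return "no action"


def on_use(line: DataLine, to_io_device: bool = False) -> str:
    # Containment is triggered only when tagged data is actually consumed.
    if not line.sue_tagged:
        return "ok"
    if to_io_device:
        return "freeze I/O adapters under the hub controller"
    return f"terminate {line.owner.value}"
```

The point of the model is that the error is acted on at the point of use, so the scope of termination follows the consumer of the data rather than the location of the fault.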

PCI extended error handling

PCI extended error handling (EEH)-enabled adapters respond to a special data

packet generated from the affected PCI slot hardware by calling system firmware,

which will examine the affected bus, allow the device driver to reset it, and continue

without a system reboot. For Linux, EEH support extends to the majority of

frequently used devices, although some third-party PCI devices may not provide

native EEH support.
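
The EEH recovery sequence above (slot signals an error, firmware examines the bus, the device driver resets the device, operation continues) can be sketched as a small state machine. Everything here is a hypothetical model for illustration: Slot, SlotState, report_error, and firmware_recover are invented names, and the behavior on a failed reset (the slot stays frozen) is an assumption, not something the announcement states.

```python
from enum import Enum, auto


class SlotState(Enum):
    NORMAL = auto()
    FROZEN = auto()


class Slot:
    def __init__(self):
        self.state = SlotState.NORMAL
        self.log = []

    def report_error(self):
        # Slot hardware signals the error; the slot is isolated.
        self.state = SlotState.FROZEN
        self.log.append("slot signalled error")


def firmware_recover(slot, driver_reset):
    """Model of firmware-led recovery; driver_reset is a callable returning bool."""
    if slot.state is not SlotState.FROZEN:
        return False  # nothing to recover
    slot.log.append("firmware examined bus")
    if driver_reset():
        slot.state = SlotState.NORMAL
        slot.log.append("driver reset device; no reboot needed")
        return True
    # Assumption: a failed reset leaves the slot isolated.
    slot.log.append("reset failed; slot stays frozen")
    return False
```

Passing the reset as a callable mirrors the division of labor in the text: firmware coordinates the recovery, while the device driver performs the actual reset.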

Predictive failure and dynamic component deallocation

Servers with Power processors have long had the capability to perform predictive

failure analysis on certain critical components such as processors and memory.

When these components exhibit certain symptoms that may indicate a failure is

imminent, the system can dynamically deallocate the failing part and, when enabled,

call home about it before the error is propagated system-wide. In many cases,

the system will first attempt to reallocate resources in such a way that will avoid

unplanned outages. In the event that insufficient resources exist to maintain full

system availability, these servers will attempt to maintain partition availability by

user-defined priority.
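
A minimal sketch of the two ideas above, under stated assumptions: a component is flagged once its correctable-error count reaches a threshold, and when resources run short, partitions are kept in order of user-defined priority. The threshold value, the Partition fields, and the convention that a lower priority number is more important are all assumptions made for illustration; the announcement does not specify how the system decides.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    priority: int      # assumed convention: lower number = more important
    cores_needed: int


def predict_failure(correctable_errors: int, threshold: int = 10) -> bool:
    # Hypothetical rule: too many corrected errors suggests imminent failure.
    return correctable_errors >= threshold


def keep_partitions(partitions, cores_available):
    """Return names of partitions that still fit, highest priority first."""
    kept = []
    for p in sorted(partitions, key=lambda p: p.priority):
        if p.cores_needed <= cores_available:
            kept.append(p.name)
            cores_available -= p.cores_needed
    return kept
```

For example, with 6 cores left after a deallocation, partitions "db" (priority 1, 4 cores), "web" (priority 2, 2 cores), and "batch" (priority 3, 4 cores) would leave "db" and "web" running in this model.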

Uncorrectable error recovery

When the auto-restart option is enabled, the system can automatically restart

following an unrecoverable software error, hardware failure, or environmentally

induced (ac power) failure.

Serviceability

The purpose of serviceability is to efficiently repair the system while attempting to

minimize or eliminate impact to system operation. Serviceability includes system

installation, MES (system upgrades/downgrades), and system maintenance/repair.

Depending upon the system and warranty contract, service may be performed by

the customer, an IBM representative, or an authorized warranty service provider.

The serviceability features delivered in this system provide a highly efficient service

environment by incorporating the following attributes:

• Design for Customer Set Up (CSU), Customer Installed Features (CIF), and

Customer Replaceable Units (CRU)

• Error Detection and Fault Isolation (ED/FI)

• First Failure Data Capture (FFDC)

• Lightpath service indicators:

– Service labels and service diagrams available on the system and delivered

through IBM Knowledge Center

– Step-by-step service procedures documented in IBM Knowledge Center or

available through the Hardware Management Console