User's Guide

3-28
ESCALA T610 and PL 600T Users Guide
Configuring and Deconfiguring Processors or Memory
All failures that crash the system with a machine check or check stop, even if intermittent,
are reported as a diagnostic callout for service repair. To prevent the recurrence of
intermittent problems and improve the availability of the system until a scheduled
maintenance window, processors and memory DIMMs with a failure history are marked
“bad” to prevent their being configured on subsequent boots.
A processor or memory DIMM is marked “bad” under the following circumstances:
• A processor or memory DIMM fails built-in self test (BIST) or power-on self test (POST)
  testing during boot (as determined by the Service Processor).
• A processor or memory DIMM causes a machine check or check stop during runtime,
  and the failure can be isolated specifically to that processor or memory DIMM (as
  determined by the processor runtime diagnostics in the Service Processor).
• A processor or memory DIMM reaches a threshold of recovered failures that results in a
  predictive callout (as determined by the processor runtime diagnostics in the Service
  Processor).
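The three circumstances above amount to a logical OR over a component's failure history. The following Python sketch is only an illustration of that rule, not Service Processor firmware; the `FailureHistory` record, its field names, and `is_marked_bad` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class FailureHistory:
    """Hypothetical failure record for one processor or memory DIMM."""
    failed_bist_or_post: bool = False     # failed BIST or POST during boot
    isolated_runtime_fault: bool = False  # machine check or check stop isolated to this part
    predictive_callout: bool = False      # recovered-failure threshold reached


def is_marked_bad(history: FailureHistory) -> bool:
    """Return True if any one of the three documented circumstances applies."""
    return (
        history.failed_bist_or_post
        or history.isolated_runtime_fault
        or history.predictive_callout
    )
```

For example, `is_marked_bad(FailureHistory(predictive_callout=True))` evaluates to `True`, while a component with an empty history is not marked.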
During boot time, the Service Processor does not configure processors or memory DIMMs
that are marked “bad”.
If a processor or memory DIMM is deconfigured, the processor or memory DIMM remains
offline for subsequent reboots until it is replaced or Repeat Gard is disabled. The Repeat
Gard function also allows users to manually deconfigure a processor or memory DIMM, or
re-enable a previously deconfigured processor or memory DIMM. For information on
configuring or deconfiguring a processor, see the “Processor Configuration/Deconfiguration
Menu” on page 3-13. For information on configuring or deconfiguring a memory DIMM, see
the “Memory Configuration/Deconfiguration Menu” on page 3-14. Both of these are
submenus under the System Information Menu.
You can enable or disable CPU Repeat Gard or Memory Repeat Gard using the Processor
Configuration/Deconfiguration Menu, which is a submenu under the System Information
Menu.
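The boot-time behavior described in this section can be pictured with a short Python sketch. It is a simplified model under stated assumptions: the function name, the dictionary layout, and the single `repeat_gard_enabled` switch are hypothetical, and the sketch does not model replacing a part or the separate CPU and Memory Repeat Gard settings.

```python
def components_to_configure(components, repeat_gard_enabled=True):
    """Pick the components a boot would bring online in this simplified model.

    components: dict mapping a component name (hypothetical, e.g. "cpu1")
                to True if that component is marked bad.
    """
    if not repeat_gard_enabled:
        # With Repeat Gard disabled, the marked components are no longer held offline.
        return sorted(components)
    # With Repeat Gard enabled, components marked bad are not configured.
    return sorted(name for name, bad in components.items() if not bad)
```

With `{"cpu0": False, "cpu1": True}` as input, the sketch returns `["cpu0"]` while Repeat Gard is enabled and `["cpu0", "cpu1"]` once it is disabled.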
Run-Time CPU Deconfiguration (CPU Gard)
L1 instruction cache recoverable errors, L1 data cache correctable errors, and L2 cache
correctable errors are monitored by the processor runtime diagnostics (PRD) code running
in the Service Processor. When a predefined error threshold is met, an error log with
warning severity and threshold exceeded status is returned to AIX. At the same time, PRD
marks the CPU for deconfiguration at the next boot. AIX will attempt to migrate all resources
associated with that processor to another processor and then stop the defective processor.
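The threshold behavior described above can be sketched as a per-processor error counter. This Python example is illustrative only: the class, its method names, and the threshold value passed in are assumptions, since this guide does not state the actual predefined threshold or the error log format.

```python
class RecoverableErrorMonitor:
    """Hypothetical model of threshold-based CPU deconfiguration."""

    def __init__(self, threshold):
        self.threshold = threshold             # assumed value; the real one is predefined
        self.counts = {}                       # recoverable errors seen per CPU
        self.deconfigure_at_next_boot = set()  # CPUs marked for deconfiguration

    def record_error(self, cpu):
        """Count one recoverable cache error; return a log entry when the threshold is met."""
        self.counts[cpu] = self.counts.get(cpu, 0) + 1
        already_marked = cpu in self.deconfigure_at_next_boot
        if self.counts[cpu] >= self.threshold and not already_marked:
            self.deconfigure_at_next_boot.add(cpu)
            # Stand-in for the error log that is returned to AIX.
            return {"cpu": cpu, "severity": "warning", "status": "threshold exceeded"}
        return None
```

With a threshold of 3, the first two calls to `record_error("cpu2")` return `None`; the third returns the warning entry and adds `cpu2` to `deconfigure_at_next_boot`. The migration of resources and the stopping of the processor are carried out by AIX and are not modeled here.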