Specifications
Configuring and Deconfiguring Processors or Memory
All failures that crash the system with a machine check or check stop, even if
intermittent, are reported as a diagnostic callout for service repair. To prevent the
recurrence of intermittent problems and improve the availability of the system until a
scheduled maintenance window, processors and memory DIMMs with a failure history
are marked "bad" so that they are not configured on subsequent boots.
A processor or memory DIMM is marked "bad" under the following circumstances:
- A processor or memory DIMM fails built-in self-test (BIST) or power-on self-test
  (POST) during boot (as determined by the service processor).
- A processor or memory DIMM causes a machine check or check stop during run
  time, and the failure can be isolated specifically to that processor or memory DIMM
  (as determined by the processor run-time diagnostics in the service processor).
- A processor or memory DIMM reaches a threshold of recovered failures that results
  in a predictive callout (as determined by the processor run-time diagnostics in the
  service processor).
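The three circumstances listed above amount to a simple "any of these" rule. The following Python sketch is illustrative only: the FailureRecord type, its field names, and the should_mark_bad function are hypothetical and do not correspond to any actual service processor interface.

```python
from dataclasses import dataclass


@dataclass
class FailureRecord:
    """Hypothetical failure history for one processor or memory DIMM."""
    failed_bist_or_post: bool = False    # failed BIST or POST during boot
    caused_isolated_check: bool = False  # machine check or check stop isolated to this part
    predictive_callout: bool = False     # recovered-failure threshold reached


def should_mark_bad(record: FailureRecord) -> bool:
    """Return True if any one of the three documented circumstances holds."""
    return (
        record.failed_bist_or_post
        or record.caused_isolated_check
        or record.predictive_callout
    )
```

A part with a clean history is left configurable; a single matching circumstance is enough to mark it.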
During boot time, the service processor does not configure processors or memory
DIMMs that are marked "bad."
If a processor or memory DIMM is deconfigured, it remains offline for subsequent
reboots until it is replaced or Repeat Gard is disabled.
The Repeat Gard function also allows users to manually deconfigure a processor or
memory DIMM, or re-enable a previously deconfigured processor or memory DIMM. For
information on configuring or deconfiguring a processor, see the Processor
Configuration/Deconfiguration Menu on page 259. For information on configuring or
deconfiguring a memory DIMM, see the Memory Configuration/Deconfiguration Menu on
page 260. Both of these are submenus under the System Information Menu.
You can enable or disable CPU Repeat Gard or Memory Repeat Gard using the
Processor Configuration/Deconfiguration Menu, which is a submenu under the System
Information Menu.
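The boot-time behavior described in this section can be pictured as a filter over the installed parts. The Python sketch below is a simplified model, not the firmware's logic: the Component type, the components_to_configure function, and the repeat_gard_enabled flag are hypothetical names, and the treatment of a disabled Repeat Gard (marked parts are configured again) is an assumption drawn from the statement that a deconfigured part stays offline until it is replaced or Repeat Gard is disabled.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical entry for one processor or memory DIMM."""
    name: str
    marked_bad: bool = False


def components_to_configure(components, repeat_gard_enabled=True):
    """Return the parts that would be configured on this boot.

    Assumption: with Repeat Gard enabled, parts marked bad stay offline;
    with it disabled, the mark no longer keeps a part offline.
    """
    if not repeat_gard_enabled:
        return list(components)
    return [c for c in components if not c.marked_bad]


parts = [Component("cpu0"), Component("cpu1", marked_bad=True), Component("dimm0")]
online = [c.name for c in components_to_configure(parts)]
```

In this model, cpu1 is skipped while Repeat Gard is enabled and is configured again once it is disabled.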
Run-Time CPU Deconfiguration (CPU Gard)
L1 instruction cache recoverable errors, L1 data cache correctable errors, and L2 cache
correctable errors are monitored by the processor run-time diagnostics (PRD) code
running in the service processor. When a predefined error threshold is met, an error log
entry with warning severity and threshold-exceeded status is returned to AIX. At the
same time, PRD marks the CPU for deconfiguration at the next boot. AIX then attempts to
migrate all resources associated with that processor to another processor and stops
the defective processor.
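The threshold mechanism described above can be modeled as a per-CPU error counter. The Python sketch below is a minimal illustration under stated assumptions: the CpuGardMonitor class, the error-kind strings, and the log entry fields are hypothetical; the threshold value is supplied by the caller because this section does not state it; and counting all three error types against a single per-CPU total is an assumption, since PRD may track them separately.

```python
from collections import Counter

# Hypothetical labels for the three monitored error types.
MONITORED_ERRORS = {
    "l1_icache_recoverable",
    "l1_dcache_correctable",
    "l2_cache_correctable",
}


class CpuGardMonitor:
    """Toy model of threshold-based CPU deconfiguration marking."""

    def __init__(self, threshold):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.counts = Counter()           # recovered errors seen per CPU
        self.marked_for_deconfig = set()  # CPUs to leave offline at next boot

    def record_error(self, cpu_id, kind):
        """Count one recovered error.

        Returns a log entry the first time the CPU meets the threshold,
        otherwise None.
        """
        if kind not in MONITORED_ERRORS:
            raise ValueError(f"unmonitored error kind: {kind}")
        self.counts[cpu_id] += 1
        if (self.counts[cpu_id] >= self.threshold
                and cpu_id not in self.marked_for_deconfig):
            self.marked_for_deconfig.add(cpu_id)
            return {"cpu": cpu_id, "severity": "warning",
                    "status": "threshold exceeded"}
        return None
```

The returned entry stands in for the error log entry reported to the operating system; migrating work off the processor and stopping it is left to the operating system and is not modeled here.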
Chapter 7. Using the Service Processor 277