Chapter 4. Continuous availability and manageability
If there are no CoD processor cores available system-wide, total processor capacity is
lowered below the licensed number of cores.
Single processor checkstop
As in POWER6, POWER7 provides single-processor check-stopping for certain processor
logic, command, or control errors that cannot be handled by the availability enhancements in
the preceding section.
This approach significantly reduces the probability of any one processor affecting total system
availability by containing most processor checkstops to the partition that was using the
processor at the time that the full checkstop goes into effect.
Even with all these availability enhancements to prevent processor errors from affecting
system-wide availability, errors might still result in a system-wide outage.
4.2.3 Memory protection
A memory protection architecture that provides good error resilience for a relatively small L1
cache might be inadequate for protecting the much larger system main store. Therefore,
a variety of protection methods are used in POWER processor-based systems to avoid
uncorrectable errors in memory.
Memory protection plans must take into account many factors, including:
Size
Desired performance
Memory array manufacturing characteristics
POWER7 processor-based systems have a number of protection schemes designed to
prevent, protect, or limit the effect of errors in main memory. These capabilities include:
64-byte ECC code
This innovative ECC algorithm from IBM research allows a full 8-bit device kill to be
corrected dynamically. This ECC code mechanism works on DIMM pairs on a rank basis.
(Depending on the size, a DIMM might have one, two, or four ranks.) With this ECC code,
an entirely bad DRAM chip can be marked as bad (chip mark). After marking the DRAM
as bad, the code corrects all the errors in the bad DRAM. It can additionally mark a 2-bit
symbol as bad and correct that symbol, providing double-error-detect, single-error-correct
(or better) protection in addition to the detection or correction of a
chipkill event.
This improvement in the ECC word algorithm replaces the redundant bit steering used on
POWER6 systems.
The Power 770 and 780, and future POWER7 high-end machines, have a spare DRAM
chip per rank on each DIMM that can be spared out. Effectively, this protection means that,
on a rank basis, a DIMM pair can detect and correct two and sometimes three chipkill
events and still provide better protection than ECC, as explained in the previous paragraph.
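To make the phrase "double-error detect or single-error correct" concrete, the following is a minimal sketch of a textbook extended Hamming (8,4) code. It is not the IBM 64-byte ECC algorithm, which works on multi-bit symbols across a DIMM pair. The function names `encode` and `decode` are hypothetical and chosen only for this illustration.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (toy example)."""
    code = [0] * 8
    # Data bits occupy positions 3, 5, 6, 7 (least significant data bit first).
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    # Parity bits at positions 1, 2, 4 each cover the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 holds an overall parity bit so the whole codeword has even parity.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is 0 for a valid codeword
    overall_odd = sum(code) % 2 == 1

    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd parity: treat as a single flipped bit. The syndrome is its position
        # (0 means the overall parity bit itself was flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity with a nonzero syndrome: two bits flipped, detect only.
        return "uncorrectable", None

    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble
```

In this toy code, any single flipped bit is repaired and any two flipped bits are reported rather than silently miscorrected. The chip mark and symbol mark described above go further by correcting a whole failed device, which a bit-level code of this kind cannot do.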
Hardware scrubbing
Hardware scrubbing is a method used to deal with intermittent errors. IBM POWER
processor-based systems periodically address all memory locations. Any memory
locations with a correctable error are rewritten with the correct data.
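The value of periodic scrubbing can be illustrated with a small simulation. This is a simplified model, not the hardware implementation: it assumes each memory word is protected by a code that corrects one flipped bit, and that a word becomes uncorrectable once two bits are flipped at the same time. The function `count_uncorrectable` and its parameters are hypothetical names used only for this sketch.

```python
def count_uncorrectable(faults, n_ticks, scrub_interval=None):
    """Count words that ever hold two or more flipped bits at once.

    faults: iterable of (tick, address, bit) intermittent-error events.
    scrub_interval: scrub all memory every this many ticks; None disables scrubbing.
    """
    # Group the fault events by the tick on which they occur.
    by_tick = {}
    for tick, addr, bit in faults:
        by_tick.setdefault(tick, []).append((addr, bit))

    flipped = {}   # address -> set of bit positions currently in error
    lost = set()   # addresses that exceeded the single-bit correction limit

    for tick in range(n_ticks):
        for addr, bit in by_tick.get(tick, []):
            bits = flipped.setdefault(addr, set())
            bits.symmetric_difference_update({bit})   # flipping a bit twice restores it
            if len(bits) >= 2:
                lost.add(addr)

        # Scrub pass: visit every location and rewrite any correctable word.
        if scrub_interval and (tick + 1) % scrub_interval == 0:
            for addr, bits in flipped.items():
                if addr not in lost and len(bits) == 1:
                    bits.clear()

    return len(lost)


# Two errors hit address 5 ten ticks apart; a third hits address 9.
events = [(0, 5, 1), (2, 9, 0), (10, 5, 3)]
print(count_uncorrectable(events, n_ticks=20))                    # prints 1
print(count_uncorrectable(events, n_ticks=20, scrub_interval=4))  # prints 0
```

Without scrubbing, the two single-bit errors at address 5 accumulate into an uncorrectable word. With a scrub pass every four ticks, the first error is rewritten with correct data before the second one arrives.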