Chapter 4. Continuous availability and manageability 159

If there are no CoD processor cores available system-wide, total processor capacity is lowered below the licensed number of cores.

Single processor checkstop

As in POWER6, POWER7 provides single-processor check-stopping for certain processor logic, command, or control errors that cannot be handled by the availability enhancements in the preceding section.

This significantly reduces the probability of any one processor affecting total system availability by containing most processor checkstops to the partition that was using the processor at the time that the full checkstop goes into effect.

Even with all these availability enhancements to prevent processor errors from affecting system-wide availability, errors might still result in a system-wide outage.
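The containment idea can be pictured with a small sketch. This is only an illustrative model, not the hypervisor's actual logic: the function name, the partition names, and the mapping of partitions to core IDs are all hypothetical.

```python
def partitions_hit_by_checkstop(failed_core, assignments):
    """Return the partitions that were using the failed core.

    `assignments` is a hypothetical map: partition name -> set of core IDs in use.
    """
    return sorted(name for name, cores in assignments.items() if failed_core in cores)


# Hypothetical layout: three partitions on dedicated cores.
assignments = {"lpar1": {0, 1}, "lpar2": {2}, "lpar3": {3, 4}}

affected = partitions_hit_by_checkstop(2, assignments)   # ['lpar2']
survivors = sorted(set(assignments) - set(affected))     # ['lpar1', 'lpar3']
```

In this model, a checkstop on core 2 takes down only the partition that was running on it, and the other partitions keep running.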

4.2.3 Memory protection

A memory protection architecture that provides good error resilience for a relatively small L1 cache might be inadequate for protecting the much larger system main store. Therefore, a variety of protection methods is used in POWER processor-based systems to avoid uncorrectable errors in memory.

Memory protection plans must take into account many factors, including:

• Size
• Desired performance
• Memory array manufacturing characteristics

POWER7 processor-based systems have a number of protection schemes designed to prevent, protect, or limit the effect of errors in main memory. These capabilities include:

• 64-byte ECC code

  This innovative ECC algorithm from IBM Research allows a full 8-bit device kill to be corrected dynamically. This ECC code mechanism works on DIMM pairs on a rank basis. (Depending on the size, a DIMM might have one, two, or four ranks.) With this ECC code, an entirely bad DRAM chip can be marked as bad (chip mark). After marking the DRAM as bad, the code corrects all the errors in the bad DRAM. It can additionally mark a 2-bit symbol as bad and correct the 2-bit symbol, providing double-error-detect/single-error-correct ECC, or a better level of protection, in addition to the detection and correction of a chipkill event.

  This improvement in the ECC word algorithm replaces the redundant bit steering used on POWER6 systems.

  The Power 770 and 780, and future POWER7 high-end machines, have a spare DRAM chip per rank on each DIMM that can be used to spare out a failing chip. Effectively, this protection means that on a rank basis, a DIMM pair can detect and correct two and sometimes three chipkill events and still provide better protection than the ECC described above.

• Hardware scrubbing

  Hardware scrubbing is a method used to deal with intermittent errors. IBM POWER processor-based systems periodically address all memory locations. Any memory locations with a correctable error are rewritten with the correct data.
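The scrubbing pass and the single-error-correct, double-error-detect behavior described above can be illustrated with a toy model. The sketch below is an assumption-laden stand-in: it uses a textbook extended Hamming(8,4) code on 4-bit values, not the 64-byte ECC used by POWER7, and the names `encode`, `decode`, and `scrub` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (toy code)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8                      # bits[1..7]: Hamming positions, bits[0]: overall parity
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) % 2
    return sum(b << i for i, b in enumerate(bits))


def decode(word):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(bits) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        bits[syndrome] ^= 1             # single-bit error; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"    # even parity with nonzero syndrome: two bits flipped
    nibble = bits[3] | bits[5] << 1 | bits[6] << 2 | bits[7] << 3
    return nibble, status


def scrub(memory):
    """Visit every location once and rewrite correctable errors in place."""
    counts = {"ok": 0, "corrected": 0, "uncorrectable": 0}
    for addr, word in enumerate(memory):
        nibble, status = decode(word)
        counts[status] += 1
        if status == "corrected":
            memory[addr] = encode(nibble)   # write the corrected data back
    return counts


memory = [encode(n) for n in (0x3, 0xA, 0xF, 0x0)]
memory[1] ^= 1 << 5                     # one flipped bit: correctable
memory[2] ^= (1 << 2) | (1 << 6)        # two flipped bits: detectable only

first_pass = scrub(memory)    # {'ok': 2, 'corrected': 1, 'uncorrectable': 1}
second_pass = scrub(memory)   # {'ok': 3, 'corrected': 0, 'uncorrectable': 1}
```

After the first pass, the single-bit error has been rewritten with correct data, so the second pass finds that location clean. The double-bit error is only detected and remains in place, which is the kind of case the stronger chip-mark and spare-DRAM mechanisms are meant to address.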