high availability

High availability (HA) is the hallmark of HP computer systems. But HP knows that delivering
solutions that fully enable the “always-on,” 24 x 7 operations demanded of today’s businesses
requires more than just delivering laundry lists of unusable HA features, or HA features with
limited utility. The high availability features of the HP Server rp8400 address the real
causes of customer downtime, as determined by field data from midrange computer users.

The HA features of the rp8400 can be classified as those that address per-partition reliability
and those that address inter-partition reliability (that is, single points of failure between hard
partitions).

partition reliability

The rp8400 has a design that is significantly “hardened” over other systems in its class. In fact,
many of the features in this midrange system can otherwise be found only in mainframes (or HP Superdome).
The reliability features within each rp8400 partition have been field-proven to provide high system
reliability, and many customers who have taken advantage of these features report significantly
lower hardware failure rates than with competitive systems.

CPU protection

The central processing unit is often a major cause of system downtime. For instance, CPU cache
errors have been shown to be a large contributor (in many cases, the greatest contributor) to
unplanned system downtime. Furthermore, addition or modification of CPU resources is among the
highest-ranking causes of planned hardware downtime. In the rp8400, HP has designed
specific features to combat CPU-caused downtime, including:

• full error checking and correcting (ECC) on all caches

• automatic deconfiguration of “faulty” CPUs, known as dynamic processor resilience (DPR)

• a highly effective and reliable CPU cooling scheme

• CPU “hot-spares” using HP’s instant capacity on demand (iCOD)

• redundant CPU power converters

ECC on caches

The CPU caches in the rp8400 are fully protected from single-bit hard errors and from random soft
errors generated by cosmic rays or other intermittent error sources. Some competitive
systems in the same class are not similarly protected, resulting in errors that are hard to debug
and that are in many cases blamed on the customer environment. Such cache errors in these
unprotected systems can result in failures that bring down multiple partitions.

Another advantage of the rp8400’s CPU cache is its layout, which significantly reduces the
chance of a multi-bit error due to a random cosmic ray strike. Such attention to detail is not found
in many designs available from other vendors.
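
To make the idea of single-bit correction concrete, here is a minimal sketch of an error-correcting code in Python. It uses the textbook Hamming(7,4) code purely as an illustration. This document does not say which ECC scheme the rp8400 caches use, so the function names and the code itself are assumptions, not a description of the hardware.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits (0/1 values) into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    w = list(codeword)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    # The syndrome, read as a binary number, is the 1-based position of the flipped bit.
    position = s1 + 2 * s2 + 4 * s3
    if position:
        w[position - 1] ^= 1  # correct the single-bit error in place
    return [w[2], w[4], w[5], w[6]], position


# Simulate a soft error: flip one stored bit, then read the data back.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1
recovered, position = hamming74_decode(stored)
```

A code like this corrects any one flipped bit per codeword but not two, which is why a layout that keeps a single particle strike from disturbing several bits of the same word matters.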

automatic CPU deconfiguration

Dynamic processor resilience (DPR) refers to the ability of the system to detect and de-allocate
CPUs that are generating an excessive number of recoverable cache errors. De-allocating such a CPU
early protects the customer against the extremely unlikely event of a double-bit cache error,
before such an error can occur and cause downtime.

Here’s how DPR works:

1. Processor detects single-bit error in data cache and vectors to processor-dependent code (PDC).

2. PDC generates a low-priority machine check (LPMC).

3. LPMC handler logs information to diag2 driver.

4. Diaglogd daemon pulls LPMC log information from diag2 and passes it to the HP Event
Monitoring Service (EMS) LPMC monitor.

5. If there have been too many LPMCs within 24 hours, the CPU is de-allocated (online). On an
iCOD machine, an online replacement is found.

6. System firmware is called to have PDC disable the processor the next time the system boots.

7. Event is generated to notify customer and HP.

This functionality is currently available for all CPUs in a partition except for the Monarch CPU.
(The Monarch processor is the one processor that is selected during system boot and given
special boot and interrupt responsibilities.) Although the Monarch CPU will continue to correct
cache errors “on the fly,” it is not de-allocated until the next reboot. A future operating system
release will allow DPR of the Monarch processor.
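
The decision logic in steps 5 through 7, together with the Monarch exception, can be sketched as a small Python model. This is an illustration only, not HP's implementation: the class name, method names, and action strings are hypothetical, and the threshold of three LPMCs is an assumption, since the text above says only "too many" within 24 hours.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour window named in step 5
LPMC_THRESHOLD = 3             # hypothetical value; the source says only "too many"


class LpmcMonitor:
    """Toy model of the LPMC monitor's decision logic (names are hypothetical)."""

    def __init__(self, monarch_cpu, icod_spares=()):
        self.monarch_cpu = monarch_cpu
        self.icod_spares = list(icod_spares)  # inactive iCOD CPUs, if any
        self.events = defaultdict(deque)      # cpu id -> recent LPMC timestamps
        self.deallocated = set()              # CPUs taken out of service online
        self.disable_at_boot = set()          # CPUs to be disabled at next boot
        self.notifications = []

    def record_lpmc(self, cpu, now):
        """Record one LPMC for `cpu` at time `now` (seconds); return actions taken."""
        if cpu in self.disable_at_boot:
            return []  # this CPU has already been dealt with
        recent = self.events[cpu]
        recent.append(now)
        # Forget LPMCs that have fallen out of the 24-hour window.
        while recent and now - recent[0] >= WINDOW_SECONDS:
            recent.popleft()
        if len(recent) < LPMC_THRESHOLD:
            return []
        actions = []
        if cpu != self.monarch_cpu:
            # Step 5: de-allocate online; on an iCOD system, bring in a spare.
            self.deallocated.add(cpu)
            actions.append("deallocate")
            if self.icod_spares:
                actions.append(f"activate-spare:{self.icod_spares.pop(0)}")
        # Step 6: have the CPU disabled at the next boot (this covers the Monarch too).
        self.disable_at_boot.add(cpu)
        actions.append("disable-at-next-boot")
        # Step 7: generate a notification event.
        self.notifications.append(f"cpu {cpu}: excessive LPMCs")
        actions.append("notify")
        return actions
```

In this model, a non-Monarch CPU that crosses the threshold is taken out of service immediately and replaced from the spare list if one exists, while the Monarch CPU is only flagged for the next boot, matching the behavior described above.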