Specifications
high availability
High availability (HA) is the hallmark of HP computer systems. But HP knows that delivering
solutions that fully enable the “always-on,” 24 x 7 operations demanded of today’s businesses
requires more than just delivering laundry lists of unusable HA features—or HA features with
limited utility. The high availability features of the HP Server rp8400 actually address the real
causes of customer downtime, as determined by actual field data from midrange computer users.
The HA features of the rp8400 can be classified as those that address per-partition reliability
and those that address inter-partition reliability, that is, single points of failure between hard
partitions.
partition reliability
The rp8400 has a design that is significantly “hardened” over other systems in its class. In fact,
many of the features in this midrange system can only be found in mainframes (or HP Superdome).
The reliability features within each rp8400 partition have been field-proven to provide high system
reliability. And many customers who have taken advantage of these features report significantly
lower hardware failure rates than with competitive systems.
CPU protection
The central processing unit is often a major cause of system downtime. For instance, CPU cache
errors have been shown to be a large contributor (in many cases, the greatest contributor) to
unplanned system downtime. Furthermore, addition or modification of CPU resources is among the
highest-ranking causes of planned hardware downtime. But in the rp8400, HP has designed
specific features to combat CPU-caused downtime, including:
• full error checking and correcting (ECC) on all caches
• automatic deconfiguration of “faulty” CPUs—known as dynamic processor resilience (DPR)
• a highly effective and reliable CPU cooling scheme
• CPU “hot-spares” using HP’s instant capacity on demand (iCOD)
• redundant CPU power converters
ECC on caches
The CPU caches in the rp8400 are fully protected from single-bit hard errors and random soft
errors generated by cosmic rays or other intermittent error sources. Some competitive
systems in the same class are not similarly protected, resulting in errors that are hard to debug
and that are in many cases blamed on the customer environment. Such cache errors in these
unprotected systems can result in failures that bring down multiple partitions.
Another advantage of the rp8400’s CPU cache is its layout, which significantly reduces the
chance of a multi-bit error due to a random cosmic ray strike. Such attention to detail is not found
in many designs available from other vendors.
automatic CPU deconfiguration
Dynamic processor resilience (DPR) refers to the ability of the system to detect and de-allocate
CPUs that are generating an excessive quantity of recoverable cache errors. This protects the
customer against the extremely unlikely event of a double-bit cache error by removing the
suspect CPU before such an error can occur and cause downtime.
Here’s how DPR works:
1. Processor detects single-bit error in data cache and vectors to processor-dependent code (PDC).
2. PDC generates a low-priority machine check (LPMC).
3. LPMC handler logs information to diag2 driver.
4. Diaglogd daemon pulls LPMC log information from diag2 and passes it to the HP Event
Monitoring Service (EMS) LPMC monitor.
5. If there have been too many LPMCs within 24 hours, the CPU is de-allocated (online). On an
iCOD machine, an online replacement is found.
6. System firmware is called to have PDC disable the processor the next time the system boots.
7. Event is generated to notify customer and HP.
This functionality is currently available for all CPUs in a partition except for the Monarch CPU.
(The Monarch processor refers to one processor that is selected during system boot and given
special boot and interrupt responsibilities.) Although the Monarch CPU will continue to correct
cache errors “on the fly,” it is not de-allocated until the next reboot. A future operating system
release will allow DPR of the Monarch processor.
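
The decision logic in steps 4 through 7 can be illustrated with a minimal sketch. It is a toy model only, not HP's implementation: the class and field names are hypothetical, the threshold value is an assumption (the text says only "too many" LPMCs), and the handling of the Monarch CPU is one reading of the paragraph above.

```python
from collections import deque
from dataclasses import dataclass, field

WINDOW_SECONDS = 24 * 60 * 60   # the 24-hour window from step 5
LPMC_THRESHOLD = 3              # hypothetical; the source only says "too many"


@dataclass
class Cpu:
    cpu_id: int
    is_monarch: bool = False
    online: bool = True
    disable_at_next_boot: bool = False
    lpmc_times: deque = field(default_factory=deque)


class LpmcMonitor:
    """Toy stand-in for the EMS LPMC monitor (steps 4 to 7)."""

    def __init__(self, cpus, icod_spares=0):
        self.cpus = {c.cpu_id: c for c in cpus}
        self.icod_spares = icod_spares   # inactive iCOD CPUs available
        self.events = []                 # step 7: notifications for customer and HP

    def record_lpmc(self, cpu_id, now):
        """Log one corrected single-bit cache error seen at time `now` (seconds)."""
        cpu = self.cpus[cpu_id]
        cpu.lpmc_times.append(now)
        # Drop entries that have fallen out of the 24-hour window.
        while cpu.lpmc_times and now - cpu.lpmc_times[0] > WINDOW_SECONDS:
            cpu.lpmc_times.popleft()
        if len(cpu.lpmc_times) > LPMC_THRESHOLD and not cpu.disable_at_next_boot:
            self._deallocate(cpu)

    def _deallocate(self, cpu):
        # Step 6: ask firmware to keep the CPU disabled at the next boot.
        cpu.disable_at_next_boot = True
        if cpu.is_monarch:
            # The Monarch CPU stays online until the next reboot.
            self.events.append(f"cpu {cpu.cpu_id}: monarch, deferred to reboot")
            return
        # Step 5: take the CPU offline while the system keeps running.
        cpu.online = False
        if self.icod_spares > 0:
            self.icod_spares -= 1
            self.events.append(f"cpu {cpu.cpu_id}: de-allocated, iCOD spare activated")
        else:
            self.events.append(f"cpu {cpu.cpu_id}: de-allocated")
```

In this sketch, a fourth LPMC on the same non-Monarch CPU within 24 hours takes that CPU offline and consumes one iCOD spare if any is available, while the same burst on the Monarch CPU only flags it for deconfiguration at the next boot.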










