Providing Open Architecture High Availability Solutions

ordered to execute a system reset operation. If this still does not clear the problem, it may be ordered to power itself off. Similarly, if a particular I/O controller has failed in a system, it may be ordered to isolate itself from the I/O bus. If this does not work, a second level of isolation may be to isolate a slot on the backplane, or even an entire I/O bus segment (at the point of a PCI to PCI bridge, for example).
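The escalation described above can be sketched as an ordered list of isolation actions, tried from least to most disruptive until one clears the fault. The Python sketch below is illustrative only: the step descriptions, the `escalate_isolation` function, and the `fault_cleared` callback are hypothetical stand-ins for whatever platform management interface a real system provides.

```python
from typing import Callable, List, Optional, Tuple

# Each step is (description, action). Actions are hypothetical stand-ins
# for platform-specific isolation commands.
IsolationStep = Tuple[str, Callable[[], None]]


def escalate_isolation(
    steps: List[IsolationStep],
    fault_cleared: Callable[[], bool],
) -> Optional[str]:
    """Try each isolation step in order; return the description of the
    first step after which the fault is cleared, or None if none worked."""
    for description, action in steps:
        action()
        if fault_cleared():
            return description
    return None


def demo() -> Optional[str]:
    # Simulated platform: the fault persists until the backplane slot
    # is isolated (an assumption made purely for this demonstration).
    state = {"isolated": set()}

    def isolate(level: str) -> Callable[[], None]:
        return lambda: state["isolated"].add(level)

    steps = [
        ("isolate I/O controller from bus", isolate("controller")),
        ("isolate backplane slot", isolate("slot")),
        ("isolate I/O bus segment at PCI to PCI bridge", isolate("segment")),
    ]
    return escalate_isolation(steps, lambda: "slot" in state["isolated"])
```

Ordering the steps from narrowest to widest scope keeps as much of the system in service as possible; the wider isolation levels are reached only when the narrower ones fail.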

8.3.4 Fault Domain Failure Recovery

Fault Domain Failure Recovery means that the system recovers from the failure of a fault domain; that is, it continues to provide its required service while no longer using the services of the failed fault domain.

The basic hardware capability needed is the ability to be reconfigured so that the system can continue to provide its services using only the remaining functional fault domains. The specific hardware requirements depend on the details of the system redundancy strategies employed. In some systems, no special hardware capabilities are needed to support recovery actions.

Examples of hardware capabilities that may be needed include:

• Reprogramming a spare Ethernet controller MAC address so it can assume the role of a failed controller
• Reprogramming a switched fabric routing network to bypass failed nodes
• Switching a PCI to PCI bridge from non-transparent to transparent mode so that a slave processor card can become a system controller
• Selectively enabling or disabling hardware components
• Triggering a hardware condition which in turn triggers an automatic failover to a functioning fault domain
• Reassigning a boot device and forcing a reboot to load a specific configuration
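As an illustration of the first example in the list, the sketch below models a spare Ethernet controller assuming the MAC address of a failed one. The `EthernetController` class, the `take_over` function, and the addresses are hypothetical; on real hardware the reprogramming would go through the controller's driver or management interface.

```python
from dataclasses import dataclass


@dataclass
class EthernetController:
    name: str
    mac: str
    enabled: bool = True


def take_over(failed: EthernetController, spare: EthernetController) -> None:
    """Disable the failed controller and give its MAC address to the spare,
    so that peers can keep addressing the same MAC."""
    failed.enabled = False   # selectively disable the failed component
    spare.mac = failed.mac   # reprogram the spare's MAC address
    spare.enabled = True     # bring the spare up in the failed one's role


# Example: eth1 is an idle spare that assumes the role of eth0.
primary = EthernetController("eth0", "02:00:00:00:00:01")
spare = EthernetController("eth1", "02:00:00:00:00:02", enabled=False)
take_over(primary, spare)
```

Because the spare inherits the address rather than advertising a new one, the rest of the network may not need to be told that a failover took place.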

As systems recover from failures of fault domains, a common problem is preserving or migrating the system state information held in failed system components. In some cases, the duplication of system state information between hardware fault domains may be a function of the hardware platform itself (via, for example, a shadow memory capability).
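A shadow memory capability can be pictured as a write path that mirrors every store into a second fault domain. The sketch below is only a software model of that idea; `ShadowedMemory` and its methods are hypothetical names, and a real platform would perform the mirroring in hardware rather than in software on each write.

```python
class ShadowedMemory:
    """Software model of memory whose writes are mirrored to a shadow copy
    held in a different fault domain."""

    def __init__(self, size: int) -> None:
        self.primary = bytearray(size)
        self.shadow = bytearray(size)  # lives in the backup fault domain
        self.primary_failed = False

    def write(self, addr: int, value: int) -> None:
        # Mirror each write so the shadow always matches the primary.
        if not self.primary_failed:
            self.primary[addr] = value
        self.shadow[addr] = value

    def fail_primary(self) -> None:
        # Simulate loss of the primary fault domain and its contents.
        self.primary_failed = True
        self.primary = bytearray(len(self.primary))

    def read(self, addr: int) -> int:
        # After a failure, state is served from the surviving copy.
        source = self.shadow if self.primary_failed else self.primary
        return source[addr]
```

In this model, a value written before `fail_primary()` is still returned by `read()` afterwards, which is the property that lets the surviving fault domain continue with current state.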

More often, system state information is transferred to backup fault domains via management middleware and/or application software. Even in these cases, hardware may support this operation by providing specific capabilities to perform hot, warm, or cold restarts. These designations indicate how state information in a system is preserved or destroyed as a result of a system reset operation. While there are no common formal definitions of these terms, a typical set of capabilities that a particular system might provide could be:

Hot Restart. The system reset operation clears and masks all interrupts, sets the program counter and all CPU registers to pre-determined values, does not alter any system RAM, and begins operation.

Warm Restart. The system reset operation clears and masks all interrupts, sets the program counter and all CPU registers to pre-determined values (perhaps different from hot restart), does not alter certain parts of system RAM containing in-memory databases, and begins software reboot operation.

Cold Restart. The system reset operation clears and masks all interrupts, sets the program counter and all CPU registers to pre-determined values (perhaps different from hot or warm restart), clears all system RAM, and begins software boot operation.
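One way to make the distinction concrete is to model what each restart level leaves intact. The sketch below follows the example definitions above; the `System` class, the RAM region names, and the reset values in `RESET_PC` are hypothetical, chosen only to show which state survives each kind of reset. Only the program counter is modeled; other CPU registers would be handled the same way.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Restart(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


# Hypothetical reset values; the text only says they may differ per level.
RESET_PC = {Restart.HOT: 0x1000, Restart.WARM: 0x2000, Restart.COLD: 0x0000}


@dataclass
class System:
    pc: int = 0
    interrupts_masked: bool = False
    pending_interrupts: int = 0
    # RAM modeled as named regions; "database" is the region that a
    # warm restart preserves in this sketch.
    ram: Dict[str, bytes] = field(default_factory=dict)

    def restart(self, kind: Restart) -> None:
        # Common to all three levels: clear and mask interrupts, and set
        # the program counter to a pre-determined value.
        self.pending_interrupts = 0
        self.interrupts_masked = True
        self.pc = RESET_PC[kind]
        if kind is Restart.WARM:
            # Keep only the in-memory database region.
            self.ram = {k: v for k, v in self.ram.items() if k == "database"}
        elif kind is Restart.COLD:
            # Clear all system RAM.
            self.ram = {}
        # HOT: RAM is left untouched.
```

Starting from RAM that holds both a "database" and a "heap" region, a hot restart keeps both, a warm restart keeps only "database", and a cold restart keeps neither.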