Providing Open Architecture High Availability Solutions

ordered to execute a system reset operation. If this still does not clear the problem, it may be
ordered to power itself off. Similarly, if a particular I/O controller has failed in a system, it may be
ordered to isolate itself from the I/O bus. If this does not work, a second level of isolation may be to
isolate a slot on the backplane, or even an entire I/O bus segment (at the point of a PCI-to-PCI
bridge, for example).
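
The escalation described above (try the least disruptive isolation first, then widen the scope) can be sketched as follows. This is a minimal illustration, not an interface defined by this document: the function `escalate_isolation`, the step labels, and the stand-in actions that simply return True or False are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Each step is (label, action). The action returns True if the fault was contained.
# Both the labels and the actions here are hypothetical placeholders.
IsolationStep = Tuple[str, Callable[[], bool]]


def escalate_isolation(steps: List[IsolationStep]) -> Optional[str]:
    """Try isolation actions in order, from least to most disruptive.

    Returns the label of the first step that contained the fault,
    or None if every step failed.
    """
    for label, action in steps:
        if action():
            return label
    return None


# Assumed scenario: the failed controller ignores the request to isolate
# itself, so isolating its backplane slot is the first step that succeeds.
io_controller_ladder: List[IsolationStep] = [
    ("isolate controller from I/O bus", lambda: False),
    ("isolate backplane slot", lambda: True),
    ("isolate I/O bus segment at PCI-to-PCI bridge", lambda: True),
]

contained_by = escalate_isolation(io_controller_ladder)
```

Ordering the steps by how much of the system they take out of service keeps the loss of capacity as small as the fault allows.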
8.3.4 Fault Domain Failure Recovery
Fault Domain Failure Recovery means having the system recover from the failure of a fault
domain; that is, having it continue to provide its required service while no longer using the
services of the failed fault domain.
The basic hardware capability needed is the ability to be reconfigured as required for the system to
continue to provide its services using just the remaining functional fault domains. The specific
hardware requirements depend on the details of the system redundancy strategies employed. It is
possible that no special hardware capabilities will be needed to support recovery actions.
Examples of the sorts of hardware capabilities which may be needed are:
• Reprogramming a spare Ethernet controller MAC address so it can assume the role of a failed
  controller
• Reprogramming a switched fabric routing network to bypass failed nodes
• Switching a PCI-to-PCI bridge from non-transparent to transparent mode so that a slave
  processor card can become a system controller
• Selectively enabling or disabling hardware components
• Triggering a hardware condition which in turn triggers an automatic failover to a functioning
  fault domain
• Reassigning a boot device and forcing a reboot to load a specific configuration
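
As an illustration of the first example in the list above, the sketch below models a spare Ethernet controller taking over the MAC address and role of a failed one. It is a simplified in-memory model, not a driver interface: the `EthernetController` class, the `fail_over` function, and the addresses are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class EthernetController:
    """Hypothetical model of a controller; not a real driver interface."""
    name: str
    mac: str
    active: bool


def fail_over(failed: EthernetController, spare: EthernetController) -> None:
    """Move the failed controller's MAC address and role to the spare."""
    if spare.active:
        raise ValueError("spare controller is already in service")
    failed.active = False   # stop using the failed fault domain
    spare.mac = failed.mac  # the spare now answers to the original address
    spare.active = True


# Made-up addresses for illustration only.
primary = EthernetController("eth0", "00:11:22:33:44:55", active=True)
standby = EthernetController("eth1", "00:11:22:33:44:66", active=False)
fail_over(primary, standby)
```

Because the spare answers to the original address, peers on the network do not need to learn a new one in order to keep communicating with the system.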
As systems recover from failures of fault domains, a common problem is the preservation and/or
migration of system state information in failed system components. In some cases, the duplication
of system state information between hardware fault domains may be a function of the hardware
platform itself (via, for example, a shadow memory capability).
More often, system state information is transferred to backup fault domains via management
middleware and/or application software. Even in these cases hardware may support this operation
by providing specific capabilities to perform hot, warm, or cold restarts. These designations
suggest how state information in a system is preserved or destroyed as a result of a system reset
operation. While there are no common formal definitions of these terms, a typical set of
capabilities a particular system might provide could be:
Hot Restart. System reset operation clears and masks all interrupts, sets program counter and all
CPU registers to pre-determined values, does not alter any system RAM, and begins operation.
Warm Restart. System reset operation clears and masks all interrupts, sets program counter and all
CPU registers to pre-determined values (perhaps different from hot restart), does not alter certain
parts of system RAM containing in-memory databases, and begins software reboot operation.
Cold Restart. System reset operation clears and masks all interrupts, sets program counter and all
CPU registers to pre-determined values (perhaps different from hot or warm restart), clears all
system RAM, and begins software boot operation.
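
The difference between the three restart types, as far as RAM contents are concerned, can be summarized in a short sketch. This follows the typical capabilities listed above rather than any formal definition. The names `RestartKind` and `ram_after_restart`, the region names, and the modeling of RAM as named regions are assumptions made for illustration. The sketch also assumes that a warm restart loses every region outside the preserved set, which the text above does not state.

```python
from enum import Enum
from typing import Dict, Set


class RestartKind(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


def ram_after_restart(
    ram: Dict[str, bytes], kind: RestartKind, preserved: Set[str]
) -> Dict[str, bytes]:
    """Return the RAM regions whose contents survive a reset of the given kind.

    ram maps hypothetical region names to their contents; preserved names the
    regions (for example, in-memory databases) that a warm restart leaves alone.
    """
    if kind is RestartKind.HOT:
        # Hot restart does not alter any system RAM.
        return dict(ram)
    if kind is RestartKind.WARM:
        # Warm restart keeps only the designated regions (an assumption:
        # everything else is treated as lost).
        return {region: data for region, data in ram.items() if region in preserved}
    # Cold restart clears all system RAM.
    return {}


ram = {"in_memory_db": b"subscriber table", "scratch": b"temp buffers"}
keep = {"in_memory_db"}

after_hot = ram_after_restart(ram, RestartKind.HOT, keep)
after_warm = ram_after_restart(ram, RestartKind.WARM, keep)
after_cold = ram_after_restart(ram, RestartKind.COLD, keep)
```

In all three cases interrupts are cleared and masked and the CPU registers are set to predetermined values, so only the treatment of RAM and the software start-up path differ.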