Providing Open Architecture High Availability Solutions

8.3.5 Fault Domain Repair
One of the most complex features of high availability systems is the need to repair failed fault
domains while the system continues to operate. The specific hardware capabilities required to
support this depend on the design of the fault domains. Because of this complexity, the
requirement for on-line repair of fault domains is often a major driver of the system
architecture, and of the identification of fault domains in the first place.
In general, the more kinds of fault domains a system has (i.e., the more the system design
is a fault tolerant one rather than a clustering one), the more specific hardware capabilities
will be required.
The capabilities needed to support on-line repair have several dimensions. These include:
• the basic ability to hot-swap fault domains
• notification to other parts of the system (e.g., the operating system and management
  middleware) that a system configuration has changed
• guidance to service personnel to help ensure correct repair actions
• on-line firmware upgradability
• maintenance of a system inventory
A few comments on each of these follow.
Hot-Swap
To support repair actions while the system remains active, the hardware must allow for the physical
removal and replacement of a fault domain while the redundant fault domains which are providing
service remain active. Furthermore, the hardware needs to be constructed to make this operation as
failsafe as possible. Even if repairs are carried out by trained technicians, something as simple as
a dropped screw can cause a system failure if the system was not designed for on-line repair
from the beginning. High availability systems are therefore generally designed to make
fault-domain removal and replacement a very simple and safe operation.
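As an illustration of the failsafe idea, the following minimal Python sketch shows a software
interlock that refuses to release a fault domain for removal unless a redundant domain in the
same group is still providing service. The names FaultDomain, State, can_remove and remove are
hypothetical and chosen for this example only; real platforms enforce this in hardware and
firmware as well.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class State(Enum):
    IN_SERVICE = "in_service"
    FAILED = "failed"
    REMOVED = "removed"


@dataclass
class FaultDomain:
    name: str
    group: str  # hypothetical redundancy group; members can stand in for each other
    state: State = State.IN_SERVICE


def can_remove(target: FaultDomain, domains: List[FaultDomain]) -> bool:
    """Allow removal only if another member of the same group is in service."""
    return any(
        d is not target and d.group == target.group and d.state is State.IN_SERVICE
        for d in domains
    )


def remove(target: FaultDomain, domains: List[FaultDomain]) -> None:
    """Mark the domain as removed, or refuse if service would be interrupted."""
    if not can_remove(target, domains):
        raise RuntimeError(f"removing {target.name} would interrupt service")
    target.state = State.REMOVED
```

With two power supplies in one group, the failed unit can be released while its healthy partner
cannot, which is the behavior a technician would want the system to enforce.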
Notification of Configuration Change
When a fault domain is repaired, the hardware platform configuration changes. At the least, a new
resource becomes available. This change needs to be communicated to the operating system
and/or application so that they can begin using the newly repaired fault domain.
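One common way to structure such notification is a publish/subscribe arrangement, sketched
below in Python under stated assumptions. ConfigurationNotifier, ResourcePool and the event
names "inserted" and "removed" are hypothetical; they stand in for whatever mechanism the
platform, operating system and management middleware actually provide.

```python
from typing import Callable, List, Set

# A listener receives the fault domain name and the kind of change.
Listener = Callable[[str, str], None]


class ConfigurationNotifier:
    """Hypothetical platform-side publisher of configuration changes."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def publish(self, domain: str, event: str) -> None:
        # Deliver the change to every registered listener in order.
        for listener in self._listeners:
            listener(domain, event)


class ResourcePool:
    """Hypothetical OS or application view of the usable fault domains."""

    def __init__(self) -> None:
        self.available: Set[str] = set()

    def on_change(self, domain: str, event: str) -> None:
        if event == "inserted":
            self.available.add(domain)
        elif event == "removed":
            self.available.discard(domain)
```

A pool registered with `notifier.subscribe(pool.on_change)` begins using a repaired fault domain
as soon as the platform publishes its "inserted" event.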
The minimum requirement in a high availability system is to be able to cope with the removal and
replacement of fault domains within a static configuration. However, it is highly desirable to be
able to modify the system configuration more dynamically, for example by adding or upgrading
hardware. While this is not required for failure avoidance, it allows some system upgrades to be
accomplished without having to schedule system outages.
Guidance to Help Ensure Correct Repair Actions
One of the most common causes of computer system failures is erroneous operator actions.
Performing on-line repair actions opens opportunities for errors. Therefore, a critical capability of
fault managed systems is to guide repair actions to help prevent errors. This includes providing
common, standardized presentations of system alarms, designing the system to permit very simple
repair procedures, and offering intuitive guidance through the repair itself.
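Guidance through a repair can be pictured as a procedure that presents one step at a time and
rejects actions taken out of order. The Python sketch below is only an illustration of that
idea; the GuidedRepair class and the step wording are assumptions made for this example, not
part of any particular product.

```python
from typing import Iterable, List


class GuidedRepair:
    """Hypothetical step-by-step repair procedure that enforces ordering."""

    def __init__(self, steps: Iterable[str]) -> None:
        self._steps: List[str] = list(steps)
        self._next = 0  # index of the step the technician must perform next

    @property
    def done(self) -> bool:
        return self._next == len(self._steps)

    def current_instruction(self) -> str:
        if self.done:
            return "Repair complete."
        return f"Step {self._next + 1} of {len(self._steps)}: {self._steps[self._next]}"

    def confirm(self, step: str) -> None:
        """Accept the step only if it is the one currently expected."""
        if self.done or step != self._steps[self._next]:
            raise ValueError(
                f"unexpected action {step!r}; {self.current_instruction()}"
            )
        self._next += 1
```

Rejecting an out-of-order action and repeating the expected instruction is one simple way to
reduce the opportunity for operator error during an on-line repair.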