Providing Open Architecture High Availability Solutions

• Verifying Required System Component Population (System Model). This technique

determines what components should be present in the system to provide a particular service. It

works from a defined system model, detecting which components in the system model are

present and whether those components are functional. Interdependencies among the components are

also tracked and analyzed.

• Obtaining Detailed Information about System Components (FRU). This technique allows

the system to obtain specification information about each component. Commonly referred to

as field replaceable unit (FRU) information, it includes items such as class, product,

manufacturer, and revision. FRU information of system components is necessary for the repair

step in fault management. The capability to obtain FRU information can reduce the MTTR by

allowing a service technician to correctly identify replacement components before physically

inspecting the component. FRU information can also be used during hot-swap events to

determine whether the component is needed in the system and whether the appropriate supporting

components are available (for example, a device coupled with its device driver). Hot-swapped components

can be left isolated while additional components are discovered.

• Establishing System Configuration. In establishing the system configuration, the system

components are assembled into fault domains, which are necessary to provide the appropriate

level of redundancy. Once all of the system components have been detected and identified,

system configuration can define the relationships between components that provide redundancy.

• Data Collection. Information is collected and consolidated within the system model in order

to monitor the health and state information of each component in the system. This information

is available both locally and remotely.

• Role Assignment. When redundant components are grouped together to provide a service,

they are assigned a redundancy policy such as N+1 or 2N. To support a specific policy, components

are assigned operational roles such as active, standby, and spare. In the event of a component

failure, these roles may be reassigned to maintain service. For example, if an active component

fails, the standby component may be reassigned the role of active.
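
Two of the techniques above, verifying the required component population and reassigning roles after a failure, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the Component fields, the missing_or_failed and fail_over helpers, and the choice to clear the role of a failed component are hypothetical and are not defined by this document or by any particular HA middleware.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    SPARE = "spare"


@dataclass
class Component:
    # Hypothetical system-model entry; field names are illustrative only.
    name: str
    present: bool = True
    functional: bool = True
    depends_on: list[str] = field(default_factory=list)
    role: Optional[Role] = None


def missing_or_failed(model: dict[str, Component], required: list[str]) -> list[str]:
    """Walk the required components and their dependencies; return the
    names that are absent from the model, not present, or not functional."""
    problems: list[str] = []
    seen: set[str] = set()
    stack = list(required)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        comp = model.get(name)
        if comp is None or not comp.present or not comp.functional:
            problems.append(name)
            continue
        # Only healthy components have their dependencies examined.
        stack.extend(comp.depends_on)
    return sorted(problems)


def fail_over(group: list[Component], failed: Component) -> Optional[Component]:
    """Mark `failed` as non-functional. If it held the active role,
    promote the first healthy standby in the group and return it."""
    failed.functional = False
    was_active = failed.role is Role.ACTIVE
    failed.role = None  # assumption: a failed component holds no role
    if not was_active:
        return None
    for comp in group:
        if comp.role is Role.STANDBY and comp.present and comp.functional:
            comp.role = Role.ACTIVE
            return comp
    return None
```

In this sketch a service is considered deliverable only when missing_or_failed returns an empty list. A 2N or N+1 policy would decide how many standby and spare members each group holds; the sketch leaves that choice to the caller.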

5.4 Interfaces to System Components

5.4.1 Introduction

Health status and state information from each system component is communicated to other system

components within a system. This is referred to as intra-system communication. This information

may also be reported externally to the system. When reported externally, it may be reported in a

raw form, indicating the health status and state of each component, or it may be summarized to

indicate the aggregate health of the system as a whole.
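
The difference between raw and summarized reporting can be sketched in Python as follows. The Health levels and the summary rule (the worst component health stands in for the whole system) are assumptions chosen for illustration; this document does not prescribe how aggregate health is computed, and a redundancy-aware system could reasonably summarize differently.

```python
from enum import IntEnum


class Health(IntEnum):
    # Hypothetical health levels, ordered from best to worst.
    OK = 0
    DEGRADED = 1
    FAILED = 2


def raw_report(status: dict[str, Health]) -> dict[str, str]:
    """Raw form: one entry per component, sorted by component name."""
    return {name: health.name for name, health in sorted(status.items())}


def summary_report(status: dict[str, Health]) -> str:
    """Summarized form: report the worst health found (assumed rule).
    An empty system is reported as OK."""
    return max(status.values(), default=Health.OK).name
```

For example, a system containing one healthy fan and one degraded power supply would yield two entries from raw_report and the single value "DEGRADED" from summary_report.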

System components have states and need to be managed and controlled in a coordinated fashion.

While these components may vary widely in type and function, an HA system must be able to 1)

access the components’ attributes to obtain state and configuration information, and 2) control

these components via methods for reconfiguration or fault management.
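
One way to picture these two requirements is a uniform interface that every component implements regardless of its type. The Python sketch below is hypothetical: the ManagedComponent class, its get_attributes and invoke methods, and the "reset" and "isolate" method names are illustrative assumptions, not an interface defined by this document.

```python
from abc import ABC, abstractmethod


class ManagedComponent(ABC):
    """Hypothetical uniform interface for any system component."""

    @abstractmethod
    def get_attributes(self) -> dict[str, str]:
        """Requirement 1: expose state and configuration information."""

    @abstractmethod
    def invoke(self, method: str) -> None:
        """Requirement 2: control the component by a named method."""


class Fan(ManagedComponent):
    """Example component; its attributes and methods are illustrative."""

    def __init__(self) -> None:
        self._state = "running"
        self._speed = "low"

    def get_attributes(self) -> dict[str, str]:
        return {"state": self._state, "speed": self._speed}

    def invoke(self, method: str) -> None:
        if method == "isolate":
            self._state = "isolated"
        elif method == "reset":
            self._state = "running"
            self._speed = "low"
        else:
            raise ValueError(f"unsupported method: {method}")
```

Because management software in this sketch depends only on ManagedComponent, a disk, a network interface, or a software process could be managed through the same two calls, each exposing its own attributes and supporting its own set of methods.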

There are several service availability functions both within the system (e.g., management

middleware) and external to the system (e.g., network management system) that need to access and

control the various system components, including many of the fault management functions

discussed in Section 6.0. These functions need to be able to:

• Manage faults

• Monitor performance