Providing Open Architecture High Availability Solutions
• Verifying Required System Component Population (System Model). This technique
determines what components should be present in the system to provide a particular service. It
works from a defined system model, detecting which components in the system model are
present and whether those components are functional. Interdependencies among the components are
also tracked and analyzed.
• Obtaining Detailed Information about System Components (FRU). This technique allows
the system to obtain specification information about each component. Commonly referred to
as field replaceable unit (FRU) information, it includes such items as class, product,
manufacturer, and revision. FRU information of system components is necessary for the repair
step in fault management. The capability to obtain FRU information can reduce the MTTR by
allowing a service technician to correctly identify replacement components before physically
inspecting the component. FRU information can also be used during hot-swap events to
determine whether the component is needed in the system and whether the appropriate supporting
components are available (for example, a device coupled with its device driver). Hot-swapped
components can be left isolated while additional components are discovered.
• Establishing System Configuration. In establishing the system configuration, the system
components are assembled into fault domains, which are necessary to provide the appropriate
level of redundancy. Once all of the system components have been detected and identified,
system configuration can define the relationships between components that provide redundancy.
• Data Collection. Information is collected and consolidated within the system model in order
to monitor the health and state information of each component in the system. This information
is available both locally and remotely.
• Role Assignment. When redundant components are grouped together to provide a service,
the group is assigned a redundancy policy such as N+1 or 2N. To support a specific policy, components
are assigned operational roles including active, standby, and spare. In the event of a component
failure, these roles may be reassigned to maintain service. For example, if an active component
fails, the standby component may be reassigned the role of active.
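The role reassignment described in the last item above can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed implementation: the names Role, Component, ServiceGroup, and handle_failure are hypothetical, the FAILED marker is an assumption (the text names only active, standby, and spare), and backfilling the standby role from a spare is one possible policy choice.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    SPARE = "spare"
    FAILED = "failed"  # assumption: not a role from the text, marks a failed component


@dataclass
class Component:
    name: str
    role: Role
    healthy: bool = True


@dataclass
class ServiceGroup:
    """Hypothetical group of redundant components providing one service."""
    components: list = field(default_factory=list)

    def by_role(self, role):
        """Return the healthy components currently holding the given role."""
        return [c for c in self.components if c.healthy and c.role is role]

    def handle_failure(self, name):
        """Mark a component as failed and reassign roles to maintain service."""
        failed = next(c for c in self.components if c.name == name)
        previous_role = failed.role
        failed.healthy = False
        failed.role = Role.FAILED

        # If the active component failed, promote the first healthy standby.
        if previous_role is Role.ACTIVE:
            standbys = self.by_role(Role.STANDBY)
            if standbys:
                standbys[0].role = Role.ACTIVE

        # Assumed policy: if no standby remains, backfill the role from a spare.
        if not self.by_role(Role.STANDBY):
            spares = self.by_role(Role.SPARE)
            if spares:
                spares[0].role = Role.STANDBY


group = ServiceGroup([
    Component("board-a", Role.ACTIVE),
    Component("board-b", Role.STANDBY),
    Component("board-c", Role.SPARE),
])
group.handle_failure("board-a")
for c in group.components:
    print(c.name, c.role.value)
```

In this sketch, the failure of the active component promotes the standby to active and moves the spare into the standby role, so the group keeps both an active and a standby member.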
5.4 Interfaces to System Components
5.4.1 Introduction
Health status and state information from each system component is communicated to other system
components within a system. This is referred to as intra-system communication. This information
may also be reported externally to the system. When reported externally, it may be reported in a
raw form, indicating the health status and state of each component, or it may be summarized to
indicate the aggregate health of the system as a whole.
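As an illustration of the two external reporting forms, the following Python sketch produces both a raw report and a summarized report from the same component data. The Health levels, the function names, and the rule that the aggregate health equals the worst component health are all assumptions made for this example; a real system would likely take redundancy into account when summarizing.

```python
from enum import IntEnum


class Health(IntEnum):
    # Assumed severity scale; the text does not define specific health levels.
    OK = 0
    DEGRADED = 1
    FAILED = 2


def raw_report(components):
    """Raw form: the health status and state of each component."""
    return {
        name: {"health": health.name, "state": state}
        for name, (health, state) in components.items()
    }


def summary_report(components):
    """Summarized form: one aggregate health value for the whole system.

    Assumption: the aggregate is the worst health of any single component.
    """
    worst = max((health for health, _ in components.values()), default=Health.OK)
    return {"system_health": worst.name}


# Hypothetical component data: name -> (health, state)
components = {
    "cpu-board-0": (Health.OK, "active"),
    "cpu-board-1": (Health.DEGRADED, "standby"),
    "fan-tray-0": (Health.OK, "active"),
}
print(raw_report(components))
print(summary_report(components))
```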
System components have states and need to be managed and controlled in a coordinated fashion.
While these components may vary widely in type and function, an HA system must be able to 1)
access the components’ attributes to obtain state and configuration information, and 2) control
these components via methods for reconfiguration or fault management.
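The two requirements above, attribute access and control methods, can be pictured as a uniform interface that every component implements regardless of its type. The Python sketch below is a minimal illustration only; ManagedComponent, Fan, NetworkLink, and the action names are hypothetical and do not correspond to any interface defined in this document.

```python
from abc import ABC, abstractmethod


class ManagedComponent(ABC):
    """Hypothetical uniform interface to a system component."""

    @abstractmethod
    def get_attributes(self) -> dict:
        """1) Return state and configuration information."""

    @abstractmethod
    def control(self, action: str) -> None:
        """2) Invoke a method for reconfiguration or fault management."""


class Fan(ManagedComponent):
    def __init__(self, rpm=3000):
        self.rpm = rpm
        self.state = "enabled"

    def get_attributes(self):
        return {"type": "fan", "state": self.state, "rpm": self.rpm}

    def control(self, action):
        if action == "disable":
            self.state, self.rpm = "disabled", 0
        else:
            raise ValueError(f"unsupported action: {action}")


class NetworkLink(ManagedComponent):
    def __init__(self, speed_mbps=100):
        self.speed_mbps = speed_mbps
        self.state = "enabled"

    def get_attributes(self):
        return {"type": "link", "state": self.state, "speed_mbps": self.speed_mbps}

    def control(self, action):
        if action == "disable":
            self.state = "disabled"
        else:
            raise ValueError(f"unsupported action: {action}")


def isolate(components):
    """Disable every component through the common interface and report states."""
    for component in components:
        component.control("disable")
    return [component.get_attributes()["state"] for component in components]


print(isolate([Fan(), NetworkLink()]))
```

The point of the sketch is that the caller of isolate needs no knowledge of component type: components that differ widely in function are read and controlled through the same two operations.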
Several service availability functions, both within the system (e.g., management
middleware) and external to the system (e.g., a network management system), need to access and
control the various system components, including many of the fault management functions
discussed in Section 6.0. These functions need to be able to:
• Manage faults
• Monitor performance