Providing Open Architecture High Availability Solutions

Management (CM) services. The operator uses the services of the CM to view the health of each
component, determine which component has failed, and identify the component (field replaceable
unit, FRU) so that it can be replaced. Once the component has been replaced, the CM service is
involved in re-establishing the system model (redundancies, etc.).
Health status may be reported to the manager by the individual components (in the form of event
notifications), or it may be ascertained by the manager via a query mechanism. The manager
may also use testing and diagnostic techniques to determine the health status of an individual
component (or a set of components). These techniques are discussed in Section 6.2 and
Section 6.5.
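The two reporting paths described above (event notification and query) can be sketched as follows. This is a minimal illustration only: the names Health, Component, Manager, subscribe, query_health, and poll are hypothetical and are not taken from any of the standards named in this document.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    FAILED = "failed"


class Component:
    """Hypothetical managed component that pushes events and answers queries."""

    def __init__(self, name):
        self.name = name
        self._health = Health.HEALTHY
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def set_health(self, health):
        # Push path: notify subscribers only when the status actually changes.
        if health is not self._health:
            self._health = health
            for callback in self._listeners:
                callback(self.name, health)

    def query_health(self):
        # Pull path: the manager asks for the current status.
        return self._health


class Manager:
    """Hypothetical manager that keeps the last known health of each component."""

    def __init__(self):
        self.components = {}
        self.last_known = {}

    def register(self, component):
        self.components[component.name] = component
        component.subscribe(self.on_event)
        self.last_known[component.name] = component.query_health()

    def on_event(self, name, health):
        # Called by a component when it reports a health change.
        self.last_known[name] = health

    def poll(self):
        # Query every component; this catches changes that produced no event.
        for name, component in self.components.items():
            self.last_known[name] = component.query_health()
        return dict(self.last_known)
```

In this sketch the query path acts as a safety net: a component that changes status without emitting an event is still picked up the next time the manager polls.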
As system components move from the healthy condition to the degraded condition (in which an
abnormality has been detected, but the component is still delivering an acceptable service level),
the system manager can anticipate a future system failure.
Several options are available to the system manager at this point, depending on the severity of the
degradation and the policies and procedures that are in place. Some of these options are:
Off-loading work from the degraded system component to a more healthy one
Notifying a system operator
Performing fault management recovery and repair measures (refer to Section 6.0)
Implicit in this set is the capability of system components to report (via event notifications or in
response to a query) their health status. Also implicit in this set is the capability of a system
manager to receive (via event notifications or queries) this information and process it accordingly.
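As an illustration of how severity and policy might select among the options listed above, consider the following sketch. The severity scale (0.0 to 1.0), the two thresholds, and the action names are assumptions made for this example; a real policy would come from the procedures in place for the system.

```python
def choose_actions(severity, healthier_peer_available,
                   notify_threshold=0.3, recover_threshold=0.7):
    """Pick responses to a degraded component.

    severity is a hypothetical score from 0.0 (barely degraded) to 1.0
    (about to fail); both thresholds are illustrative policy settings.
    """
    if not 0.0 <= severity <= 1.0:
        raise ValueError("severity must be between 0.0 and 1.0")

    actions = []
    if healthier_peer_available:
        # Off-load work from the degraded component to a healthier one.
        actions.append("offload_work")
    if severity >= notify_threshold:
        # Notify a system operator.
        actions.append("notify_operator")
    if severity >= recover_threshold:
        # Start fault management recovery and repair measures.
        actions.append("fault_management_recovery")
    return actions
```

Under these assumed thresholds, a mildly degraded component with a healthy peer only has its work off-loaded, while a severely degraded one triggers all three responses.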
Monitoring State Changes
Similar to monitoring health status, it is also important that the CM service monitors changes that
occur in the state of any component. The component states of active/standby as well as the current
operational state need to be recorded. Depending on the application, the CM service may also keep
track of what data object an application is processing, and how far through that processing the
application is. This type of data may also (or alternatively) be stored by an application that is in the
standby mode for the current application.
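A state record of the kind described above might look like the following sketch. The field names (role, operational, data_object, progress) and the StateTracker class are hypothetical; they only illustrate recording active/standby state, operational state, and processing progress so that a standby can pick up the work.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class StateRecord:
    role: str = "standby"              # "active" or "standby"
    operational: str = "enabled"       # current operational state
    data_object: Optional[str] = None  # object the application is processing
    progress: int = 0                  # how far through that processing it is


_FIELD_NAMES = {f.name for f in fields(StateRecord)}


class StateTracker:
    """Hypothetical CM-side store of component state and its change history."""

    def __init__(self):
        self.records = {}
        self.history = []

    def update(self, name, **changes):
        record = self.records.setdefault(name, StateRecord())
        for key, value in changes.items():
            if key not in _FIELD_NAMES:
                raise AttributeError(f"unknown state field: {key}")
            old = getattr(record, key)
            if old != value:
                # Record only real changes, with the old and new values.
                setattr(record, key, value)
                self.history.append((name, key, old, value))
        return record

    def resume_point(self, name):
        # What a standby would need in order to continue the work.
        record = self.records[name]
        return record.data_object, record.progress
```

The same record could equally be held by the standby application itself, as the text notes; the sketch places it in the CM service only for simplicity.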
Testing and Diagnostics
The CM service has the responsibility, through the system model, of understanding which
diagnostic tests can be run on a component. Additionally, it can determine which tests can be run
while the component is being used by the system and which must be run only when the
component is not being used. The CM service provides this information to applications and
management middleware, which can then run the diagnostics.
In many very high reliability systems, diagnostics are run on components that are not in service to
ensure that they will be ready if needed. Additionally, some systems may run diagnostics
periodically on components, while they are either in use or in standby, to check that they are
working properly.
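One way the system model could record which diagnostics are safe to run on an in-use component is sketched below. The component names, test names, and the in_service_safe flag are invented for illustration and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticTest:
    name: str
    in_service_safe: bool  # True if the test may run while the component is in use


# Hypothetical fragment of a system model: the tests known for each component.
SYSTEM_MODEL = {
    "disk0": [DiagnosticTest("smart_check", True),
              DiagnosticTest("surface_scan", False)],
    "nic1": [DiagnosticTest("link_status", True),
             DiagnosticTest("loopback", False)],
}


def runnable_tests(model, component, in_use):
    """Return the names of the tests that may be run on the component right now."""
    tests = model.get(component, [])
    # A component that is not in use may run every test; one that is in use
    # may run only the tests marked as safe in service.
    return [t.name for t in tests if t.in_service_safe or not in_use]
```

An application or management middleware would call runnable_tests before scheduling diagnostics, passing in_use=False for components that are out of service.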
5.4.5 Techniques
Industry Standard Methods. SNMP, RMON, CIM, TMN (OSI CMIP), IPMI, and CORBA are
several of the industry standards available for reporting information external to the system. These
methods may also be used for communicating information between the management function and
the system components. These terms are discussed further in Section 5.5.3.