Providing Open Architecture High Availability Solutions

The HA framework should provide a set of APIs for the HA-aware applications to interact with the
HA middleware. These APIs include:
• HA middleware registration
• HA event notification
• Notification to the HA middleware of application state changes
• Data replication (control and data)
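As a rough illustration of how these four API groups might fit together, the following is a minimal in-memory sketch. It is not the framework's actual API: the class `ToyHAMiddleware`, its method names, the state strings, and the application name `call-server` are all hypothetical, chosen only to mirror the list above.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class ToyHAMiddleware:
    """In-memory stand-in for HA middleware. Every name here is hypothetical."""

    def __init__(self) -> None:
        self.app_states: Dict[str, str] = {}
        self.handlers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)
        self.replica: Dict[str, Dict[str, bytes]] = defaultdict(dict)

    def _require(self, app_id: str) -> None:
        # All other calls are only valid after registration.
        if app_id not in self.app_states:
            raise KeyError(f"application {app_id!r} is not registered")

    def register(self, app_id: str) -> None:
        """HA middleware registration."""
        self.app_states[app_id] = "registered"

    def subscribe(self, app_id: str, handler: Callable[[str], None]) -> None:
        """HA event notification (middleware to application)."""
        self._require(app_id)
        self.handlers[app_id].append(handler)

    def report_state(self, app_id: str, state: str) -> None:
        """Notification to the middleware of an application state change."""
        self._require(app_id)
        self.app_states[app_id] = state

    def replicate(self, app_id: str, key: str, value: bytes) -> None:
        """Data replication: here just a copy kept in a local dictionary."""
        self._require(app_id)
        self.replica[app_id][key] = value

    def raise_event(self, event: str) -> None:
        """Deliver an HA event to every subscribed application."""
        for app_handlers in self.handlers.values():
            for handler in app_handlers:
                handler(event)


# Usage: one HA-aware application exercising each API group once.
mw = ToyHAMiddleware()
received: List[str] = []
mw.register("call-server")
mw.subscribe("call-server", received.append)
mw.report_state("call-server", "active")
mw.replicate("call-server", "session-42", b"call state")
mw.raise_event("switchover")
```

A real middleware would replicate to a standby node and deliver events across process boundaries; the sketch keeps everything in one process so that the shape of the interaction is visible.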
4.4 Availability Requirements
A system’s availability requirements are normally stated in terms of the availability of the service it
provides to the users of the system. Since many of the target applications for the Open HA
Framework are the systems required for basic communication infrastructure services, many of
these requirements are quite demanding. It is simply not adequate to have the service down while a
system is repaired. Nor is it acceptable to have the service down while planned maintenance or
upgrades are performed on a system.
In many cases market forces set the level of acceptable availability. However, in some regulated
areas such as the telephone network, there are some legislated standards imposed with
consequential penalties for failure to meet the specified service levels. For example, a widely
adopted standard for the telephone network has been 5-nines, that is, service availability 99.999
percent of the time. This means that the system can be down for at most about 5 minutes
(roughly 5.26) a year.
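The 5-minute figure follows directly from the availability percentage. A minimal sketch of the arithmetic, assuming an average year of 365.25 days (the function name is illustrative only):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def max_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget, in minutes, for an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


five_nines = max_downtime_minutes(0.99999)
print(f"{five_nines:.2f} minutes per year")  # prints "5.26 minutes per year"
```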
Generally, if a complete system is composed of a number of critical subsystems, then each of the
subsystems must achieve a higher level of availability than the target for the complete system,
because the failure of any one of them takes the service down. Similarly, if a subsystem can be
brought down by a number of different causes, then the total amount of downtime must be spread
between these different causes. These causes include:
• Hardware failure
• Operating system failure
• Application failure
• Application load and congestion
• Operator error
• Environmental problems (power failure, fire, earthquake, etc.)
• Planned downtime for system upgrade or change
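Both points, the tighter target for each subsystem and the spreading of downtime across causes, can be sketched numerically. The example below assumes that the critical subsystems fail independently, that all of them are needed for service (so their availabilities multiply), and that they share the target equally. The cause names and weights are made up for illustration and are not taken from any real allocation.

```python
from typing import Dict


def required_subsystem_availability(system_availability: float,
                                    n_subsystems: int) -> float:
    """Availability each of n equal, independent, critical subsystems needs.

    If all n subsystems are required, system availability is the product
    a ** n, so each subsystem needs a = system_availability ** (1 / n).
    """
    if n_subsystems < 1:
        raise ValueError("need at least one subsystem")
    return system_availability ** (1.0 / n_subsystems)


def split_downtime_budget(total_minutes: float,
                          weights: Dict[str, float]) -> Dict[str, float]:
    """Divide a yearly downtime budget between causes in proportion to weights."""
    total_weight = sum(weights.values())
    return {cause: total_minutes * w / total_weight
            for cause, w in weights.items()}


# Five critical subsystems behind a 5-nines service target.
per_subsystem = required_subsystem_availability(0.99999, 5)
print(f"each subsystem needs about {per_subsystem:.7f}")

# Hypothetical weighting of a 5-minute yearly budget between causes.
budget = split_downtime_budget(5.0, {
    "hardware_os": 1.0,
    "application": 2.0,
    "operator_error": 1.0,
    "planned_change": 1.0,
})
print(budget)
```

With five subsystems, each one ends up needing close to six nines, which is why the per-subsystem requirement is so much stricter than the headline figure.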
4.4.1 Recovery Times
The allocation of available downtime between these kinds of causes is clearly an application- and
implementation-specific task. However, some of the consequences of this allocation usually are:
• The allocated time for hardware and operating system failure is generally a small portion of
  the 5 minutes per year, and is often 1 minute or less in a 5-nines system.
• The total recovery time (covering failure detection, reconfiguration, and restart) available
  for each failure depends on how often a failure occurs in a year, but can be as low as 1 to 5
  seconds. The frequency of outages is often recorded and used as a measure of system
  unavailability.
• The HA design of the system's components must consider (and usually provide for) a rolling
  change and upgrade strategy.
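The link between failure frequency and the per-failure recovery window can be made concrete with a short calculation. The sketch below takes the 1-minute yearly allocation for hardware and operating system failures mentioned above. The failure counts of 12 and 60 per year are assumptions chosen only to show how the 1 to 5 second range can arise.

```python
def recovery_seconds_per_failure(budget_minutes_per_year: float,
                                 failures_per_year: int) -> float:
    """Average time available to detect, reconfigure, and restart per failure."""
    if failures_per_year <= 0:
        raise ValueError("failures_per_year must be positive")
    return budget_minutes_per_year * 60.0 / failures_per_year


# A 1-minute yearly budget with an assumed 12 failures a year.
print(recovery_seconds_per_failure(1.0, 12))  # prints 5.0
# The same budget with an assumed 60 failures a year.
print(recovery_seconds_per_failure(1.0, 60))  # prints 1.0
```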