Providing Open Architecture High Availability Solutions

The HA framework should provide a set of APIs for the HA-aware applications to interact with the

HA middleware. These APIs include:

• HA middleware registration

• HA event notification

• Notification to the HA middleware of application state changes

• Data replication (control and data)
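
The four API groups listed above could be sketched as a single interface. This is only an illustrative sketch: the class names, method names, and signatures (HAMiddleware, register, subscribe, report_state, replicate, and the toy InMemoryMiddleware) are hypothetical and are not taken from the framework or from any real HA product.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class HAMiddleware(ABC):
    """Hypothetical interface; every name here is an illustrative assumption."""

    @abstractmethod
    def register(self, app_name: str) -> None:
        """HA middleware registration."""

    @abstractmethod
    def subscribe(self, app_name: str, handler: Callable[[str], None]) -> None:
        """HA event notification: the middleware calls handler(event)."""

    @abstractmethod
    def report_state(self, app_name: str, state: str) -> None:
        """Notify the middleware of an application state change."""

    @abstractmethod
    def replicate(self, app_name: str, key: str, value: bytes) -> None:
        """Data replication toward a standby."""


class InMemoryMiddleware(HAMiddleware):
    """Toy single-process stand-in, present only to make the sketch runnable."""

    def __init__(self) -> None:
        self.states: Dict[str, str] = {}
        self.handlers: Dict[str, List[Callable[[str], None]]] = {}
        self.replica: Dict[str, Dict[str, bytes]] = {}

    def register(self, app_name: str) -> None:
        self.states[app_name] = "registered"
        self.handlers[app_name] = []
        self.replica[app_name] = {}

    def subscribe(self, app_name: str, handler: Callable[[str], None]) -> None:
        self.handlers[app_name].append(handler)

    def report_state(self, app_name: str, state: str) -> None:
        self.states[app_name] = state

    def replicate(self, app_name: str, key: str, value: bytes) -> None:
        self.replica[app_name][key] = value

    def notify(self, app_name: str, event: str) -> None:
        # Deliver an HA event (for example a failover) to the application.
        for handler in self.handlers[app_name]:
            handler(event)
```

An HA-aware application would register once at startup, subscribe a handler for events such as failover, and then call report_state and replicate as its state changes.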

4.4 Availability Requirements

A system’s availability requirements are normally stated in terms of the availability of the service it

provides to the users of the system. Since many of the target applications for the Open HA

Framework are the systems required for basic communication infrastructure services, many of

these requirements are quite demanding. It is simply not adequate to have the service down while a

system is repaired. Nor is it acceptable to have the service down while planned maintenance or

upgrades are performed on a system.

In many cases market forces set the level of acceptable availability. However, in some regulated

areas such as the telephone network, there are some legislated standards imposed with

consequential penalties for failure to meet the specified service levels. For example, a widely

adopted standard for the telephone network has been 5-nines, or service availability 99.999

percent of the time. This means that the system can only be down for a maximum of about 5

minutes a year.
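
The 5-minute figure follows directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365.25-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, assuming a 365.25-day year


def downtime_minutes_per_year(availability: float) -> float:
    """Maximum yearly downtime permitted by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
    minutes = downtime_minutes_per_year(availability)
    print(f"{nines}-nines: {minutes:.2f} minutes of downtime per year")
```

For 5-nines this gives roughly 5.26 minutes per year, and each additional nine divides the allowance by ten.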

Generally, if a complete system is composed of a number of critical subsystems, then each of the

subsystems must actually achieve a much higher level of availability to meet the complete system

availability. Similarly, if a subsystem can be downed by a number of different causes, then the total

amount of downtime must be spread between these different causes. These causes include:

• Hardware failure

• Operating system failure

• Application failure

• Application load and congestion

• Operator error

• Environmental problems (power failure, fire, earthquake, etc.)

• Planned downtime for system upgrade or change
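
Both points above can be illustrated numerically. The sketch below assumes that all subsystems are critical (in series), fail independently, and have equal availability, so that the system availability is the product of the subsystem availabilities. The cause weights in the usage example are invented for illustration only and do not come from this document.

```python
from typing import Dict

MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes a 365.25-day year


def required_subsystem_availability(system_availability: float, n_subsystems: int) -> float:
    """Availability each of n independent, critical subsystems must reach.

    In series, A_system = a ** n, so a = A_system ** (1 / n).
    """
    return system_availability ** (1.0 / n_subsystems)


def split_budget(total_minutes: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Divide a downtime budget between causes in proportion to their weights."""
    total_weight = sum(weights.values())
    return {cause: total_minutes * w / total_weight for cause, w in weights.items()}


if __name__ == "__main__":
    per_subsystem = required_subsystem_availability(0.99999, 5)
    print(f"each of 5 subsystems needs availability {per_subsystem:.7f}")

    budget = (1.0 - 0.99999) * MINUTES_PER_YEAR
    # Illustrative weights only; a real allocation is application specific.
    weights = {"hardware": 1.0, "operating system": 1.0, "application": 2.0,
               "operator error": 1.0, "planned downtime": 1.0}
    for cause, minutes in split_budget(budget, weights).items():
        print(f"{cause}: {minutes:.2f} minutes per year")
```

Under these assumptions, five subsystems that together must reach 5-nines each get only about one fifth of the 5-minute yearly allowance.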

4.4.1 Recovery Times

The allocation of available downtime between these kinds of causes is clearly an application- and

implementation-specific task. However, some of the consequences of this allocation usually are:

• The allocated time for hardware and operating system failure is generally a small portion of

the 5 minutes per year, and is often 1 minute or less in a 5-nines system.

• The total recovery time (from failure through detection, reconfiguration, and restart) available for each

failure depends on how often a failure occurs in a year, but can be as low as 1 to 5 seconds. The

frequency of outages is often recorded and used as a measure of system unavailability.

• The HA design of the system’s components must consider (and usually provide for) a rolling

change and upgrade strategy.
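
The relationship between the yearly allocation and the recovery time per failure can be sketched as a simple division. The failure frequencies used below are assumptions chosen for illustration, not figures from this document.

```python
def recovery_seconds_per_failure(budget_minutes_per_year: float,
                                 failures_per_year: float) -> float:
    """Recovery time available per failure if the yearly budget is shared evenly."""
    if failures_per_year <= 0:
        raise ValueError("failures_per_year must be positive")
    return budget_minutes_per_year * 60.0 / failures_per_year


if __name__ == "__main__":
    # Assumed failure frequencies against a 1-minute yearly allocation.
    for failures in (12, 60):
        seconds = recovery_seconds_per_failure(1.0, failures)
        print(f"{failures} failures per year: {seconds:.1f} seconds each")
```

With a 1-minute allocation, 12 failures a year leave 5 seconds per failure and 60 failures a year leave 1 second, which matches the 1 to 5 second range mentioned above.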