Providing Open Architecture High Availability Solutions

3.0 High Availability Concepts and Principles

The demand for increasingly capable hardware and software systems has grown dramatically over

the past two decades. Advanced, complex hardware and software systems have a significant

presence in our everyday lives. We often take for granted the tasks and services performed and

delivered by our automobiles, telephones, banking institutions, computers, and the Internet. Nearly

every industry relies on the availability of large, complex hardware and software systems to

perform important roles that range from enhancing productivity in the workplace, to billing

customers for services, to ensuring the safety of others, to providing the technology that advances

our relentless search for innovation. When things are running smoothly, we hardly

notice these advanced systems in our everyday activities. Yet, when these systems fail to perform

their expected functions, they get our immediate attention. A system failure may be merely an

inconvenience, but some failures result in loss of revenue and, at worst, loss of life.

As our dependency on complex hardware and software systems increases, so do the risk and

liability that naturally come with the potential for failure. The explosive growth of software

capabilities in recent years has generally eclipsed the industry’s ability to effectively design, test,

and deploy these complex systems to the levels of confidence that consumers are demanding. The

problem is further aggravated in software systems because they often also carry the additional

burden of responsibility for masking hardware failures.

This section explores the fundamental principles of engineering high availability (HA) systems and

explains why the design, development, and deployment of highly available systems are

such a challenge. It also briefly covers some modeling techniques that help build

a quantitative understanding of an otherwise subjective discussion, and closes with some

industry best practices that help mitigate risk and increase the likelihood of successfully

deploying HA systems.

3.1 High Availability and Service Availability

The term “high availability” is frequently used when referring to a system that is capable of

providing service most of the time. Availability is typically quantified as a percentage of uptime, described by its number of “9s”.

Table 1 shows the annual downtime and typical applications for various classes of systems.

Although Table 1 focuses on downtime, the customer or user typically focuses on uptime. This

means it is important not only that a service be up for all but N minutes a year, but also that the

length of outages be short enough, and the frequency of outages be low enough, that the end

customer does not perceive a problem. For most systems, the goal is few failures and rapid

recovery. This concept is termed “Service Availability”: making sure that whatever service

the user wants (or is paying for) is provided in a way that meets the user’s expectations. Although

this is somewhat hard to quantify, the following examples are indicative of what is needed:

Table 1. Classes of High Availability Systems

Number of 9s          Downtime per Year    Typical Application
------------------    -----------------    -------------------------
3 Nines (99.9%)       ~9 hours             Typical Desktop or Server
4 Nines (99.99%)      ~1 hour              Enterprise Server
5 Nines (99.999%)     ~5 minutes           Carrier Class Server
6 Nines (99.9999%)    ~31 seconds          Carrier Switch Equipment
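
The downtime figures in Table 1 can be approximated with a short calculation. The sketch below is illustrative only: it assumes a 365-day year and treats "N nines" as an unavailable fraction of 10^-N. The function names are hypothetical and do not come from any particular library.

```python
# Illustrative sketch only: the names and the 365-day year are assumptions.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def annual_downtime_seconds(nines: int) -> float:
    """Downtime budget per year, in seconds, for an availability of `nines` 9s."""
    unavailability = 10.0 ** (-nines)  # e.g. 3 nines -> 0.001 (0.1% of the year)
    return unavailability * SECONDS_PER_YEAR


def max_outage_seconds(nines: int, outages_per_year: int) -> float:
    """Average outage length that keeps the yearly total within the budget."""
    if outages_per_year <= 0:
        raise ValueError("outages_per_year must be positive")
    return annual_downtime_seconds(nines) / outages_per_year


if __name__ == "__main__":
    for nines in (3, 4, 5, 6):
        seconds = annual_downtime_seconds(nines)
        print(f"{nines} nines: {seconds / 60:.2f} minutes of downtime per year")
```

Under these assumptions, five nines works out to a little over five minutes of downtime per year, consistent with Table 1. The same budget could be spent as a single outage or as several shorter ones, which is why the frequency and length of outages matter as well as the yearly total.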