Providing Open Architecture High Availability Solutions

3.0 High Availability Concepts and Principles
The demand for increasingly capable hardware and software systems has grown dramatically over
the past two decades. Advanced, complex hardware and software systems have a significant
presence in our everyday lives. We often take for granted the tasks and services performed and
delivered by our automobiles, telephones, banking institutions, computers, and the Internet. Nearly
every industry relies on the availability of large, complex hardware and software systems to
perform important roles that range from enhancing productivity in the workplace, to billing
customers for services, to ensuring the safety of others, to providing the technology that advances
our relentless search for innovation. It seems that when things are running smoothly, we hardly
notice these advanced systems in our everyday activities. Yet, when these systems fail to perform
their expected functions, they get our immediate attention. Some system failures are merely an
inconvenience, but others result in loss of revenue and, at worst, loss of life.
As our dependency on complex hardware and software systems increases, so do the risk and
liability that naturally come with the potential for failure. The explosive growth of software
capabilities in recent years has generally outpaced the industry’s ability to design, test,
and deploy these complex systems to the levels of confidence that consumers are demanding. The
problem is further aggravated in software systems because they often carry the additional
burden of masking hardware failures.
This section explores the fundamental principles of engineering HA systems and explains why
the design, development, and deployment of highly available systems are such a challenge. It
also briefly covers some modeling techniques that help build a quantitative understanding of an
otherwise subjective discussion, and it closes with some industry best practices that help
mitigate the risk and increase the likelihood of success when deploying HA systems.
3.1 High Availability and Service Availability
The term “high availability” is frequently used when referring to a system that is capable of
providing service most of the time. This is typically quantified in terms of the number of “9s”
in the percentage of time the system is available. Table 1 shows the annual downtime and typical
applications for various classes of systems.
Although Table 1 focuses on downtime, the customer or user typically focuses on uptime. It is
therefore important not only that a service be down for no more than N minutes a year, but also
that outages be short enough, and infrequent enough, that the end customer does not perceive
them as a problem. For most systems, the goal is few failures and a rapid recovery time. This
concept is termed “Service Availability”: making sure that whatever service the user wants (or
is paying for) is provided in a way that meets the user’s expectations. Although this is
somewhat hard to quantify, the following examples are indicative of what is needed:
Table 1. Classes of High Availability Systems
Number of 9s          Downtime per Year   Typical Application
3 Nines (99.9%)       ~9 hours            Typical Desktop or Server
4 Nines (99.99%)      ~1 hour             Enterprise Server
5 Nines (99.999%)     ~5 minutes          Carrier Class Server
6 Nines (99.9999%)    ~31 seconds         Carrier Switch Equipment
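
The downtime figures in Table 1 follow directly from the availability percentage. The short Python sketch below shows the arithmetic. It is an illustration only: the function name annual_downtime_seconds is hypothetical, and a 365-day year is assumed.

```python
# Illustrative sketch: derive the annual downtime budget from the number of 9s.
# Assumption: a 365-day year (31,536,000 seconds).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def annual_downtime_seconds(nines: int) -> float:
    """Return the allowed downtime per year, in seconds, for N nines.

    An availability of N nines means the system may be unavailable for
    a fraction 10^-N of the year (for example, 3 nines = 99.9% = 0.001).
    """
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


# Print the budget for the classes listed in Table 1.
for n in range(3, 7):
    seconds = annual_downtime_seconds(n)
    print(f"{n} nines: {seconds:.1f} s/year ({seconds / 60:.2f} min/year)")
```

Under these assumptions the budgets come to roughly 8.8 hours, 53 minutes, 5.3 minutes, and 32 seconds per year, in line with the rounded values in Table 1. Note that this total says nothing about how the downtime is distributed: one long outage and many short ones can consume the same annual budget, which is why Service Availability also considers the length and frequency of outages.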