Building a Disaster-proof Data Center with HP Serviceguard for Linux, June 2007

Tip:

You can view the Disaster Proof video at www.hp.com/go/DisasterProof

Evaluating the need for disaster tolerance

Disaster tolerance is the ability to restore applications and data within a reasonable period of time

after a disaster. Fire, flood, and earthquake are the most common disasters, but a disaster can be any

event that unexpectedly interrupts service or corrupts data in an entire data center, such as a backhoe

that digs too deep and severs a network connection, or an act of sabotage. Disaster tolerant

architectures protect against unplanned downtime due to disasters by geographically distributing the

nodes in a cluster so that a disaster at one site does not disable the entire cluster. To evaluate your

need for a disaster tolerant solution, weigh:

• Risk of disaster. Areas prone to tornadoes, floods, or earthquakes might require a disaster recovery

solution. Some industries need to consider risks other than natural disasters or accidents, such as

terrorist activity or sabotage.

The type of disaster to which your business is prone, whether due to geographical location or the

nature of the business, determines the type of disaster recovery you choose. For example, if you live

in a region prone to massive earthquakes, you are not likely to put your alternate or backup nodes

in the same city as your primary nodes, because that type of disaster affects a large area.

The frequency of the disaster is also important in determining whether to invest in a rapid disaster

recovery solution. For example, you would be more likely to protect business-critical applications

and data from hurricanes that happen every season, rather than protecting them from a dormant

volcano.

• Vulnerability of the business. How long can your business afford to be down? Some parts of a

business might be able to endure 1 or 2 days for recovery, while others need to recover in minutes.

Some parts of a business need only local protection from single outages such as a node failure. Other

parts of a business might need both local protection and protection in case of site failure.

It is important to consider the role the data servers play in your business. For example, you might

target the assembly line production servers as most in need of quick recovery. But if the most likely

disaster in your area is an earthquake, it would render the assembly line inoperable, as well as the

computers. In this case, disaster recovery would be moot, and local failover is probably the more

appropriate level of protection.

However, you might have an order-processing center that is prone to floods in the winter. The

business loses thousands of dollars a minute while the order processing servers are down. A

disaster tolerant architecture is appropriate protection in this situation.

Deciding to implement a disaster recovery solution depends on the balance between risk of disaster

and the vulnerability of your business if a disaster occurs. The following sections give a high-level

view of a variety of disaster tolerant solutions and sketch the general guidelines that you should follow

in developing a disaster tolerant computing environment.
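
The balance between risk and vulnerability described above can be roughed out as simple arithmetic. The following Python sketch is illustrative only and is not part of Serviceguard: the names SiteRisk, expected_annual_loss, and worth_protecting are hypothetical, and the frequency, downtime, and cost figures are assumed values chosen to echo the order-processing example.

```python
from dataclasses import dataclass


@dataclass
class SiteRisk:
    """Hypothetical inputs for one part of the business (all values are assumptions)."""
    name: str
    disasters_per_year: float  # estimated frequency of a site-wide disaster
    recovery_minutes: float    # expected downtime per disaster without disaster tolerance
    loss_per_minute: float     # money lost per minute while the servers are down


def expected_annual_loss(risk: SiteRisk) -> float:
    """Expected yearly downtime cost = frequency x downtime x cost per minute."""
    return risk.disasters_per_year * risk.recovery_minutes * risk.loss_per_minute


def worth_protecting(risk: SiteRisk, annual_solution_cost: float) -> bool:
    """True when the expected yearly loss exceeds the yearly cost of the solution."""
    return expected_annual_loss(risk) > annual_solution_cost


# Assumed figures: a flood every other winter, 2 days to recover, 2000 dollars a minute.
order_processing = SiteRisk("order processing", 0.5, 2 * 24 * 60, 2000.0)
print(expected_annual_loss(order_processing))  # 2880000.0
```

A model this simple ignores cases such as the assembly line example, where the disaster disables the business function itself and faster server recovery would not reduce the loss. Treat the result as a starting point for discussion, not as a sizing rule.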

What is a disaster tolerant architecture?

In a Serviceguard cluster configuration, high availability is achieved by using redundant hardware to

eliminate single points of failure. This protects the cluster against hardware faults such as the node

failure in Figure 1.