Building a Disaster-proof Data Center with HP Serviceguard for Linux, June 2007

Tip:
You can view the Disaster Proof video at www.hp.com/go/DisasterProof
Evaluating the need for disaster tolerance
Disaster tolerance is the ability to restore applications and data within a reasonable period of time
after a disaster. Fire, flood, and earthquake are the most common disasters, but a disaster can be any
event that unexpectedly interrupts service or corrupts data in an entire data center, such as a backhoe
that digs too deep and severs a network connection or an act of sabotage. Disaster tolerant
architectures protect against unplanned downtime due to disasters by geographically distributing the
nodes in a cluster so that a disaster at one site does not disable the entire cluster. To evaluate your
need for a disaster tolerant solution, weigh:
Risk of disaster. Areas prone to tornadoes, floods, or earthquakes might require a disaster recovery
solution. Some industries need to consider risks other than natural disasters or accidents, such as
terrorist activity or sabotage.
The type of disaster to which your business is prone, whether due to geographical location or the
nature of the business, determines the type of disaster recovery you choose. For example, if you live
in a region prone to massive earthquakes, you are not likely to put your alternate or backup nodes
in the same city as your primary nodes, because that type of disaster affects a large area.
The frequency of the disaster is also important in determining whether to invest in a rapid disaster
recovery solution. For example, you would be more likely to protect business-critical applications
and data from hurricanes that happen every season than from a dormant
volcano.
Vulnerability of the business. How long can your business afford to be down? Some parts of a
business might be able to endure 1 or 2 days for recovery, while others need to recover in minutes.
Some parts of a business only need local protection from single outages such as a node failure. Other
parts of a business might need both local protection and protection in case of site failure.
It is important to consider the role the data servers play in your business. For example, you might
target the assembly line production servers as most in need of quick recovery. But if the most likely
disaster in your area is an earthquake, it would render the assembly line inoperable, as well as the
computers. In this case, rapid recovery of the servers would be moot, and local failover is probably the more
appropriate level of protection.
However, you might have an order-processing center that is prone to floods in the winter. The
business loses thousands of dollars a minute while the order processing servers are down. A
disaster tolerant architecture is appropriate protection in this situation.
Deciding to implement a disaster recovery solution depends on the balance between risk of disaster
and the vulnerability of your business if a disaster occurs. The following sections give a high-level
view of a variety of disaster tolerant solutions and sketch the general guidelines that you should follow
in developing a disaster tolerant computing environment.
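
One rough way to picture the balance between risk and vulnerability described above is to estimate an expected annual loss: how often a disaster is likely to occur, how long the outage would last, and what each minute of downtime costs. The sketch below is illustrative only. The class and function names, the annual solution cost, and every figure in it are hypothetical assumptions, not values from this paper or from any HP sizing method.

```python
from dataclasses import dataclass


@dataclass
class SiteRisk:
    """Hypothetical inputs for one site; none of these figures come from the paper."""
    events_per_year: float   # how often the disaster is expected (risk)
    outage_minutes: float    # downtime per event without disaster tolerance
    loss_per_minute: float   # business loss while servers are down (vulnerability)


def expected_annual_loss(site: SiteRisk) -> float:
    """Expected yearly loss = frequency x outage duration x cost per minute."""
    return site.events_per_year * site.outage_minutes * site.loss_per_minute


def worth_protecting(site: SiteRisk, annual_solution_cost: float) -> bool:
    """True when the expected loss exceeds the assumed yearly cost of the solution."""
    return expected_annual_loss(site) > annual_solution_cost


# Assumed numbers: a flood-prone order-processing center versus a site
# whose only notable hazard is a dormant volcano. Both assume a two-day outage.
order_center = SiteRisk(events_per_year=0.5, outage_minutes=2880, loss_per_minute=2000)
volcano_site = SiteRisk(events_per_year=0.001, outage_minutes=2880, loss_per_minute=2000)

if __name__ == "__main__":
    assumed_cost = 500_000  # hypothetical yearly cost of a disaster tolerant solution
    for name, site in (("order center", order_center), ("volcano site", volcano_site)):
        loss = expected_annual_loss(site)
        print(f"{name}: expected loss {loss:,.0f} per year, "
              f"protect = {worth_protecting(site, assumed_cost)}")
```

With these assumed inputs, the frequent flood dominates the comparison even though both sites lose the same amount per minute, which mirrors the hurricane versus dormant volcano example. A real evaluation would also weigh factors this sketch ignores, such as data loss, recovery time objectives, and whether the business could operate at all after the event.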
What is a disaster tolerant architecture?
In a Serviceguard cluster configuration, high availability is achieved by using redundant hardware to
eliminate single points of failure. This protects the cluster against hardware faults such as the node
failure in Figure 1.