◦ Reducing human intervention also reduces human error. Disasters do not happen
often, so lack of practice and the stress of the situation can increase the potential
for human error.
◦ Automated recovery procedures and processes can be transparent to the clients.
Even if recovery is automated, you may choose to, or need to, recover from some types
of disasters manually. A rolling disaster, which is a disaster that occurs before the
cluster has recovered from a previous disaster, is an example of when you may want to
switch over manually. For example, if the data link failed, and the data center then
failed while the link was coming back up and resynchronizing data, you would want
human intervention to make a judgment call on which site had the most current and
consistent data before failing over, as illustrated in the example that follows.
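The following sketch shows the kind of manual checks an operator might perform in such
a situation on an Extended Distance Cluster, which mirrors data between sites with
Linux MD software RAID. The node name node1, the package name pkg1, and the MD device
/dev/md0 are placeholders; substitute the values from your own cluster and package
configuration.

    # Inspect the MD mirror state on each surviving node to judge which
    # site holds the most current, consistent copy of the data:
    cat /proc/mdstat
    mdadm --detail /dev/md0

    # Review cluster and package status before taking any action:
    cmviewcl -v

    # Start the package manually on the node judged to have good data
    # (node1 and pkg1 are placeholder names):
    cmrunpkg -n node1 pkg1

    # Re-enable automatic failover (package switching) afterwards:
    cmmodifypkg -e pkg1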
• Who manages the nodes in the cluster and how are they trained?
Putting a disaster recovery architecture in place without planning for the people aspects is a
waste of money. Training and documentation are more complex because the cluster is in
multiple data centers.
Each data center often has its own operations staff with their own processes and ways of
working. These operations people will now be required to communicate with each other,
coordinate maintenance and failover rehearsals, and work together to recover from
an actual disaster. If the remote nodes are placed in a “lights-out” data center, the operations
staff may want to put additional processes or monitoring software in place to maintain the
nodes in the remote location.
Rehearsals of failover scenarios are important for staying prepared. A written plan
should outline what to do in each disaster scenario and set a rehearsal schedule of
at least once every 6 months, ideally once every 3 months.
• How is the cluster maintained?
Planned downtime and maintenance, such as backups or upgrades, must be thought out
more carefully because they may leave the cluster vulnerable to another failure. For
example, nodes need to be brought down for maintenance in pairs, one node at each site,
so that quorum calculations do not prevent automated recovery if a disaster occurs
during planned maintenance (see the example following this list).
Rapid detection of failures and rapid repair of hardware are essential so that the
cluster is not vulnerable to additional failures.
Testing is more complex and requires personnel in each of the data centers. Site failure testing
should be added to the current cluster testing plans.
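As a sketch of the paired-maintenance approach described above, the following
Serviceguard commands halt one node at each site so that cluster membership stays
balanced across the data centers. The node names node1 (site A) and node3 (site B)
are placeholders for nodes in your own cluster.

    # Halt one node at each site; -f fails any running packages over to
    # an adoptive node before the node leaves the cluster:
    cmhaltnode -f node1
    cmhaltnode -f node3

    # Verify the remaining cluster membership and package placement:
    cmviewcl -v

    # After maintenance is complete, return both nodes to the cluster:
    cmrunnode node1
    cmrunnode node3

A site failure test can follow a similar pattern: halt all the nodes at one site with
cmhaltnode -f, use cmviewcl to verify that the packages fail over to the surviving
site, and then restart the halted nodes with cmrunnode.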
1.5 Additional Disaster Recovery Solutions Information
Online versions of the HA documents are available at:
http://www.hp.com/go/linux-serviceguard-docs.
For information on Metrocluster Solutions for Linux for EVA, XP, and 3PAR, see the following
documents available at http://www.hp.com/go/linux-serviceguard-docs:
• Understanding and Designing Serviceguard Disaster Recovery Architectures
• Building Disaster Recovery Serviceguard Solutions Using Metrocluster with Continuous Access
EVA P6000 for Linux B.01.00.00
• Building Disaster Recovery Serviceguard Solutions Using Metrocluster with Continuous Access
XP P9000 for Linux B.01.00.00
• Building Disaster Recovery Serviceguard Solutions Using Metrocluster with 3PAR Remote Copy
for Linux B.01.00.00