Table 4 Disaster Scenarios and Their Handling (continued)
Disaster Scenario:
In this case, the package (P1) runs with RPO-TARGET set to 60 seconds. Initially, P1 is running on node N1. P1 uses a mirror md0 consisting of S1 (local to node N1, for example /dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure occurs when all FC links between the two data centers fail, causing N1 to lose access to S2 and N2 to lose access to S1. Immediately afterwards, a second failure occurs when node N1 goes down because of a power failure. After N1 is repaired and brought back into the cluster, package switching of P1 to N1 is enabled.
IMPORTANT: While it is not a good idea to enable package switching of P1 to N1, it is described here to show recovery from an operator error.
The FC links between the data centers are not repaired, and N2 becomes inaccessible because of a power failure.

What Happens When This Disaster Occurs:
When the first failure occurs, the package (P1) continues to run on N1 with md0 consisting of only S1. When the second failure occurs, the package fails over to N2 and starts with S2. When N2 fails, the package does not start on node N1, because a package is allowed to start only once with a single disk. You must repair this failure, and both disks must be synchronized and be part of the MD array, before another failure of the same pattern occurs. In this failure scenario, only S1 is available to P1 on N1, because the FC links between the data centers are not repaired. Because P1 has already started once with S2 on N2, it cannot start on N1 until both disks are available.

Recovery Process:
Complete the following steps to initiate a recovery:
1. Restore the FC links between the data centers. As a result, S2 (/dev/hpdev/mylink-sdf) becomes available to N1, and S1 (/dev/hpdev/mylink-sde) becomes accessible from N2.
2. To start the package P1 on N1, check the package log file in the package directory and run the commands that it shows to force a package start.
When the package starts up on N1, it automatically adds S2 back into the array and the re-mirroring process starts. When re-mirroring is complete, the extended distance cluster detects and accepts S1 as part of md0. You can watch the re-mirroring and confirm the result as shown in the sketch below.
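The following is a minimal verification sketch, not part of the documented procedure. It assumes the example names used in this scenario (md0, P1, /dev/hpdev/mylink-sde, /dev/hpdev/mylink-sdf); the actual force-start commands must still be taken from the package log as described in step 2.

# Watch the MD resync progress and mirror membership on N1
cat /proc/mdstat
mdadm --detail /dev/md0

# Confirm that the persistent device links for S1 and S2 are visible again
ls -l /dev/hpdev/mylink-sde /dev/hpdev/mylink-sdf

# Confirm the package and node states from Serviceguard
cmviewcl -v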
Disaster Scenario:
In this case, initially the package (P1) is running on node N1. P1 uses a mirror md0 consisting of S1 (local to node N1, for example /dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure occurs when all Ethernet links between the two data centers fail.

What Happens When This Disaster Occurs:
With this failure, the heartbeat exchange between N1 and N2 is lost. This results in both nodes trying to reach the Quorum Server. If N1 reaches the Quorum Server first, the package continues to run on N1 with S1 and S2, while N2 is rebooted. If N2 reaches the Quorum Server first, the package fails over to N2 and starts running with both S1 and S2, and N1 is rebooted.

Recovery Process:
Complete the following steps to initiate a recovery:
1. Restore only the Ethernet links between the data centers so that N1 and N2 can exchange heartbeats.
2. After restoring the links, add the node that was rebooted back into the cluster. Run the cmrunnode command to add the node to the cluster, as in the sketch below.
NOTE: If this failure is a precursor to a site failure, and the Quorum Server arbitration selects the site that is likely to fail, it is possible that the entire cluster will go down.
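A minimal sketch of the rejoin step. The node name N2 corresponds to the case where N1 won arbitration and N2 was rebooted; substitute whichever node was actually rebooted in your cluster.

# From a node that is still in the cluster, check the current cluster state
cmviewcl -v

# Rejoin the node that was rebooted (N2 in this example)
cmrunnode N2

# Confirm that both nodes are up and the package is running
cmviewcl -v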
Disaster Scenario:
In this case, initially the package (P1) is running on node N1. P1 uses a mirror md0 consisting of S1 (local to node N1, for example /dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure occurs when the Ethernet links from N1 to the Ethernet switch in data center 1 fail.

What Happens When This Disaster Occurs:
With this failure, the heartbeat exchange between N1 and N2 is lost. N2 obtains the Quorum Server's arbitration, as it is the only node that still has access to the Quorum Server. The package fails over to N2 and starts running with both S1 and S2, while N1 is rebooted.

Recovery Process:
Complete the following procedure to initiate a recovery:
1. Restore the Ethernet links from N1 to the switch in data center 1.
2. After restoring the links, add the node that was rebooted (N1) back into the cluster. Run the cmrunnode command to add the node to the cluster. A verification sketch follows this scenario.
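A minimal sketch for verifying the restored links on N1 and rejoining it. The interface name eth1 is only an assumed example for the NIC that connects N1 to the data center 1 switch; use the interfaces configured as heartbeat networks in your cluster.

# On N1, confirm that the repaired interface has link again (eth1 is an assumed name)
ip link show eth1
ethtool eth1 | grep "Link detected"

# From N2, rejoin N1 and confirm the cluster state
cmrunnode N1
cmviewcl -v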