
Table 4 Disaster Scenarios and Their Handling (continued)
Disaster Scenario
In this case, the package (P1) runs with RPO_TARGET set to 60 seconds. Initially, the package (P1) is running on node N1. P1 uses a mirror md0 consisting of S1 (local to node N1, for example /dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure occurs when all FC links between the two data centers fail, causing N1 to lose access to S2 and N2 to lose access to S1. Immediately afterwards, a second failure occurs where node N1 goes down because of a power failure. After N1 is repaired and brought back into the cluster, package switching of P1 to N1 is enabled.
IMPORTANT: While it is not a good idea to enable package switching of P1 to N1, it is described here to show recovery from an operator error.
The FC links between the data centers are still not repaired, and N2 becomes inaccessible because of a power failure.

What Happens When This Disaster Occurs
When the first failure occurs, the package (P1) continues to run on N1 with md0 consisting of only S1. When the second failure occurs, the package fails over to N2 and starts with S2. When N2 fails, the package does not start on node N1, because a package is allowed to start only once with a single disk. You must repair this failure, and both disks must be synchronized and be part of the MD array before another failure of the same pattern occurs. In this failure scenario, only S1 is available to P1 on N1, because the FC links between the data centers are not repaired. Because P1 already started once with S2 on N2, it cannot start on N1 until both disks are available.

Recovery Process
Complete the following steps to initiate a recovery:
1. Restore the FC links between the data centers. As a result, S2 (/dev/hpdev/mylink-sdf) becomes available to N1, and S1 (/dev/hpdev/mylink-sde) becomes accessible from N2.
2. To start the package P1 on N1, check the package log file in the package directory and run the commands listed there to force a package start (see the example that follows this scenario).
When the package starts up on N1, it automatically adds S2 back into the array and the re-mirroring process starts. When re-mirroring is complete, the extended distance cluster detects and accepts S1 as part of md0.
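The following is a minimal sketch of how the re-mirroring triggered in step 2 can be observed from N1. The package name (P1) and device name (md0) are taken from this scenario; the authoritative force-start commands are only those printed in the package log, so the commands below are limited to inspecting status and monitoring the resync.

    # Confirm the current state of the package before and after the forced start.
    cmviewcl -v -p P1

    # Watch the md0 resync progress while the mirror is being rebuilt.
    cat /proc/mdstat

    # Verify that both mirror halves are active once the resync completes.
    mdadm --detail /dev/md0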
Disaster Scenario
In this case, initially the package (P1) is running on node N1. P1 uses a mirror md0 consisting of S1 (local to node N1, for example /dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure occurs when all Ethernet links between the two data centers fail.

What Happens When This Disaster Occurs
With this failure, the heartbeat exchange between N1 and N2 is lost. As a result, both nodes try to reach the Quorum server. If N1 reaches the Quorum server first, the package continues to run on N1 with S1 and S2 while N2 is rebooted. If N2 reaches the Quorum server first, the package fails over to N2 and starts running with both S1 and S2, and N1 is rebooted.

Recovery Process
Complete the following steps to initiate a recovery:
1. Restore only the Ethernet links between the data centers so that N1 and N2 can exchange heartbeats.
2. After restoring the links, add the node that was rebooted back into the cluster. Run the cmrunnode command to add the node to the cluster (see the example that follows this scenario).
NOTE: If this failure is a precursor to a site failure, and the Quorum Service arbitration selects the site that is likely to fail, it is possible that the entire cluster will go down.
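As an illustration of step 2, the following is a minimal sketch of rejoining a rebooted node. It assumes N2 is the node that was rebooted; substitute the node that the Quorum server arbitration actually caused to reboot in your cluster.

    # Check which node is currently down.
    cmviewcl

    # Rejoin the rebooted node (N2 in this example) to the running cluster.
    cmrunnode N2

    # Verify that both nodes are up and that P1 is still running.
    cmviewcl -v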
Disaster Scenario
In this case, initially the package (P1) is running on node N1. P1 uses a mirror md0 consisting of S1 (local to node N1, for example /dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure occurs when the Ethernet links from N1 to the Ethernet switch in data center 1 fail.

What Happens When This Disaster Occurs
With this failure, the heartbeat exchange between N1 and N2 is lost. N2 reaches the Quorum server, because it is the only node that still has access to the Quorum server. The package fails over to N2 and starts running with both S1 and S2 while N1 is rebooted.

Recovery Process
Complete the following procedure to initiate a recovery:
1. Restore the Ethernet links from N1 to the switch in data center 1.
2. After restoring the links, add the node that was rebooted back into the cluster. Run the cmrunnode command to add the node to the cluster (see the example that follows this scenario).
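A minimal sketch of this recovery is shown below. The interface name eth1 is an assumption used only for illustration; check the link state of whichever interfaces on N1 carry the heartbeat to the data center 1 switch.

    # On N1, confirm that the restored link reports "Link detected: yes".
    ethtool eth1

    # Rejoin N1 to the cluster after the link is back.
    cmrunnode N1

    # Confirm that N1 is running in the cluster and that P1 continues to run on N2.
    cmviewcl -v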