Table 4 Disaster Scenarios and Their Handling (continued)
Disaster Scenario:
In this case, the package (P1) runs with RPO-TARGET set to 60 seconds. Initially, P1 is running on node N1. P1 uses a mirror md0 consisting of S1 (local to node N1, for example /dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure occurs when all FC links between the two data centers fail, causing N1 to lose access to S2 and N2 to lose access to S1. Immediately afterwards, a second failure occurs when node N1 goes down because of a power failure. After N1 is repaired and brought back into the cluster, package switching of P1 to N1 is enabled.
IMPORTANT: While it is not a good idea to enable package switching of P1 to N1, it is described here to show recovery from an operator error.
The FC links between the data centers are not repaired, and N2 becomes inaccessible because of a power failure.

What Happens When This Disaster Occurs:
When the first failure occurs, the package (P1) continues to run on N1 with md0 consisting of only S1. When the second failure occurs, the package fails over to N2 and starts with S2. When N2 fails, the package does not start on node N1, because a package is allowed to start only once with a single disk. You must repair this failure, and both disks must be synchronized and be part of the MD array, before another failure of the same pattern occurs. In this failure scenario, only S1 is available to P1 on N1, because the FC links between the data centers are not repaired. Because P1 has already started once with S2 on N2, it cannot start on N1 until both disks are available.

Recovery Process:
Complete the following steps to initiate a recovery:
1. Restore the FC links between the data centers. As a result, S2 (/dev/hpdev/mylink-sdf) becomes available to N1, and S1 (/dev/hpdev/mylink-sde) becomes accessible from N2.
2. To start the package P1 on N1, check the package log file in the package directory and run the commands that it shows to force a package start.
When the package starts up on N1, it automatically adds S2 back into the array and the re-mirroring process starts. When re-mirroring is complete, the extended distance cluster detects and accepts S1 as part of md0. You can watch the re-mirroring and confirm the result as shown in the sketch below.
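The following is a minimal verification sketch, not part of the documented procedure. It assumes the example names used in this scenario (md0, P1, /dev/hpdev/mylink-sde, /dev/hpdev/mylink-sdf); the actual force-start commands must still be taken from the package log as described in step 2.

# Watch the MD resync progress and mirror membership on N1
cat /proc/mdstat
mdadm --detail /dev/md0

# Confirm that the persistent device links for S1 and S2 are visible again
ls -l /dev/hpdev/mylink-sde /dev/hpdev/mylink-sdf

# Confirm the package and node states from Serviceguard
cmviewcl -v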
Disaster Scenario:
In this case, initially the package (P1) is running on node N1. P1 uses a mirror md0 consisting of S1 (local to node N1, for example /dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure occurs when all Ethernet links between the two data centers fail.

What Happens When This Disaster Occurs:
With this failure, the heartbeat exchange between N1 and N2 is lost. This results in both nodes trying to reach the Quorum Server. If N1 reaches the Quorum Server first, the package continues to run on N1 with S1 and S2, while N2 is rebooted. If N2 reaches the Quorum Server first, the package fails over to N2 and starts running with both S1 and S2, and N1 is rebooted.

Recovery Process:
Complete the following steps to initiate a recovery:
1. Restore only the Ethernet links between the data centers so that N1 and N2 can exchange heartbeats.
2. After restoring the links, add the node that was rebooted back into the cluster. Run the cmrunnode command to add the node to the cluster, as in the sketch below.
NOTE: If this failure is a precursor to a site failure, and the Quorum Server arbitration selects the site that is likely to fail, it is possible that the entire cluster will go down.
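A minimal sketch of the rejoin step. The node name N2 corresponds to the case where N1 won arbitration and N2 was rebooted; substitute whichever node was actually rebooted in your cluster.

# From a node that is still in the cluster, check the current cluster state
cmviewcl -v

# Rejoin the node that was rebooted (N2 in this example)
cmrunnode N2

# Confirm that both nodes are up and the package is running
cmviewcl -v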
Disaster Scenario:
In this case, initially the package (P1) is running on node N1. P1 uses a mirror md0 consisting of S1 (local to node N1, for example /dev/hpdev/mylink-sde) and S2 (local to node N2). The first failure occurs when the Ethernet links from N1 to the Ethernet switch in data center 1 fail.

What Happens When This Disaster Occurs:
With this failure, the heartbeat exchange between N1 and N2 is lost. N2 obtains the Quorum Server's arbitration, as it is the only node that still has access to the Quorum Server. The package fails over to N2 and starts running with both S1 and S2, while N1 is rebooted.

Recovery Process:
Complete the following procedure to initiate a recovery:
1. Restore the Ethernet links from N1 to the switch in data center 1.
2. After restoring the links, add the node that was rebooted (N1) back into the cluster. Run the cmrunnode command to add the node to the cluster. A verification sketch follows this scenario.
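A minimal sketch for verifying the restored links on N1 and rejoining it. The interface name eth1 is only an assumed example for the NIC that connects N1 to the data center 1 switch; use the interfaces configured as heartbeat networks in your cluster.

# On N1, confirm that the repaired interface has link again (eth1 is an assumed name)
ip link show eth1
ethtool eth1 | grep "Link detected"

# From N2, rejoin N1 and confirm the cluster state
cmrunnode N1
cmviewcl -v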