White Papers

Table Of Contents

Dell HPC NFS Storage Solution - High Availability Configurations

Page 18

4) Private switch failure

5) Fence device failure

6) One SAS link failure

7) Multiple SAS link failures

This section describes the NSS-HA response to failures. Details on how to configure the solution to

handle these failure scenarios are provided in Appendix A: NSS-HA .

Server response to a failure

Server response was recorded in how the HA cluster responds to a failure event. Time to recover

from a failure was used as a performance metric. Time was measured from the point when the

fault was injected in the server running the HA service (active) till the service was migrated and

running on the other server (passive).

1) Server failure - simulated by introducing a kernel panic.

When the active server fails, the heartbeat between the two servers is interrupted. The passive

server waits for a defined period of time and then attempts to fence the active server. The

default timeout period before a server is declared as dead is 10 seconds. This parameter is

tunable. Once fencing is successful, the passive server takes ownership of the cluster service.

Figure 7 - Failover Procedure in Case of a Server Failure

Private Network

Clients

Active Server

Public Network

Passive Server

Failover

R710

Storage

Array

HA Service HA Service

RHEL 5.5 RHEL 5.5

RHCS

Fence deviceFence device

R710

Fencing

Figure 7 shows the failover procedure in this case. After the occurrence of a failure in the

active server, the RHCS agent running on the passive server detects the missing heartbeat.

(The process of detection may take a few seconds according to the set timeout value.) Once

the failure on the active server is detected, the passive server fences and reboots the active

server via a fence device before attempting to take ownership of the cluster service. This is to

ensure data integrity. At this point the HA service is migrated or failed over to the passive