White Papers
Dell HPC NFS Storage Solution - High Availability (NSS5.5-HA) Configuration with Dell PowerVault
MD3460 and MD3060e Storage Arrays
Failure type: RAID controller failure on Dell PowerVault MD3260 storage array
Mechanism to handle failure: Dual controllers in the Dell PowerVault MD3260. The second controller handles all data requests. Performance may be degraded, but functionality is not impacted.
4.3.2. HA tests for NSS-HA
Functionality was verified for an NFSv3-based solution. The following failures were simulated on the
cluster, taking into account the failures and faults listed in Table 7:
Server failure
Heartbeat link failure
Public link failure
Private switch failure
Fence device failure
Single SAS link failure
Multiple SAS link failures
The NSS-HA behaviors in response to these failures are outlined below.
Server failure — simulated by introducing a kernel panic.
When the active server fails, the heartbeat between the two servers is interrupted. The passive
server waits for a defined period of time and then attempts to fence the active server. Once
fencing is successful, the passive server takes ownership of the cluster service. Clients cannot
access the data until the failover process is completed.
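The server-failure sequence above (missed heartbeat, wait, fence, take ownership) can be sketched as a small decision function. This is an illustrative model only, not the cluster software used by NSS-HA: the class name PassiveNode, the fence_peer callable, and the 10-second timeout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PassiveNode:
    """Hypothetical model of the passive server's failover decision."""
    heartbeat_timeout_s: float = 10.0  # assumed value, not taken from the paper
    owns_service: bool = False

    def on_tick(self, seconds_since_heartbeat: float,
                fence_peer: Callable[[], bool]) -> str:
        """Decide the next action from the time since the last heartbeat.

        fence_peer is a callable that returns True once the peer is fenced.
        """
        if self.owns_service:
            return "serving"
        if seconds_since_heartbeat < self.heartbeat_timeout_s:
            return "waiting"       # peer is still considered alive
        if not fence_peer():
            return "fence_failed"  # no takeover without a successful fence
        self.owns_service = True   # failover completes here
        return "took_over"
```

The key design point mirrored from the text is the ordering: the service changes hands only after fencing succeeds, which is why clients see an outage for the duration of the failover.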
Heartbeat link failure — simulated by disconnecting the private network link on the active server.
When the heartbeat link is removed from the active server, both servers detect the missing
heartbeat and attempt to fence each other. The active server is unable to fence the passive server
since the missing link prevents it from communicating over the private network. The passive server
successfully fences the active server and takes ownership of the HA service.
Public link failure — simulated by disconnecting the InfiniBand or 10 Gigabit Ethernet link on the
active server.
The HA service is configured to monitor this link. When the public network link is disconnected on
the active server, the cluster service stops on the active server and is relocated to the passive
server.
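The link-monitoring behavior described above can be illustrated with a minimal placement function. The function name place_service and the dictionary of link states are hypothetical, and returning None when no server has a working public link is an assumption; the text does not describe that case.

```python
from typing import Dict, Optional


def place_service(owner: str, public_link_up: Dict[str, bool]) -> Optional[str]:
    """Return the server that should run the HA service.

    owner is the current owner; public_link_up maps each server name to
    the state of its public (InfiniBand or 10GbE) link.
    """
    if public_link_up.get(owner, False):
        return owner              # monitored link is healthy: no change
    for server, link_up in public_link_up.items():
        if server != owner and link_up:
            return server         # stop on owner, relocate to healthy peer
    return None                   # assumption: service stays stopped
```

For example, with the active server's public link down and the passive server's link up, the function returns the passive server, matching the relocation described in the text.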
Private switch failure — simulated by powering off the private network switch.
When the private switch fails, both servers detect the missing heartbeat from the other server and
attempt to fence each other. Fencing is unsuccessful because the network is unavailable, and the
HA service continues to run on the active server.
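The heartbeat link failure and private switch failure cases differ only in which server can still reach its peer's fence device. The sketch below models that reasoning under the assumption, taken from the descriptions above, that fence requests travel over the private network. The function name and the "undetermined" result are hypothetical.

```python
def fencing_outcome(active_private_up: bool, passive_private_up: bool,
                    switch_up: bool = True) -> str:
    """Return which server owns the HA service after a missed heartbeat.

    A server can fence its peer only if its own private link and the
    private switch are both up (assumption based on the text).
    """
    active_can_fence = active_private_up and switch_up
    passive_can_fence = passive_private_up and switch_up
    if passive_can_fence and not active_can_fence:
        return "passive"       # passive fences active and takes the service
    if not passive_can_fence:
        return "active"        # nothing can displace the active server
    return "undetermined"      # both can fence: not covered by these tests
```

Calling fencing_outcome(False, True) models the heartbeat link failure and returns "passive"; calling fencing_outcome(True, True, switch_up=False) models the private switch failure and returns "active".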