White Papers

Dell HPC NFS Storage Solution High Availability (NSS-HA) Configurations with Dell PowerEdge 12th Generation Servers

• Private switch failure
• Fence device failure
• One SAS link failure
• Multiple SAS link failures

The NSS-HA behavior in response to each of these failures is outlined below.

Server response to a failure

The server response to each failure event within the HA cluster was recorded, and the time to recover
from the failure was used as the performance metric. Time was measured from the moment the fault was
injected on the server running the HA service (the active server) until the service had migrated to,
and was running on, the other server (the passive server).
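The recovery-time metric described above can be sketched as a small polling loop. This is an illustration only, not the harness used for the tests in this paper: the `inject_fault` and `service_running_on_passive` callables, the polling interval, and the timeout are all hypothetical placeholders that a tester would supply.

```python
import time
from typing import Callable


def measure_recovery_time(
    inject_fault: Callable[[], None],
    service_running_on_passive: Callable[[], bool],
    poll_interval: float = 1.0,
    timeout: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Return the seconds elapsed from fault injection until the HA service
    is reported as running on the passive server.

    All callables are assumptions supplied by the caller; `clock` and `sleep`
    are injectable so the loop can be exercised without real waiting.
    """
    start = clock()          # timing starts when the fault is injected
    inject_fault()
    while not service_running_on_passive():
        if clock() - start > timeout:
            raise TimeoutError("service did not fail over within the timeout")
        sleep(poll_interval)
    return clock() - start   # timing stops once the service runs on the passive server
```

The resolution of the measurement is bounded by `poll_interval`, so a one-second interval is adequate for failovers that take tens of seconds.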

• Server failure - simulated by introducing a kernel panic.

When the active server fails, the heartbeat between the two servers is interrupted. The passive
server waits for a defined period of time and then attempts to fence the active server. Once
fencing is successful, the passive server takes ownership of the cluster service. Clients cannot
access the data until the failover process is complete.

• Heartbeat link failure - simulated by disconnecting the private network link on the active server.

When the heartbeat link is removed from the active server, both servers detect the missing
heartbeat and attempt to fence each other. The active server cannot fence the passive server
because the missing link prevents it from communicating over the private network. The passive
server successfully fences the active server and takes ownership of the HA service.

• Public link failure - simulated by disconnecting the InfiniBand or 10 Gigabit Ethernet link on the
active server.

The HA service is configured to monitor this link. When the public network link is disconnected on
the active server, the cluster service stops on the active server and is relocated to the passive
server.

• Private switch failure - simulated by powering off the private network switch.

When the private switch fails, both servers detect the missing heartbeat from the other server and
attempt to fence each other. Fencing is unsuccessful because the private network is unavailable, so
the HA service continues to run on the active server.

• Fence device failure - simulated by disconnecting the iDRAC cable from the server.

If the iDRAC on a server fails, the server is fenced using the network PDUs, which are defined as
secondary fence devices in the cluster configuration files.

• One SAS link failure - simulated by disconnecting one SAS link between the Dell PowerEdge R620
server and the Dell PowerVault MD3200 storage.

In the case where only one SAS link fails, the cluster service is not interrupted. Because there are
multiple paths from the server to the storage, a single SAS link failure does not break the data path
from the clients to the storage and does not trigger a cluster service failover.
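The fencing outcomes described for the server, heartbeat link, private switch, and fence device failures can be summarized in a toy model. This is a simplified sketch and not the cluster suite's logic: the `Server` fields, `can_fence`, and `owner_after_heartbeat_loss` are hypothetical names, and the model assumes that fence requests travel over the private network and that the iDRAC is tried before the PDU.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    alive: bool = True            # False models a kernel panic
    private_link_up: bool = True  # this server's own private network link
    idrac_reachable: bool = True  # primary fence device for this server
    pdu_reachable: bool = True    # secondary fence device for this server


def can_fence(attacker: Server, target: Server, switch_up: bool) -> bool:
    """Return True if `attacker` can fence `target` (assumed model)."""
    # The attacker must be running and able to reach the private network.
    if not (attacker.alive and attacker.private_link_up and switch_up):
        return False
    # Primary fence device first, then the secondary one.
    return target.idrac_reachable or target.pdu_reachable


def owner_after_heartbeat_loss(active: Server, passive: Server,
                               switch_up: bool = True) -> str:
    """Return which server the HA service is assigned to after the heartbeat is lost."""
    passive_wins = can_fence(passive, active, switch_up)
    active_wins = can_fence(active, passive, switch_up)
    if passive_wins and active_wins:
        return "undetermined"     # a fence race; not modeled here
    # If the passive server cannot fence, no failover occurs.
    return passive.name if passive_wins else active.name


# Heartbeat link failure: only the passive server can fence.
print(owner_after_heartbeat_loss(Server("active", private_link_up=False),
                                 Server("passive")))
# Private switch failure: neither server can fence, so nothing moves.
print(owner_after_heartbeat_loss(Server("active"), Server("passive"),
                                 switch_up=False))
```

The public link and SAS link cases are left out on purpose, because they do not involve a lost heartbeat: the former relocates the service through link monitoring, and the latter is absorbed by the redundant storage paths.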

In the cases above, HA service failover was observed to take between 30 and 60

seconds. This reaction time is faster with this version of the cluster suite than with the previous