server by the Red Hat Service (resource group) Manager, rgmanager. Clients cannot access the
data until the failover process is complete.
When the active server boots up, it rejoins the cluster and the HA service remains running on
the passive server.
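After any failover event, the current owner of the HA service can be verified from either server with the clustat utility that ships with rgmanager, and the service can be relocated by hand with clusvcadm once the repaired server has rejoined the cluster. A brief sketch; the service and member names below are hypothetical placeholders, not the actual names from the NSS-HA recipe:

    # Show cluster membership and the current owner of the HA service
    clustat

    # Manually relocate the HA service back to the repaired server
    # (service and member names are hypothetical)
    clusvcadm -r HA_service -m nss-server-1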
2) Heartbeat link failure - simulated by disconnecting the private network link on the active
server.
When the heartbeat link is removed from the active server, both servers detect the missing
heartbeat and attempt to fence each other. The active server is unable to fence the passive
server since the missing link prevents it from reaching the fence devices over the private
network. The passive server, whose link is intact, successfully fences the active server and
takes ownership of the HA service.
When the active server boots up, it attempts to start the cluster and fence the passive node,
but fencing again fails because the heartbeat link is still down. The active server therefore
believes that the passive server is offline. Since fencing was unsuccessful, the HA service is not
started on the active server, and the passive server continues to provide the file system to the
clients.
When the heartbeat link is reconnected on the active server, the passive server shuts down the
cluster daemons on the active server, since the active server attempted to join the cluster
without a clean restart. At this point no cluster daemons are running on the active server, and
it is no longer part of the cluster.
After the active server is manually power cycled, it rejoins the cluster. The passive server
continues to own the cluster service and provide the file system to the clients.
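The fencing and rejoin behavior described above can also be exercised by hand when validating the set-up. A minimal sketch, assuming the stock RHEL 5.x cluster init scripts and a hypothetical peer node name:

    # Report quorum state and cluster membership as seen from this node
    cman_tool status
    cman_tool nodes

    # Manually invoke the fence devices defined in cluster.conf against
    # the peer node (node name is hypothetical)
    fence_node nss-server-2

    # After the power cycle, the cluster daemons start in this order
    # and the node rejoins the cluster
    service cman start
    service rgmanager start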
3) Public link failure - simulated by disconnecting the InfiniBand or 10 Gigabit Ethernet link on the
active server.
The HA service is configured to monitor this link. When the public network link is disconnected
on the active server, the cluster service stops on the active server and is relocated to the
passive server. Detecting the failed link takes about 30 seconds in the InfiniBand case and
20 seconds in the 10 Gigabit Ethernet case. Note that until the public link is repaired on
the active server, it will not be able to own and start the cluster service.
After the public link on the active server is repaired, the cluster service continues to run on the
passive server with no interruption in service to the clients.
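One way to implement such link monitoring in rgmanager is a script resource whose status operation fails when the interface loses its carrier; the actual mechanism used in this solution is part of the recipe in Appendix A.11. A minimal sketch of such a check, with a hypothetical script name and interface:

    #!/bin/sh
    # check_public_link.sh -- hypothetical public link monitor.
    # rgmanager periodically calls a script resource with "status";
    # a non-zero exit marks the resource as failed and triggers failover.
    IFACE=ib0    # ib0 for InfiniBand; e.g. eth2 for 10 Gigabit Ethernet

    case "$1" in
        start|stop)
            exit 0
            ;;
        status)
            # Carrier is 1 while the link is physically up
            [ "$(cat /sys/class/net/$IFACE/carrier 2>/dev/null)" = "1" ]
            exit $?
            ;;
        *)
            exit 1
            ;;
    esac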
4) Private switch failure - simulated by powering off the private network switch.
When the private switch fails, both servers detect the missing heartbeat from the other server
and attempt to fence each other. Fencing is unsuccessful from both sides since the private
network is unavailable, and the HA service continues to run on the active server.
When the switch is functional again, the servers kill each other's cluster daemons, since each
server attempted to rejoin the cluster without a clean restart. At this point the HA service is
still functional and continues to run on the active server. This is not a good state, since the
cluster management daemons are dead and restarting them does not succeed. The HA service
can be stopped using debug tools, which stops client access to the file system.
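One such debug tool is rg_test, which ships with rgmanager and can operate on a service's resources directly from cluster.conf, even when no cluster daemons are running. A sketch, with a hypothetical service name:

    # Stop the resources of the HA service straight from the configuration
    # file; no running cluster daemons are required
    # (service name is hypothetical)
    rg_test test /etc/cluster/cluster.conf stop service HA_service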