server by the Red Hat Service (resource group) Manager, rgmanager. Clients cannot access the
data until the failover process is complete.
When the active server boots up, it rejoins the cluster and the HA service remains running on
the passive server.
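After any failover event, the current owner of the HA service can be verified from either server with the clustat utility that ships with rgmanager, and the service can be relocated by hand with clusvcadm once the repaired server has rejoined the cluster. A brief sketch; the service and member names below are hypothetical placeholders, not the actual names from the NSS-HA recipe:

    # Show cluster membership and the current owner of the HA service
    clustat

    # Manually relocate the HA service back to the repaired server
    # (service and member names are hypothetical)
    clusvcadm -r HA_service -m nss-server-1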
2) Heartbeat link failure - simulated by disconnecting the private network link on the active
server.
When the heartbeat link is removed from the active server, both servers detect the missing
heartbeat and attempt to fence each other. The active server is unable to fence the passive
server since the missing link prevents it from reaching the fence devices over the private
network. The passive server, whose link is intact, successfully fences the active server and
takes ownership of the HA service.
When the active server boots up, it attempts to start the cluster and fence the passive node,
but fencing again fails because the heartbeat link is still down. The active server therefore
believes that the passive server is offline. Since fencing was unsuccessful, the HA service is not
started on the active server, and the passive server continues to provide the file system to the
clients.
When the heartbeat link is reconnected on the active server, the passive server shuts down the
cluster daemons on the active server, since the active server attempted to join the cluster
without a clean restart. At this point no cluster daemons are running on the active server, and
it is no longer part of the cluster.
After the active server is manually power cycled, it rejoins the cluster. The passive server
continues to own the cluster service and provide the file system to the clients.
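The fencing and rejoin behavior described above can also be exercised by hand when validating the set-up. A minimal sketch, assuming the stock RHEL 5.x cluster init scripts and a hypothetical peer node name:

    # Report quorum state and cluster membership as seen from this node
    cman_tool status
    cman_tool nodes

    # Manually invoke the fence devices defined in cluster.conf against
    # the peer node (node name is hypothetical)
    fence_node nss-server-2

    # After the power cycle, the cluster daemons start in this order
    # and the node rejoins the cluster
    service cman start
    service rgmanager start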
3) Public link failure - simulated by disconnecting the InfiniBand or 10 Gigabit Ethernet link on the
active server.
The HA service is configured to monitor this link. When the public network link is disconnected
on the active server, the cluster service stops on the active server and is relocated to the
passive server. Detecting the failed link takes about 30 seconds in the InfiniBand case and
20 seconds in the 10 Gigabit Ethernet case. Note that until the public link is repaired on
the active server, it will not be able to own and start the cluster service.
After the public link on the active server is repaired, the cluster service continues to run on the
passive server with no interruption in service to the clients.
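One way to implement such link monitoring in rgmanager is a script resource whose status operation fails when the interface loses its carrier; the actual mechanism used in this solution is part of the recipe in Appendix A.11. A minimal sketch of such a check, with a hypothetical script name and interface:

    #!/bin/sh
    # check_public_link.sh -- hypothetical public link monitor.
    # rgmanager periodically calls a script resource with "status";
    # a non-zero exit marks the resource as failed and triggers failover.
    IFACE=ib0    # ib0 for InfiniBand; e.g. eth2 for 10 Gigabit Ethernet

    case "$1" in
        start|stop)
            exit 0
            ;;
        status)
            # Carrier is 1 while the link is physically up
            [ "$(cat /sys/class/net/$IFACE/carrier 2>/dev/null)" = "1" ]
            exit $?
            ;;
        *)
            exit 1
            ;;
    esac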
4) Private switch failure - simulated by powering off the private network switch.
When the private switch fails, both servers detect the missing heartbeat from the other server
and attempt to fence each other. Fencing is unsuccessful from both sides since the private
network is unavailable, and the HA service continues to run on the active server.
When the switch is functional again, the servers kill each other's cluster daemons, since each
server attempted to rejoin the cluster without a clean restart. At this point the HA service is
still functional and continues to run on the active server. This is not a good state, since the
cluster management daemons are dead and restarting them does not succeed. The HA service
can be stopped using debug tools, which stops client access to the file system.
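One such debug tool is rg_test, which ships with rgmanager and can operate on a service's resources directly from cluster.conf, even when no cluster daemons are running. A sketch, with a hypothetical service name:

    # Stop the resources of the HA service straight from the configuration
    # file; no running cluster daemons are required
    # (service name is hypothetical)
    rg_test test /etc/cluster/cluster.conf stop service HA_service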