12.9 LSF-HPC Failover
The following sections discuss aspects of the LSF-HPC failover mechanism.
12.9.1 Overview of LSF-HPC Monitoring and Failover Support
The LSF-HPC failover mechanism on the HP XC system requires at least two nodes with the
resource_management role. One is selected as the primary LSF-HPC Execution Host and the
others are considered backup nodes.
_________________________ Note _________________________
There must be at least two nodes with the resource_management role to support
LSF-HPC failover.
The Nagios LSF-HPC failover module monitors the virtual IP associated with the primary
LSF-HPC Execution Host. When LSF-HPC failover is enabled on the HP XC system and the
primary LSF-HPC Execution Host fails, the Nagios LSF-HPC failover module detects that the
node is unresponsive and initiates failover:
• The Nagios module attempts to contact the node hosting the IP to ensure that LSF-HPC for
SLURM is shut down and that virtual IP hosting is disabled.
• A new primary LSF-HPC Execution Host is selected from the backup nodes.
• The Nagios module attempts to re-establish the virtual IP on the new node.
• LSF-HPC for SLURM is restarted on that host.
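After a failover completes, you can confirm which node now hosts the virtual environment.
The following sketch assumes the virtual LSF-HPC hostname is lsfhost.localdomain
(substitute your site's alias if it differs) and that controllsf show summarizes the
current primary, as described in the controllsf(1) manpage:

    # Resolve the virtual LSF-HPC alias to its current IP address
    # (lsfhost.localdomain is an assumed example alias).
    getent hosts lsfhost.localdomain

    # Ask LSF which host it considers the master; it should report the
    # virtual host, now served by the new primary node.
    lsid

    # Review the status reported by the failover tooling.
    controllsf show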
LSF-HPC monitoring and LSF-HPC failover are implemented on the HP XC system as a pair of
tools that prepare the environment for the LSF-HPC Execution Host daemons on a given node,
start the daemons, and then watch the node to ensure that it remains active.
After a standard installation, the HP XC system is initially configured so that:
• LSF-HPC is started on the head node.
• LSF-HPC failover is disabled.
• The Nagios utility reports whether LSF-HPC is up, down, or "currently shut down," but
takes no action in any case.
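You can verify this initial state from the head node. A minimal check, assuming the
show subcommand documented in the controllsf(1) manpage:

    # Report the LSF-HPC failover configuration; after a standard
    # installation this should show the head node as the primary
    # LSF-HPC Execution Host and failover disabled.
    controllsf show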
The only direct interaction between LSF-HPC and the LSF-HPC monitoring and failover
tools occurs at LSF-HPC startup, when the daemons are started in the virtual environment,
and at failover, when the existing daemons are shut down cleanly before the virtual
environment is moved to a new host.
You can enable or disable LSF-HPC failover at any time. See the controllsf(1)
manpage for more information.
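For example, assuming the enable and disable subcommands described in the controllsf(1)
manpage, failover could be toggled as follows:

    # Enable LSF-HPC failover so Nagios acts when the primary fails.
    controllsf enable failover

    # Disable it again; Nagios continues to monitor but takes no action.
    controllsf disable failover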
12.9.2 The Interplay of LSF-HPC and SLURM
LSF-HPC and SLURM are managed independently; neither is critically affected if the other
goes down. SLURM has no dependency on LSF-HPC.
LSF-HPC needs SLURM to schedule jobs. If SLURM becomes unresponsive, LSF-HPC
drops its CPU count to 1 and closes the HP XC virtual host. Once SLURM is available again,
LSF-HPC adjusts its CPU count accordingly and reopens the host.
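Both sides of this relationship can be observed with the stock SLURM and LSF commands;
a minimal sketch, with no HP XC-specific assumptions:

    # Check whether the SLURM controller daemon is responding.
    scontrol ping

    # Show the LSF-HPC virtual host status; while SLURM is down the
    # host is closed with a CPU count of 1, and it reopens with the
    # full count once SLURM recovers.
    bhosts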
12.9.3 Assigning the Resource Management Nodes
Nodes are assigned to both SLURM and LSF-HPC by assigning them the resource_management
role; this role includes both the lsf and slurm_controller services.
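To see which nodes currently provide these services, the SLURM configuration names the
controller and backup controller directly; the LSF-HPC side is assumed here to be
reported by controllsf show (see the controllsf(1) manpage):

    # The slurm_controller service maps to these SLURM settings.
    scontrol show config | grep -E 'ControlMachine|BackupController'

    # The lsf service: nodes eligible to host the LSF-HPC Execution Host.
    controllsf show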