12.9 LSF-HPC Failover
The following sections discuss aspects of the LSF-HPC failover mechanism.
12.9.1 Overview of LSF-HPC Monitoring and Failover Support
The LSF-HPC failover mechanism on the HP XC system requires at least two nodes with the
resource_management role. One is selected as the primary LSF-HPC Execution Host and the
others are considered backup nodes.
_________________________ Note _________________________
There must be at least two nodes with the resource_management role to support
LSF-HPC failover.
The Nagios LSF-HPC failover module monitors the virtual IP associated with the primary
LSF-HPC Execution Host. When LSF-HPC failover is enabled on the HP XC system and the
primary LSF-HPC Execution Host fails, the Nagios LSF-HPC failover module detects that the
node is unresponsive and initiates failover:
• The Nagios module attempts to contact the node hosting the IP to ensure that LSF-HPC for
SLURM is shut down and that virtual IP hosting is disabled.
• A new primary LSF-HPC Execution Host is selected from the backup nodes.
• The Nagios module attempts to re-establish the virtual IP on the new node.
• LSF-HPC for SLURM is restarted on that host.
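After a failover completes, you can confirm which node now hosts the virtual environment.
The following sketch assumes the virtual LSF-HPC hostname is lsfhost.localdomain
(substitute your site's alias if it differs) and that controllsf show summarizes the
current primary, as described in the controllsf(1) manpage:

    # Resolve the virtual LSF-HPC alias to its current IP address
    # (lsfhost.localdomain is an assumed example alias).
    getent hosts lsfhost.localdomain

    # Ask LSF which host it considers the master; it should report the
    # virtual host, now served by the new primary node.
    lsid

    # Review the status reported by the failover tooling.
    controllsf show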
LSF-HPC monitoring and LSF-HPC failover are implemented on the HP XC system as a pair of
tools that prepare the environment for the LSF-HPC Execution Host daemons on a given node,
start the daemons, and then watch the node to ensure that it remains active.
After a standard installation, the HP XC system is initially configured so that:
• LSF-HPC is started on the head node.
• LSF-HPC failover is disabled.
• The Nagios utility reports whether LSF-HPC is up, down, or "currently shut down," but
takes no action in any case.
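You can verify this initial state from the head node. A minimal check, assuming the
show subcommand documented in the controllsf(1) manpage:

    # Report the LSF-HPC failover configuration; after a standard
    # installation this should show the head node as the primary
    # LSF-HPC Execution Host and failover disabled.
    controllsf show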
The only direct interaction between LSF-HPC and the LSF-HPC monitoring and failover
tools occurs at LSF-HPC startup, when the daemons are started in the virtual environment,
and at failover, when the existing daemons are shut down cleanly before the virtual
environment is moved to a new host.
You can enable or disable LSF-HPC failover at any time. See the controllsf(1)
manpage for more information.
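For example, assuming the enable and disable subcommands described in the controllsf(1)
manpage, failover could be toggled as follows:

    # Enable LSF-HPC failover so Nagios acts when the primary fails.
    controllsf enable failover

    # Disable it again; Nagios continues to monitor but takes no action.
    controllsf disable failover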
12.9.2 The Interplay of LSF-HPC and SLURM
LSF-HPC and SLURM are managed independently; neither is critically affected if the other
goes down. SLURM has no dependency on LSF-HPC.
LSF-HPC needs SLURM to schedule jobs. If SLURM becomes unresponsive, LSF-HPC
drops its CPU count to 1 and closes the HP XC virtual host. Once SLURM is available again,
LSF-HPC adjusts its CPU count accordingly and reopens the host.
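Both sides of this relationship can be observed with the stock SLURM and LSF commands;
a minimal sketch, with no HP XC-specific assumptions:

    # Check whether the SLURM controller daemon is responding.
    scontrol ping

    # Show the LSF-HPC virtual host status; while SLURM is down the
    # host is closed with a CPU count of 1, and it reopens with the
    # full count once SLURM recovers.
    bhosts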
12.9.3 Assigning the Resource Management Nodes
Nodes are assigned to both SLURM and LSF-HPC by assigning them the resource_management
role; this role includes both the lsf and slurm_controller services.
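To see which nodes currently provide these services, the SLURM configuration names the
controller and backup controller directly; the LSF-HPC side is assumed here to be
reported by controllsf show (see the controllsf(1) manpage):

    # The slurm_controller service maps to these SLURM settings.
    scontrol show config | grep -E 'ControlMachine|BackupController'

    # The lsf service: nodes eligible to host the LSF-HPC Execution Host.
    controllsf show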