LSF-HPC with SLURM monitoring and failover are implemented on the HP XC system as tools that prepare
the environment for the LSF execution host daemons on a given node, start the daemons, then watch the
node to ensure that it remains active.
After a standard installation, the HP XC system is initially configured so that:
• LSF-HPC with SLURM is started on the head node.
• LSF-HPC with SLURM failover is disabled.
• The Nagios application reports whether LSF-HPC with SLURM is up, down, or "currently shut down,"
but takes no action in any case.
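You can confirm this initial state from the head node with standard SLURM and LSF commands. The following check is a sketch only; the state names mentioned in the comments (for example, an LSF host state of ok) are typical values and may differ on your system.

    # Verify that the SLURM control daemon responds:
    scontrol ping
    # Summarize SLURM partition and node states:
    sinfo
    # Identify the LSF cluster and its master (execution) host:
    lsid
    # Show the state of the LSF execution host; after a standard
    # installation it is normally open (status ok):
    bhosts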
The only direct interaction between LSF-HPC with SLURM and the LSF monitoring and failover tools
occurs at LSF-HPC with SLURM startup, when the daemons are started in the virtual environment, and
at failover, when the existing daemons are shut down cleanly before the virtual environment is moved to
a new host.
15.13.2 Interplay of LSF-HPC with SLURM and SLURM
The LSF-HPC with SLURM product and SLURM are managed independently; one is not critically affected
if the other goes down.
SLURM has no dependency on LSF-HPC with SLURM.
The LSF-HPC with SLURM product needs SLURM to schedule jobs. If SLURM becomes unresponsive,
LSF-HPC with SLURM drops its processor count to 1 and closes the HP XC virtual host. When SLURM is
available again, LSF-HPC with SLURM adjusts its processor count accordingly and reopens the host.
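An administrator can observe this behavior with the same commands. The following sketch is illustrative only; the exact status strings reported depend on your SLURM and LSF-HPC versions.

    # If the SLURM controller cannot be reached, scontrol reports the
    # primary (and any backup) control daemon as down:
    scontrol ping
    # While SLURM is unresponsive, the HP XC virtual host is closed and
    # its job slot count drops to 1; when SLURM returns, the host reopens:
    bhosts -w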
15.13.3 Assigning the Resource Management Nodes
You assign nodes to both SLURM and the LSF-HPC with SLURM product by assigning them the resource
management role; this role includes both the lsf and slurm_controller services.
By default, the HP XC resource management system attempts to place the SLURM controller and the LSF
execution host on the same node to constrain the use of system resources. If only one node has the resource
management role, the LSF-HPC with SLURM execution daemons and the SLURM control daemon both
run on that node.
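To see which node currently hosts the SLURM controller, and therefore (when only one node has the resource management role) also the LSF-HPC with SLURM execution daemons, you can query both subsystems. This is a sketch; the ControlMachine and BackupController names correspond to the SLURM configuration keywords of this release, and the grep pattern is illustrative only.

    # Display the primary and, if configured, backup SLURM controller:
    scontrol show config | grep -i -e controlmachine -e backupcontroller
    # Display the LSF master host, which under LSF-HPC with SLURM is the
    # virtual execution host:
    lsid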
If two nodes are assigned the resource management role, by default, the first node becomes the primary
resource management node, and the second node is the backup resource management node.
If more than two nodes are assigned the resource management role, the first becomes the primary resource
management host, and the second becomes both the backup SLURM host and the first LSF-HPC with SLURM
failover candidate. Additional nodes with the resource management role can serve as LSF-HPC with
SLURM failover nodes if either or both of the first two nodes are down.
Resource management candidate nodes are selected in order: the head node is taken first, and the remaining
nodes follow in ASCII sort order by node name.
Example
In this example, nodes n3, n4, n15, and n16 are resource management nodes, and n16 is the head node.
The selection list is ordered as shown, and the nodes have the corresponding assignments:
1. Node n16 hosts the primary LSF-HPC with SLURM and SLURM control daemons.
(The head node is taken first.)
2. Node n15 hosts the backup SLURM control daemon and serves as the first LSF-HPC with SLURM
failover candidate.
(The remaining nodes are ASCII-sorted.)
3. n3 becomes the second choice for LSF-HPC with SLURM failover.
4. n4 becomes the third choice for LSF-HPC with SLURM failover.
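The selection order in this example can be reproduced mechanically. The following sketch is illustrative only and uses the node names from the example above; it places the head node first and ASCII-sorts the remaining resource management nodes, which is why n15 precedes n3 and n4.

    # Reproduce the selection order: head node first, then ASCII sort
    # of the remaining resource management nodes.
    HEAD=n16
    NODES="n3 n4 n15 n16"
    echo "$HEAD"
    for node in $NODES; do
        [ "$node" != "$HEAD" ] && echo "$node"
    done | LC_ALL=C sort
    # Prints n16, then n15, n3, n4 (one per line)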