LSF-HPC monitoring and failover are implemented on the HP XC system as tools that prepare the environment
for the LSF-HPC execution host daemons on a given node, start the daemons, then watch the node to ensure
that it remains active.
After a standard installation, the HP XC system is initially configured so that:
• LSF-HPC is started on the head node.
• LSF-HPC failover is disabled.
• The Nagios application reports whether LSF-HPC is up, down, or "currently shut down," but takes no
action in any case.
The only direct interaction between LSF-HPC and the LSF-HPC monitoring and failover tools occurs at LSF-HPC
startup, when the daemons are started in the virtual environment, and at failover, when the existing daemons
are shut down cleanly before the virtual environment is moved to a new host.
You have the option of enabling or disabling LSF-HPC failover at any time. For more information, see
controllsf(8).
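For example, you can check the current failover setting and then turn failover on or off with the controllsf command. The following is a sketch: the show subcommand is an assumption based on typical controllsf usage, and the enable and disable forms follow the wording in this section, so consult controllsf(8) for the exact syntax on your system:

    # Display the current LSF-HPC failover configuration (assumed subcommand)
    controllsf show

    # Enable LSF-HPC failover
    controllsf enable failover

    # Disable LSF-HPC failover
    controllsf disable failover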
Interplay of LSF-HPC and SLURM
LSF-HPC and SLURM are managed independently; one is not critically affected if the other goes down.
SLURM has no dependency on LSF-HPC.
LSF-HPC needs SLURM to schedule jobs. If SLURM becomes unresponsive, LSF-HPC drops its processor count
to 1 and closes the HP XC virtual host. When SLURM is available again, LSF-HPC adjusts its processor count
accordingly and reopens the host.
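If jobs stop being dispatched, you can confirm this behavior from the command line with the standard SLURM and LSF status commands. This is a sketch; the interpretation of the output assumes the usual sinfo and bhosts column layout:

    # Verify that the SLURM controller is responding and reports the expected node states
    sinfo

    # Check the HP XC virtual host in LSF-HPC; a closed status with MAX reduced
    # to 1 indicates that LSF-HPC has lost contact with SLURM
    bhosts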
Assigning the Resource Management Nodes
You assign nodes to both SLURM and LSF-HPC by assigning them the resource management role; this role
includes both the lsf and slurm_controller services.
By default, the HP XC resource management system attempts to place the SLURM controller and the LSF
execution host on the same node to constrain the use of system resources. If only one node has the resource
management role, the LSF-HPC execution daemons and the SLURM control daemon both run on that node.
If two nodes are assigned the resource management role, by default, the first node becomes the primary
resource management node, and the second node is the backup resource management node.
If more than two nodes are assigned the resource management role, the first becomes the primary resource
management host and the second becomes the backup SLURM host and the first LSF-HPC failover candidate.
Additional nodes with the resource management role can serve as LSF-HPC failover nodes if either or
both of the first two nodes are down.
Resource management candidate nodes are ordered in ASCII sort order by node name, after the head node,
which is taken first.
Example
In this example, nodes n3, n4, n15, and n16 are resource management nodes, and n16 is the head node.
The selection list is ordered as shown, and the nodes have the corresponding assignments:
1. Node n16 hosts the primary LSF-HPC and SLURM control daemons.
(The head node is taken first.)
2. Node n15 hosts the backup SLURM control daemon and serves as the first LSF-HPC failover candidate.
(The remaining nodes are ASCII-sorted.)
3. Node n3 is the second choice for LSF-HPC failover.
4. Node n4 is the third choice for LSF-HPC failover.
You can use the controllsf command to change these assignments.
controllsf disable headnode preferred
    Specifies that the head node should be ordered at the end of the selection list rather than at its head.

controllsf disable slurm affinity
    Specifies that HP XC should attempt to place the SLURM and LSF-HPC daemons on separate nodes.
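For example, to reverse both defaults, run the following commands (a sketch; see controllsf(8) for the complete syntax and additional options):

    # Order the head node at the end of the selection list
    controllsf disable headnode preferred

    # Place the SLURM control daemon and the LSF-HPC daemons on separate nodes
    controllsf disable slurm affinity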