LSF-HPC monitoring and failover are implemented on the HP XC system as tools that prepare the environment
for the LSF-HPC execution host daemons on a given node, start the daemons, then watch the node to ensure
that it remains active.
After a standard installation, the HP XC system is initially configured so that:
• LSF-HPC is started on the head node.
• LSF-HPC failover is disabled.
• The Nagios application reports whether LSF-HPC is up, down, or "currently shut down," but takes no
action in any case.
The only direct interaction between LSF-HPC and the LSF-HPC monitoring and failover tools occurs at LSF-HPC
startup, when the daemons are started in the virtual environment, and at failover, when the existing daemons
are shut down cleanly before the virtual environment is moved to a new host.
You have the option of enabling or disabling LSF-HPC failover at any time. For more information, see
controllsf(8).
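For example, you can check the current failover setting and then turn failover on or off with the controllsf command. The following is a sketch: the show subcommand is an assumption based on typical controllsf usage, and the enable and disable forms follow the wording in this section, so consult controllsf(8) for the exact syntax on your system:

    # Display the current LSF-HPC failover configuration (assumed subcommand)
    controllsf show

    # Enable LSF-HPC failover
    controllsf enable failover

    # Disable LSF-HPC failover
    controllsf disable failover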
Interplay of LSF-HPC and SLURM
LSF-HPC and SLURM are managed independently; one is not critically affected if the other goes down.
SLURM has no dependency on LSF-HPC.
LSF-HPC needs SLURM to schedule jobs. If SLURM becomes unresponsive, LSF-HPC drops its processor count
to 1 and closes the HP XC virtual host. When SLURM is available again, LSF-HPC adjusts its processor count
accordingly and reopens the host.
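If jobs stop being dispatched, you can confirm this behavior from the command line with the standard SLURM and LSF status commands. This is a sketch; the interpretation of the output assumes the usual sinfo and bhosts column layout:

    # Verify that the SLURM controller is responding and reports the expected node states
    sinfo

    # Check the HP XC virtual host in LSF-HPC; a closed status with MAX reduced
    # to 1 indicates that LSF-HPC has lost contact with SLURM
    bhosts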
Assigning the Resource Management Nodes
You assign nodes to both SLURM and LSF-HPC by assigning them the resource management role; this role
includes both the lsf and slurm_controller services.
By default, the HP XC resource management system attempts to place the SLURM controller and the LSF
execution host on the same node to constrain the use of system resources. If only one node has the resource
management role, the LSF-HPC execution daemons and the SLURM control daemon both run on that node.
If two nodes are assigned the resource management role, by default, the first node becomes the primary
resource management node, and the second node is the backup resource management node.
If more than two nodes are assigned the resource management role, the first becomes the primary resource
management host and the second becomes the backup SLURM host and the first LSF-HPC failover candidate.
Additional nodes with the resource management role can serve as LSF-HPC failover nodes if either or
both of the first two nodes are down.
Resource management candidate nodes are ordered in ASCII sort order by node name, after the head node,
which is taken first.
Example
In this example, nodes n3, n4, n15, and n16 are resource management nodes, and n16 is the head node.
The selection list is ordered as shown, and the nodes have the corresponding assignments:
1. Node n16 hosts the primary LSF-HPC and SLURM control daemons.
(The head node is taken first.)
2. Node n15 hosts the backup SLURM control daemon and serves as the first LSF-HPC failover candidate.
(The remaining nodes are ASCII-sorted.)
3. Node n3 is the second choice for LSF-HPC failover.
4. Node n4 is the third choice for LSF-HPC failover.
You can use the controllsf command to change these assignments.
controllsf disable headnode preferred
    Specifies that the head node should be ordered at the end of the selection list rather than at its head.

controllsf disable slurm affinity
    Specifies that HP XC should attempt to place the SLURM and LSF-HPC daemons on separate nodes.
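For example, to reverse both defaults, run the following commands (a sketch; see controllsf(8) for the complete syntax and additional options):

    # Order the head node at the end of the selection list
    controllsf disable headnode preferred

    # Place the SLURM control daemon and the LSF-HPC daemons on separate nodes
    controllsf disable slurm affinity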